- ceph-users - lists.ceph.io

Re: using non client.admin user for ceph-iscsi gateways

by Jason Dillaman

On Fri, Sep 6, 2019 at 12:00 PM Wesley Dillingham <wdillingham(a)godaddy.com> wrote:
>
> the iscsi-gateway.cfg seemingly allows for an alternative cephx user other than client.admin to be used, however the comments in the documentations says specifically to use client.admin.

Hmm, can you point out where this is in the docs? Originally, tcmu-runner didn't support the ability to change the user id, but that has been available for about a year now [1].

> Other than having the cfg file point to the appropriate key/user with "gateway_keyring" and giving that client read caps on the mons and full access to the pool configured to be used for iscsi are any other particular steps / settings / actions needed?

Just use "profile rbd" for your caps to keep it simple.

> It seems prudent to not use client.admin but I don't want to have unstable behavior or untested setup.
>
> Thanks.
>
> Respectfully,
>
> Wes Dillingham
> wdillingham(a)godaddy.com
> Site Reliability Engineer IV - Platform Storage / Ceph
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io

[1] https://github.com/open-iscsi/tcmu-runner/commit/c85ccdcfb7f4b17926eda1df89…

-- 
Jason

4 years, 8 months


using non client.admin user for ceph-iscsi gateways

by Wesley Dillingham

The iscsi-gateway.cfg seemingly allows a cephx user other than client.admin to be used; however, the comments in the documentation say specifically to use client.admin.

Other than having the cfg file point to the appropriate key/user with "gateway_keyring", and giving that client read caps on the mons and full access to the pool configured to be used for iscsi, are any other particular steps / settings / actions needed?

It seems prudent not to use client.admin, but I don't want unstable behavior or an untested setup.

Thanks.

Respectfully,

Wes Dillingham
wdillingham(a)godaddy.com
Site Reliability Engineer IV - Platform Storage / Ceph

4 years, 8 months


Followup: weird behaviour with ceph osd pool create and the "crush-rule" parameter (suddenly changes behaviour)

by aoanla＠gmail.com

So, whilst debugging the behaviour in the first thread I created, I needed to create and then destroy pools (to avoid running out of placement groups).

So, I did something like:

    ceph osd pool create ec2pool 2048 2048 erasure glasgow-eci-test ec2pool 0
    ceph osd pool create ec3pool 2048 2048 erasure glasgow-eci-test2 ec3pool 0

(for two different types of ecpool) and then removed them with

    ceph osd pool rm ec2pool ec2pool --yes-i-really-really-mean-it
    ceph osd pool rm ec3pool ec3pool --yes-i-really-really-mean-it

Now, however, something seems to have broken, as if I attempt:

    ceph osd pool create ec4pool 2048 2048 erasure glasgow-eci-test3 ec4pool 0

it fails with

    Error ENOENT: specified rule ec4pool doesn't exist

(which, of course, it does not, as the whole point of the syntax is that ceph should build the crush rule for me and name it appropriately; and this worked for all the previous times).

ceph health returns HEALTH OK still.

Any suggestions? I've googled around a bit on this, but I can't seem to find anyone discussing it...

Sam

4 years, 8 months


RGW bucket check --check-objects -fix failed

by EDH - Manuel Rios Fernandez

Hi,

We're at 14.2.2.

We just found a broken bucket index and are trying to repair it with the usual commands:

    # radosgw-admin bucket check --check-objects --fix

This finishes instantly, but the bucket should hold around 60-70TB of data.

    [root@CEPH-MON01 home]# radosgw-admin bucket check --check-objects --bucket BUCKETNAME --debug_rgw=10
    2019-09-05 20:08:13.908 7fdc8db37580 10 cannot find current period zonegroup using local zonegroup
    2019-09-05 20:08:13.910 7fdc8db37580 10 Cannot find current period zone using local zone
    2019-09-05 20:08:13.933 7fdc8db37580 2 all 8 watchers are set, enabling cache
    2019-09-05 20:08:13.942 7fdc56ffd700 2 RGWDataChangesLog::ChangesRenewThread: start
    2019-09-05 20:08:13.948 7fdc8db37580 10 cache get: name=default.rgw.data.root++BUCKETNAME : miss
    2019-09-05 20:08:13.950 7fdc8db37580 10 cache put: name=default.rgw.data.root++BUCKETNAME info.flags=0x16
    2019-09-05 20:08:13.950 7fdc8db37580 10 adding default.rgw.data.root++BUCKETNAME to cache LRU end
    2019-09-05 20:08:13.950 7fdc8db37580 10 updating xattr: name=ceph.objclass.version bl.length()=42
    2019-09-05 20:08:13.950 7fdc8db37580 10 cache get: name=default.rgw.data.root++BUCKETNAME : type miss (requested=0x11, cached=0x16)
    2019-09-05 20:08:13.950 7fdc8db37580 10 cache put: name=default.rgw.data.root++BUCKETNAME info.flags=0x11
    2019-09-05 20:08:13.950 7fdc8db37580 10 moving default.rgw.data.root++BUCKETNAME to cache LRU end
    2019-09-05 20:08:13.950 7fdc8db37580 10 cache get: name=default.rgw.data.root++.bucket.meta.BUCKETNAME:48efb8c3-693c-4fe0-bbe4-fdc16f590a82.16313306.1 : miss
    2019-09-05 20:08:13.951 7fdc8db37580 10 cache put: name=default.rgw.data.root++.bucket.meta.BUCKETNAME:48efb8c3-693c-4fe0-bbe4-fdc16f590a82.16313306.1 info.flags=0x16
    2019-09-05 20:08:13.951 7fdc8db37580 10 adding default.rgw.data.root++.bucket.meta.BUCKETNAME:48efb8c3-693c-4fe0-bbe4-fdc16f590a82.16313306.1 to cache LRU end
    2019-09-05 20:08:13.951 7fdc8db37580 10 updating xattr: name=ceph.objclass.version bl.length()=42
    2019-09-05 20:08:13.951 7fdc8db37580 10 updating xattr: name=user.rgw.acl bl.length()=145
    2019-09-05 20:08:13.951 7fdc8db37580 10 updating xattr: name=user.rgw.lc bl.length()=467
    2019-09-05 20:08:13.951 7fdc8db37580 10 cache get: name=default.rgw.data.root++.bucket.meta.BUCKETNAME:48efb8c3-693c-4fe0-bbe4-fdc16f590a82.16313306.1 : type miss (requested=0x13, cached=0x16)
    2019-09-05 20:08:13.951 7fdc8db37580 10 cache put: name=default.rgw.data.root++.bucket.meta.BUCKETNAME:48efb8c3-693c-4fe0-bbe4-fdc16f590a82.16313306.1 info.flags=0x13
    2019-09-05 20:08:13.951 7fdc8db37580 10 moving default.rgw.data.root++.bucket.meta.BUCKETNAME:48efb8c3-693c-4fe0-bbe4-fdc16f590a82.16313306.1 to cache LRU end
    2019-09-05 20:08:13.951 7fdc8db37580 10 updating xattr: name=ceph.objclass.version bl.length()=42
    2019-09-05 20:08:13.951 7fdc8db37580 10 updating xattr: name=user.rgw.acl bl.length()=145
    2019-09-05 20:08:13.951 7fdc8db37580 10 updating xattr: name=user.rgw.lc bl.length()=467
    2019-09-05 20:08:13.951 7fdc8db37580 10 chain_cache_entry: cache_locator=default.rgw.data.root++BUCKETNAME
    2019-09-05 20:08:13.951 7fdc8db37580 10 chain_cache_entry: cache_locator=default.rgw.data.root++.bucket.meta.BUCKETNAME:48efb8c3-693c-4fe0-bbe4-fdc16f590a82.16313306.1
    2019-09-05 20:08:13.951 7fdc8db37580 10 cache get: name=default.rgw.data.root++.bucket.meta.BUCKETNAME:48efb8c3-693c-4fe0-bbe4-fdc16f590a82.16313306.1 : type miss (requested=0x16, cached=0x13)
    2019-09-05 20:08:13.952 7fdc8db37580 10 cache put: name=default.rgw.data.root++.bucket.meta.BUCKETNAME:48efb8c3-693c-4fe0-bbe4-fdc16f590a82.16313306.1 info.flags=0x16
    2019-09-05 20:08:13.952 7fdc8db37580 10 moving default.rgw.data.root++.bucket.meta.BUCKETNAME:48efb8c3-693c-4fe0-bbe4-fdc16f590a82.16313306.1 to cache LRU end
    2019-09-05 20:08:13.952 7fdc8db37580 10 updating xattr: name=ceph.objclass.version bl.length()=42
    2019-09-05 20:08:13.952 7fdc8db37580 10 updating xattr: name=user.rgw.acl bl.length()=145
    2019-09-05 20:08:13.952 7fdc8db37580 10 updating xattr: name=user.rgw.lc bl.length()=467
    2019-09-05 20:08:13.952 7fdc8db37580 10 cache get: name=default.rgw.data.root++.bucket.meta.BUCKETNAME:48efb8c3-693c-4fe0-bbe4-fdc16f590a82.16313306.1 : hit (requested=0x11, cached=0x17)
    2019-09-05 20:08:13.952 7fdc8db37580 10 cls_bucket_list_ordered BUCKETNAME[48efb8c3-693c-4fe0-bbe4-fdc16f590a82.3856921.7] start _multipart_[] num_entries 1001
    2019-09-05 20:08:13.960 7fdc8db37580 2 removed watcher, disabling cache

Any recommendations for forcing a re-creation of the bucket index?

Regards
Manuel

4 years, 8 months


Heavily-linked lists.ceph.com pipermail archive now appears to lead to 404s

by Florian Haas

Hi,

is there any chance the list admins could copy the pipermail archive from lists.ceph.com over to lists.ceph.io? It seems to contain an awful lot of messages referred elsewhere by their archive URL, many (all?) of which appear to now lead to 404s.

Example: google "Set existing pools to use hdd device class only". The top hit is a link to http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/029078.html:

    $ curl -IL http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/029078.html
    HTTP/1.1 301 Moved Permanently
    Server: nginx/1.10.3 (Ubuntu)
    Date: Thu, 29 Aug 2019 12:48:13 GMT
    Content-Type: text/html
    Content-Length: 194
    Connection: keep-alive
    Location: https://lists.ceph.io/pipermail/ceph-users-ceph.com/2018-August/029078.html
    Strict-Transport-Security: max-age=31536000

    HTTP/1.1 404 Not Found
    Server: nginx
    Date: Thu, 29 Aug 2019 12:48:14 GMT
    Content-Type: text/html; charset=utf-8
    Content-Length: 3774
    Connection: keep-alive
    X-Frame-Options: SAMEORIGIN
    Vary: Accept-Language, Cookie
    Content-Language: en

Or maybe this is just a redirect rule that needs to be cleverer or more specific, rather than the apparent catch-all .com/.io redirect?

Cheers,
Florian
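
For anyone wanting to check many archive links at once, here is a minimal sketch (not part of the original post) that turns saved `curl -IL` header output into the chain of status codes and redirect targets. The function name `redirect_chain` and the shortened sample are illustrative assumptions; only the header lines themselves come from the message above.

```python
import re

def redirect_chain(header_dump: str):
    """Return one (status_code, location) tuple per HTTP response in the dump."""
    chain = []
    for line in header_dump.splitlines():
        line = line.strip()
        status = re.match(r"HTTP/\d(?:\.\d)?\s+(\d{3})", line)
        if status:
            # A new status line starts a new hop in the redirect chain.
            chain.append([int(status.group(1)), None])
        elif chain and line.lower().startswith("location:"):
            chain[-1][1] = line.split(":", 1)[1].strip()
    return [tuple(hop) for hop in chain]

# Shortened copy of the headers quoted in the post.
SAMPLE = """\
HTTP/1.1 301 Moved Permanently
Server: nginx/1.10.3 (Ubuntu)
Location: https://lists.ceph.io/pipermail/ceph-users-ceph.com/2018-August/029078.html

HTTP/1.1 404 Not Found
Server: nginx
"""

chain = redirect_chain(SAMPLE)
for status, location in chain:
    print(status, location or "")
```

A chain that ends in 404 after a 301 is the pattern described above: the redirect itself works, but the target path does not exist on the new host.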

4 years, 8 months


disk failure

by solarflow99

One of the things I've come to notice is that when HDDs fail, they often recover in a short time and get added back to the cluster. This causes the data to rebalance back and forth, and if I set the noout flag I get a health warning. Is there a better way to avoid this?

4 years, 8 months


Re: CephFS+NFS For VMWare

by Maged Mokhtar

This is an old thread, but it could be useful for others. I found out that the discrepancy in VMware vmotion speed under iSCSI is probably due to the "emulate_3pc" config attribute for the LIO target. If it is set to 0, then yes, VMware will issue IO in 64KB blocks, so the bandwidth will indeed be around 25 MB/s. If emulate_3pc is set to 1, this will trigger VMware to use VAAI extended copy, which activates LIO's xcopy functionality, which uses 512KB block sizes by default. We also bumped the xcopy block size to 4M (rbd object size), which gives around 400 MB/s vmotion speed; the same speed can also be achieved via Veeam backups.

/Maged

On 02/07/2018 14:36, Maged Mokhtar wrote:
>
> Hi Nick,
>
> With iSCSI we reach over 150 MB/s vmotion for single vm, 1 GB/s for
> 7-8 vm migrations. Since these are 64KB block sizes, latency/iops is a
> large factor, you need either controllers with write back cache or all
> flash . hdds without write cache will suffer even with external wal/db
> on ssds, giving around 80 MB/s vmotion migration. Potentially it may
> be possible to get higher vmotion speeds by using fancy striping but i
> would not recommend this unless your total queue depths in all your
> vms is small compared to the number of osds.
>
> Regarding thin provisioning, a vmdk provisioned as lazy zeroed does
> have an "initial" large impact on random write performance, could be
> up to 10x slower. If you are writing a random 64KB to an un-allocated
> vmfs block, vmfs will first write 1MB to fill the block with zeros
> then write the 64KB client data, so although a lot of data is being
> written the perceived client bandwidth is very low. The performance
> will gradually get better with time until the disk is fully
> provisioned. It is also possible to thick eager zero the vmdk disk at
> creation time. Again this is more apparent with random writes rather
> than sequential or vmotion load.
>
> Maged
>
> On 2018-06-29 18:48, Nick Fisk wrote:
>
>> This is for us peeps using Ceph with VMWare.
>>
>> My current favoured solution for consuming Ceph in VMWare is via
>> RBD’s formatted with XFS and exported via NFS to ESXi. This seems to
>> perform better than iSCSI+VMFS which seems to not play nicely with
>> Ceph’s PG contention issues particularly if working with thin
>> provisioned VMDK’s.
>>
>> I’ve still been noticing some performance issues however, mainly
>> noticeable when doing any form of storage migrations. This is largely
>> due to the way vSphere transfers VM’s in 64KB IO’s at a QD of 32.
>> vSphere does this so Arrays with QOS can balance the IO easier than
>> if larger IO’s were submitted. However Ceph’s PG locking means that
>> only one or two of these IO’s can happen at a time, seriously
>> lowering throughput. Typically you won’t be able to push more than
>> 20-25MB/s during a storage migration
>>
>> There is also another issue in that the IO needed for the XFS journal
>> on the RBD, can cause contention and effectively also means every NFS
>> write IO sends 2 down to Ceph. This can have an impact on latency as
>> well. Due to possible PG contention caused by the XFS journal updates
>> when multiple IO’s are in flight, you normally end up making more and
>> more RBD’s to try and spread the load. This normally means you end up
>> having to do storage migrations…..you can see where I’m getting at here.
>>
>> I’ve been thinking for a while that CephFS works around a lot of
>> these limitations.
>>
>> 1.It supports fancy striping, so should mean there is less per object
>> contention
>>
>> 2.There is no FS in the middle to maintain a journal and other
>> associated IO
>>
>> 3.A single large NFS mount should have none of the disadvantages seen
>> with a single RBD
>>
>> 4.No need to migrate VM’s about because of #3
>>
>> 5.No need to fstrim after deleting VM’s
>>
>> 6.Potential to do away with pacemaker and use LVS to do active/active
>> NFS as ESXi does its own locking with files
>>
>> With this in mind I exported a CephFS mount via NFS and then mounted
>> it to an ESXi host as a test.
>>
>> Initial results are looking very good. I’m seeing storage migrations
>> to the NFS mount going at over 200MB/s, which equates to several
>> thousand IO’s and seems to be writing at the intended QD32.
>>
>> I need to do more testing to make sure everything works as intended,
>> but like I say, promising initial results.
>>
>> Further testing needs to be done to see what sort of MDS performance
>> is required, I would imagine that since we are mainly dealing with
>> large files, it might not be that critical. I also need to consider
>> the stability of CephFS, RBD is relatively simple and is in use by a
>> large proportion of the Ceph community. CephFS is a lot easier to
>> “upset”.
>>
>> Nick
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users(a)lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>

4 years, 8 months
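
The figures in Maged Mokhtar's reply in the "CephFS+NFS For VMWare" thread above can be sanity-checked with a little arithmetic: throughput is block size times operations per second. The sketch below is not from the thread; it assumes the quoted "MB/s" and "KB" figures can be read as MiB/s and KiB, and the helper name `ops_per_second` is made up for illustration.

```python
KIB = 1024
MIB = 1024 * KIB

def ops_per_second(throughput_mib_s: float, block_size_bytes: int) -> float:
    """Copy operations per second needed to sustain a throughput at a block size."""
    return throughput_mib_s * MIB / block_size_bytes

# Figures quoted in the thread (treated here as MiB/s and KiB, an assumption).
cases = [
    ("emulate_3pc=0, 64KB IO", 25, 64 * KIB),
    ("xcopy with 4M blocks", 400, 4 * MIB),
]
for label, mib_s, block in cases:
    print(f"{label}: ~{ops_per_second(mib_s, block):.0f} ops/s")
```

Under those assumptions the 64KB case works out to roughly 400 operations per second and the 4M case to roughly 100, so the higher vmotion speed comes from much larger requests rather than from issuing more of them.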


Best osd scenario + ansible config?

by Yoann Moulin

Hello,

I am deploying a new Nautilus cluster and I would like to know what would be the best OSD scenario config in this case:

    10x 6TB Disk OSDs (data)
    2x 480G SSD previously used for journal and can be used for WAL and/or DB

Is it better to put all WALs on one SSD and all DBs on the other one? Or to put the WAL and DB of the first 5 OSDs on the first SSD and those of the other 5 on the second one?

A more general question: what is the impact on an OSD if we lose the WAL? The DB? Both?

I plan to use EC 7+5 on 12 servers and I am OK if I lose one server temporarily. I have spare servers and I can easily add another one in this cluster.

To deploy this cluster, I use ceph-ansible (stable-4.0). I am not sure how to configure the playbook to use SSD and disks with LVM.

https://github.com/ceph/ceph-ansible/blob/master/docs/source/osds/scenarios…

Is this good?

    osd_objectstore: bluestore
    lvm_volumes:
      - data: data-lv1
        data_vg: data-vg1
        db: db-lv1
        db_vg: db-vg1
        wal: wal-lv1
        wal_vg: wal-vg1
      - data: data-lv2
        data_vg: data-vg2
        db: db-lv2
        db_vg: db-vg2
        wal: wal-lv2
        wal_vg: wal-vg2

Is it possible to let the playbook configure LVM for each disk in a mixed case? It looks like I must configure LVM before running the playbook but I am not sure if I missed something.

Can wal_vg and db_vg be identical (one VG per SSD shared with multiple OSDs)?

Thanks for your help.

Best regards,

-- 
Yoann Moulin
EPFL IC-IT
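
To make the second layout in the question concrete (the WAL and DB of the first 5 OSDs on one SSD, the other 5 on the second), here is a small sketch that generates the corresponding `lvm_volumes` entries. It is not from the post and does not answer whether ceph-ansible accepts a VG shared by db and wal, which is the open question above. The VG/LV names such as `ssd-vg1` are hypothetical, and the logical volumes are assumed to have been created beforehand.

```python
def lvm_volumes(num_osds: int = 10, num_ssds: int = 2):
    """Build lvm_volumes entries, splitting the OSDs evenly across the SSDs."""
    per_ssd = -(-num_osds // num_ssds)  # ceiling division
    entries = []
    for i in range(1, num_osds + 1):
        ssd = (i - 1) // per_ssd + 1
        entries.append({
            "data": f"data-lv{i}",
            "data_vg": f"data-vg{i}",
            "db": f"db-lv{i}",
            "db_vg": f"ssd-vg{ssd}",    # hypothetical: one VG per SSD
            "wal": f"wal-lv{i}",
            "wal_vg": f"ssd-vg{ssd}",   # same VG as the DB (assumption)
        })
    return entries

def to_yaml(entries):
    """Render the entries in the same YAML shape as the example in the post."""
    lines = ["osd_objectstore: bluestore", "lvm_volumes:"]
    for entry in entries:
        prefix = "  - "
        for key, value in entry.items():
            lines.append(f"{prefix}{key}: {value}")
            prefix = "    "
    return "\n".join(lines)

volumes = lvm_volumes()
print(to_yaml(volumes))
```

Generating the list rather than typing it by hand mainly helps keep the OSD-to-SSD mapping consistent across 12 servers.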

4 years, 8 months


ceph fs crashes on simple fio test

by Frank Schilder

I need to harden our ceph cluster to satisfy the following properties:

Assuming all hardware is functioning properly,

1) Cluster health has highest priority. Heartbeats have priority over client requests.
2) The cluster does not accept more IO than the OSDs can handle. The only exception might be a configurable burst option.
3) Client IO is accepted as long as it does not compromise 1.
4) Ideally, there is fair sharing of the cluster's IO budget between clients (like deadline or completely fair scheduling). Rogue clients should not get priority just because they push a lot.

Unfortunately, with default settings our (any?) cluster prefers client IO over cluster health, which opens the door to a simple but serious non-privileged client attack on cluster health:

I observed a serious issue on our cluster when running a simple fio test that does 4K random writes on 100GB files (see details below). What I observe is that within a few seconds the cluster goes to health_warn with the MDS reporting slow metadata IO and being behind on trimming. What is not shown in ceph health detail is that all OSDs report thousands of slow ops and the counter increases really fast (I include some snippets below). This goes rapidly to the point where OSDs start missing heartbeats and start flapping, and some PGs become inactive+degraded and start peering regularly. CPU load, memory and even network load were rather low (one-figure % CPU load). Nothing out of the ordinary. The mons had no issues.

After the fio test completed, the OSDs slowly crunched through the backlog and managed to complete all OPS. The complete processing of the ops of a 30 second fio test took ca. 10 minutes. However, even then the cluster did not come back healthy, as essential messages between daemons seem to have been lost. It is not a long stretch to assume that one can destroy a ceph fs beyond repair when running this test, or an application performing the same IO pattern, from multiple clients for several hours. We have a 500 node cluster as clients and I'm afraid that even ordinary IO might trigger this scenario in unlucky circumstances.

Since our cluster is only half-trusted (we have root access to clients, but no control over user IO patterns), we need to harden our cluster against such destructive IO patterns as much as possible.

To me it looks like the cluster is accepting way more IO than the OSDs can handle. Ideally, what I would like to do is configure effective rate limiting on (rogue) clients depending on how many OPS they have in flight already. I would expect that there are tunables for MDS/OSD daemons that control how many IO requests a client can submit/an OSD will accept before throttling IO. In particular, I would like to prioritize heartbeats to prevent load-induced OSD flapping. How can I tune the cluster to satisfy the conditions outlined at the top of this e-mail?

There were recent threads with similar topics, in particular, "MDS failing under load with large cache sizes" and others reporting unstable MDS daemons under load. However, I believe they were mostly related to cache trimming issues due to large amounts of files created. This is not the case here, it's just 4 files with lots of random IO from a single client.

A bit of information about our cluster and observations:

The cluster is bluestore-only with an 8+2 EC fs data pool on spinning disks and a 3(2) replicated fs metadata pool on SSD. We have 8 OSD hosts with 2 shards per host. Each host has 4 SSDs and 12 10TB HDDs SAS 12GB with 4k block size. Network is 2x10G bonded for client and 2x10G bonded for replication. The replication network will be extended to 4x10G soon. We are aware that currently the network bandwidth greatly exceeds what the spinning disks can handle. It is dimensioned for adding more disks in the future.
The fio job script is:

    [global]
    name=fio-rand-write
    directory=/home/fio # /home is on ceph fs
    filename_format=tmp/fio-$jobname-${HOSTNAME}-$jobnum-$filenum
    rw=randwrite
    bs=4K
    numjobs=4
    time_based=1
    runtime=30

    [file1]
    size=100G
    ioengine=sync

It's one of the examples in the fio source repo with small modifications (file pattern, run time). Shortly after starting this fio job from a client connected with single 10G line, all OSDs start reporting slow ops. Picking one, the log messages look like this:

    Aug 22 12:06:22 ceph-09 ceph-osd: 2019-08-22 10:06:22.151 7f399d1bd700 -1 osd.125 3778 get_health_metrics reporting 146 slow ops, oldest is osd_op(client.1178165.0:1670940 5.3fs0 5:fcc65fe3:::10000f403bc.0000063e:head [write 307200~4096,write 315392~8192,write 380928~12288,write 405504~12288,write 421888~4096,write 458752~4096,write 466944~4096,write 475136~4096,write 487424~8192,write 512000~4096,write 524288~4096,write 565248~4096,write 589824~8192,write 622592~4096,write 651264~4096,write 724992~12288] snapc 12e=[] ondisk+write+known_if_redirected e3778)
    ...
    Aug 22 12:06:49 ceph-09 journal: 2019-08-22 10:06:49.415 7f399d1bd700 -1 osd.125 3779 get_health_metrics reporting 4595 slow ops, oldest is osd_op(client.1178165.0:1686216 5.acs0 5:354e96d5:::10000f403bc.00000bc1:head [write 2359296~4096,write 2375680~4096,write 2383872~4096,write 2404352~4096,write 2428928~8192,write 2469888~4096,write 2490368~8192,write 2514944~4096,write 2527232~4096,write 2535424~4096,write 2588672~4096,write 2600960~4096,write 2621440~8192,write 2658304~4096,write 2715648~8192,write 2727936~4096] snapc 12e=[] ondisk+write+known_if_redirected e3778)
    ...
    Aug 22 12:12:57 ceph-09 journal: 2019-08-22 10:12:57.650 7f399d1bd700 -1 osd.125 3839 get_health_metrics reporting 8419 slow ops, oldest is osd_op(client.1178165.0:2009417 5.3fs0 5:fcdcf2bd:::10000f47e2d.00001501:head [write 2236416~4096,write 2256896~4096,write 2265088~4096,write 2301952~4096,write 2322432~4096,write 2355200~4096,write 2371584~4096,write 2387968~4096,write 2449408~4096,write 2486272~4096,write 2547712~8192,write 2617344~4096,write 2809856~4096,write 3018752~4096,write 3194880~4096,write 3223552~4096] snapc 12e=[] ondisk+write+known_if_redirected e3839)
    Aug 22 12:12:58 ceph-09 ceph-osd: 2019-08-22 10:12:58.681 7f399d1bd700 -1 osd.125 3839 get_health_metrics reporting 8862 slow ops, oldest is osd_op(mds.0.16909:69577511 5.ees0 5.992388ee (undecoded) ondisk+write+known_if_redirected+full_force e3839)
    ...
    Aug 22 12:13:27 ceph-09 journal: 2019-08-22 10:13:27.691 7f399d1bd700 -1 osd.125 3839 get_health_metrics reporting 13795 slow ops, oldest is osd_op(mds.0.16909:69577573 5.e6s0 5.d8994de6 (undecoded) ondisk+write+known_if_redirected+full_force e3839)
    ...
    Aug 22 12:13:59 ceph-09 ceph-osd: 2019-08-22 10:13:59.762 7f399d1bd700 -1 osd.125 3900 get_health_metrics reporting 12 slow ops, oldest is osd_op(mds.0.16909:69577511 5.ees0 5.992388ee (undecoded) ondisk+retry+write+known_if_redirected+full_force e3875)
    ...
    Aug 22 12:14:46 ceph-09 journal: 2019-08-22 10:14:46.569 7f399d1bd700 -1 osd.125 3916 get_health_metrics reporting 969 slow ops, oldest is osd_op(mds.0.16909:69577511 5.ees0 5.992388ee (undecoded) ondisk+retry+write+known_if_redirected+full_force e3875)
    Aug 22 12:14:47 ceph-09 ceph-osd: 2019-08-22 10:14:47.617 7f399d1bd700 -1 osd.125 3935 get_health_metrics reporting 1 slow ops, oldest is osd_op(mds.0.16909:69577511 5.ees0 5:7711c499:::10000f4798b.00000000:head [create,setxattr parent (289),setxattr layout (30)] snapc 0=[] RETRY=2 ondisk+retry+write+known_if_redirected+full_force e3875)
    ...
    Aug 22 12:14:53 ceph-09 journal: 2019-08-22 10:14:53.675 7f399d1bd700 -1 osd.125 3939 get_health_metrics reporting 1 slow ops, oldest is osd_op(mds.0.16909:69577511 5.ees0 5:7711c499:::10000f4798b.00000000:head [create,setxattr parent (289),setxattr layout (30)] snapc 0=[] RETRY=2 ondisk+retry+write+known_if_redirected+full_force e3875)

This is the last log message, the OSD seems to have executed all OPS at this point or shortly after. The cluster state went to health_ok - at least shortly. At about 12:45 (yes, lunch break) we looked at the cluster again and it was back in health_warn with the following status output:

    [root@ceph-01 ~]# ceph status
      cluster:
        id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
        health: HEALTH_WARN
                1 MDSs report slow metadata IOs
                2 MDSs behind on trimming

      services:
        mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
        mgr: ceph-01(active), standbys: ceph-02, ceph-03
        mds: con-fs-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
        osd: 192 osds: 192 up, 192 in

      data:
        pools:   7 pools, 790 pgs
        objects: 9.01 M objects, 16 TiB
        usage:   20 TiB used, 1.3 PiB / 1.3 PiB avail
        pgs:     790 active+clean

      io:
        client:   1.9 MiB/s rd, 21 MiB/s wr, 60 op/s rd, 721 op/s wr

    [root@ceph-01 ~]# ceph health detail
    HEALTH_WARN 1 MDSs report slow metadata IOs; 2 MDSs behind on trimming
    MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
        mdsceph-08(mds.0): 2 slow metadata IOs are blocked > 30 secs, oldest blocked for 3275 secs
    MDS_TRIM 2 MDSs behind on trimming
        mdsceph-08(mds.0): Behind on trimming (1778/128) max_segments: 128, num_segments: 1778
        mdsceph-12(mds.0): Behind on trimming (1780/128) max_segments: 128, num_segments: 1780

Num_segments was increasing constantly. Apparently, an operation got stuck and never completed.
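
The growth and decay of the backlog is easier to see when the slow-op counts are pulled out of the OSD log lines. The following is a rough sketch, not something from the original post: it assumes log lines shaped like the `get_health_metrics` messages quoted above, and the sample has its `osd_op(...)` payloads shortened for brevity.

```python
import re

# Matches the ceph-osd part of each line: timestamp, thread id, "-1", OSD name,
# epoch, then the slow-op count reported by get_health_metrics.
SLOW_OPS = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ \S+ +-1 "
    r"(?P<osd>osd\.\d+) \d+ get_health_metrics reporting (?P<count>\d+) slow ops"
)

def slow_ops_timeline(log_text):
    """Return (timestamp, osd, slow_op_count) for each get_health_metrics line."""
    return [
        (m.group("ts"), m.group("osd"), int(m.group("count")))
        for m in SLOW_OPS.finditer(log_text)
    ]

# Three of the lines from the post, with the osd_op details elided.
SAMPLE = """\
Aug 22 12:06:22 ceph-09 ceph-osd: 2019-08-22 10:06:22.151 7f399d1bd700 -1 osd.125 3778 get_health_metrics reporting 146 slow ops, oldest is osd_op(...)
Aug 22 12:13:27 ceph-09 journal: 2019-08-22 10:13:27.691 7f399d1bd700 -1 osd.125 3839 get_health_metrics reporting 13795 slow ops, oldest is osd_op(...)
Aug 22 12:14:53 ceph-09 journal: 2019-08-22 10:14:53.675 7f399d1bd700 -1 osd.125 3939 get_health_metrics reporting 1 slow ops, oldest is osd_op(...)
"""

timeline = slow_ops_timeline(SAMPLE)
for ts, osd, count in timeline:
    print(ts, osd, count)
peak = max(timeline, key=lambda row: row[2])
print("peak:", peak)
```

Run against the full log of every OSD, a timeline like this would show whether all OSDs back up at the same time or whether a few of them lead.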
We hunted a little bit and found

    [root@ceph-mds:ceph-08 /]# ceph daemon mds.ceph-08 objecter_requests
    {
      "ops": [
        { "tid": 71206302, "pg": "4.952611b7", "osd": 19,
          "object_id": "200.0001be74", "object_locator": "@4",
          "target_object_id": "200.0001be74", "target_object_locator": "@4",
          "paused": 0, "used_replica": 0, "precalc_pgid": 0,
          "last_sent": "268516s", "attempts": 1, "snapid": "head", "snap_context": "0=[]",
          "mtime": "2019-08-22 11:12:41.0.565912s",
          "osd_ops": [ "write 890478~1923" ] },
        { "tid": 71206303, "pg": "4.952611b7", "osd": 19,
          "object_id": "200.0001be74", "object_locator": "@4",
          "target_object_id": "200.0001be74", "target_object_locator": "@4",
          "paused": 0, "used_replica": 0, "precalc_pgid": 0,
          "last_sent": "268516s", "attempts": 1, "snapid": "head", "snap_context": "0=[]",
          "mtime": "2019-08-22 11:12:41.0.566236s",
          "osd_ops": [ "write 892401~1931" ] },
        { "tid": 71206301, "pg": "5.eeb918c6", "osd": 67,
          "object_id": "10000f26f67.00000000", "object_locator": "@5",
          "target_object_id": "10000f26f67.00000000", "target_object_locator": "@5",
          "paused": 0, "used_replica": 0, "precalc_pgid": 0,
          "last_sent": "268516s", "attempts": 1, "snapid": "head", "snap_context": "12e=[]",
          "mtime": "1970-01-01 00:00:00.000000s",
          "osd_ops": [ "trimtrunc 81854@573" ] },
        { "tid": 69577573, "pg": "5.d8994de6", "osd": 125,
          "object_id": "10000f479c9.00000000", "object_locator": "@5",
          "target_object_id": "10000f479c9.00000000", "target_object_locator": "@5",
          "paused": 0, "used_replica": 0, "precalc_pgid": 0,
          "last_sent": "265042s", "attempts": 5, "snapid": "head", "snap_context": "0=[]",
          "mtime": "2019-08-22 10:12:05.0.256058s",
          "osd_ops": [ "create", "setxattr parent (319)", "setxattr layout (30)" ] },
        { "tid": 69577598, "pg": "5.deb003e6", "osd": 125,
          "object_id": "10000f479da.00000000", "object_locator": "@5",
          "target_object_id": "10000f479da.00000000", "target_object_locator": "@5",
          "paused": 0, "used_replica": 0, "precalc_pgid": 0,
          "last_sent": "265042s", "attempts": 5, "snapid": "head", "snap_context": "0=[]",
          "mtime": "2019-08-22 10:12:05.0.258824s",
          "osd_ops": [ "create", "setxattr parent (288)", "setxattr layout (30)" ] },
        { "tid": 71206300, "pg": "5.5cd5b20b", "osd": 163,
          "object_id": "10000f01396.00000000", "object_locator": "@5",
          "target_object_id": "10000f01396.00000000", "target_object_locator": "@5",
          "paused": 0, "used_replica": 0, "precalc_pgid": 0,
          "last_sent": "268516s", "attempts": 1, "snapid": "head", "snap_context": "12e=[]",
          "mtime": "1970-01-01 00:00:00.000000s",
          "osd_ops": [ "trimtrunc 208782@573" ] }
      ],
      "linger_ops": [],
      "pool_ops": [],
      "pool_stat_ops": [],
      "statfs_ops": [],
      "command_ops": []
    }

Notice the ops from the 1970's. Checking ops and dump_blocked_ops on osd.19 showed that these lists were empty. So, we decided to restart osd.19 and it cleared out most of the stuck requests, but did not clear the health warnings:

    [root@ceph-mds:ceph-08 /]# ceph daemon mds.ceph-08 objecter_requests
    {
      "ops": [
        { "tid": 69577573, "pg": "5.d8994de6", "osd": 125,
          "object_id": "10000f479c9.00000000", "object_locator": "@5",
          "target_object_id": "10000f479c9.00000000", "target_object_locator": "@5",
          "paused": 0, "used_replica": 0, "precalc_pgid": 0,
          "last_sent": "265042s", "attempts": 5, "snapid": "head", "snap_context": "0=[]",
          "mtime": "2019-08-22 10:12:05.0.256058s",
          "osd_ops": [ "create", "setxattr parent (319)", "setxattr layout (30)" ] },
        { "tid": 69577598, "pg": "5.deb003e6", "osd": 125,
          "object_id": "10000f479da.00000000", "object_locator": "@5",
          "target_object_id": "10000f479da.00000000", "target_object_locator": "@5",
          "paused": 0, "used_replica": 0, "precalc_pgid": 0,
          "last_sent": "265042s", "attempts": 5, "snapid": "head", "snap_context": "0=[]",
          "mtime": "2019-08-22 10:12:05.0.258824s",
          "osd_ops": [ "create", "setxattr parent (288)", "setxattr layout (30)" ] }
      ],
      "linger_ops": [],
      "pool_ops": [],
      "pool_stat_ops": [],
      "statfs_ops": [],
      "command_ops": []
    }

Restarting osd.125 finally resolved the health issues.
However, the client I ran fio on had lost its connection to ceph due to this incident, which is really annoying. This client is the head node of our HPC cluster and it was not possible to restore ceph fs access without a reboot. This is additional bad fallout, as all users will lose access to our HPC cluster when this happens (/home is on the ceph fs).

I dumped the dump_historic_slow_ops of osd.125 in case anyone can use this information.

I might be able to repeat this experiment, but cannot promise anything.

Best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
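
The hunt through the `objecter_requests` output in this post amounts to asking which OSDs the MDS is still waiting on. Below is a small sketch of that grouping, not taken from the post: the helper name `stuck_ops_by_osd` is made up, and the sample keeps only the `tid`, `osd`, `attempts` and `osd_ops` fields of the six ops shown in the first dump.

```python
import json
from collections import defaultdict

def stuck_ops_by_osd(objecter_json: str):
    """Group the in-flight objecter ops by the OSD they were sent to."""
    by_osd = defaultdict(list)
    for op in json.loads(objecter_json)["ops"]:
        by_osd[op["osd"]].append((op["tid"], op["attempts"], op["osd_ops"]))
    return dict(by_osd)

# Reduced copy of the first dump: same tids, OSDs, attempts and osd_ops.
SAMPLE = json.dumps({"ops": [
    {"tid": 71206302, "osd": 19, "attempts": 1, "osd_ops": ["write 890478~1923"]},
    {"tid": 71206303, "osd": 19, "attempts": 1, "osd_ops": ["write 892401~1931"]},
    {"tid": 71206301, "osd": 67, "attempts": 1, "osd_ops": ["trimtrunc 81854@573"]},
    {"tid": 69577573, "osd": 125, "attempts": 5,
     "osd_ops": ["create", "setxattr parent (319)", "setxattr layout (30)"]},
    {"tid": 69577598, "osd": 125, "attempts": 5,
     "osd_ops": ["create", "setxattr parent (288)", "setxattr layout (30)"]},
    {"tid": 71206300, "osd": 163, "attempts": 1, "osd_ops": ["trimtrunc 208782@573"]},
]})

grouped = stuck_ops_by_osd(SAMPLE)
for osd, ops in sorted(grouped.items()):
    retries = max(attempts for _, attempts, _ in ops)
    print(f"osd.{osd}: {len(ops)} op(s), max attempts {retries}")
```

In this sample the ops with 5 attempts all point at osd.125, the OSD whose restart finally cleared the warnings.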

4 years, 8 months


bluestore_default_buffered_write

by Fyodor Ustinov

Hi!

Can anybody help me: if I turn on bluestore_default_buffered_write, will I get write-back or write-through behaviour? We could not work this out from the documentation.

And a second question: is there, in general, an analog of a write-back cache in the OSD? (I perfectly understand the danger of such a cache.)

WBR,
Fyodor.

4 years, 8 months

