Hi, Cephers.
I would like to hear your ideas about a strange situation we have in one of
our clusters.
It's a Luminous 12.2.12 cluster. Recently we added 3 nodes with 10x SSD OSDs
and dedicated them to the SSD pool for our OpenStack volumes. Initial
tests went well, IOPS were great, throughput was perfect - all good. Until
the first real usage arrived: very limited IOPS (~450), disk utilization
near 100% and throughput below 1 MB/s brought us to tears.
After some investigation we found that this situation only occurs when all
of the following conditions are met:
1. The disk is RBD (the test went fine from the same server with local disks)
2. The file system is XFS (no problems with ext4)
3. The write size is smaller than the file system block size
4. Only one fio job (numjobs=1) is used
When at least one of these conditions is not met, we get ~40k IOPS, great
throughput, etc. We ran tests with fio using different values, and the
pattern is quite clear: if the write size is 4 KB (same as the block size),
IOPS go up to 40k. If the write size is 3 KB, it is limited to ~450 IOPS,
and from that point it doesn't matter how small the write is - it's always
~450 IOPS. After changing the block size to 2 KB the situation is the same -
great speed until the write is smaller than 2 KB. If we raise the fio
parameter "numjobs" to 10 we get the maximum possible IOPS (~40k), which is
more than a simple 10x increase.
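For clarity, this is roughly the fio invocation for the slow case (a minimal
sketch, not our exact job file; the mount point, size and runtime are only
illustrative):

  # 3k writes on a 4k-block XFS file system on the RBD volume, single job
  fio --name=rbd-xfs-test --directory=/mnt/rbd-xfs --ioengine=libaio --direct=1 \
      --rw=randwrite --bs=3k --size=1G --numjobs=1 --iodepth=32 \
      --runtime=60 --time_based --group_reporting

With --bs=4k (matching the block size) or --numjobs=10 the same job reaches
~40k IOPS.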
Any ideas what is going on, and why these sub-block-size writes have such a
big impact on performance with XFS but cause no problems with ext4?
Thank you for all the ideas!
Arvydas
Hi,
We hit the following assert:
-10001> 2020-02-13 17:42:35.543 7f11b5669700 -1 /build/ceph-13.2.8/src/mds/MDCache.cc: In function 'MDRequestRef MDCache::request_get(metareqid_t)' thread 7f11b5669700 time 2020-02-13 17:42:35.545815
/build/ceph-13.2.8/src/mds/MDCache.cc: 9523: FAILED assert(p != active_requests.end())
ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7f11bd8e69de]
2: (()+0x287b67) [0x7f11bd8e6b67]
3: (MDCache::request_get(metareqid_t)+0x94) [0x560cde8bb214]
4: (Server::journal_close_session(Session*, int, Context*)+0x9dd) [0x560cde829d1d]
5: (Server::handle_client_session(MClientSession*)+0x1071) [0x560cde82b0f1]
6: (Server::dispatch(Message*)+0x30b) [0x560cde86f87b]
7: (MDSRank::handle_deferrable_message(Message*)+0x434) [0x560cde7e1664]
8: (MDSRank::_dispatch(Message*, bool)+0x89b) [0x560cde7f8c7b]
9: (MDSRankDispatcher::ms_dispatch(Message*)+0xa3) [0x560cde7f92e3]
10: (MDSDaemon::ms_dispatch(Message*)+0xd3) [0x560cde7d92b3]
11: (DispatchQueue::entry()+0xb92) [0x7f11bd9a9e52]
12: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f11bda46e2d]
13: (()+0x76db) [0x7f11bd1d76db]
14: (clone()+0x3f) [0x7f11bc3bd88f]
Before we hit this assert there were a few kernel clients (5.3.0-26/28)
that were not playing nicely:
16:32 < bitrot> mds.mds1 [WRN] client.61994841 isn't responding to mclientcaps(revoke), ino 0x1003846ddc5 pending
pAsLsXsFscr issued pAsLsXsFscr, sent 62.342791 seconds ago
16:32 < bitrot> mon.mon1 [WRN] Health check failed: 1 clients failing to respond to capability release
(MDS_CLIENT_LATE_RELEASE)
We rebooted both clients. After that, one of them again had some slow
requests. We unmounted the file system, and shortly after that the MDS hit
the assert. Failover went fine this time.
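In case it is useful for others hitting MDS_CLIENT_LATE_RELEASE: the sessions
can be inspected, and if necessary evicted, roughly like this (a sketch; the
MDS name and client id are just the ones from the warning above):

  # list client sessions and the caps they hold on the active MDS
  ceph tell mds.mds1 client ls
  # evict a client that keeps failing to release caps
  ceph tell mds.mds1 client evict id=61994841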
This looks like issue https://tracker.ceph.com/issues/23059 ... but
that should already have been resolved. Is this the same issue, and/or a
regression?
We run 13.2.8.
Thanks,
Stefan
--
| BIT BV https://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / info(a)bit.nl
Hi,
the current output of ceph -s reports a warning:
2 slow ops, oldest one blocked for 347335 sec, mon.ld5505 has slow ops
This time is increasing.
root@ld3955:~# ceph -s
cluster:
id: 6b1b5117-6e08-4843-93d6-2da3cf8a6bae
health: HEALTH_WARN
9 daemons have recently crashed
2 slow ops, oldest one blocked for 347335 sec, mon.ld5505
has slow ops
services:
mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 3d)
mgr: ld5507(active, since 8m), standbys: ld5506, ld5505
mds: cephfs:2 {0=ld5507=up:active,1=ld5505=up:active} 2
up:standby-replay 3 up:standby
osd: 442 osds: 442 up (since 8d), 442 in (since 9d)
data:
pools: 7 pools, 19628 pgs
objects: 65.78M objects, 251 TiB
usage: 753 TiB used, 779 TiB / 1.5 PiB avail
pgs: 19628 active+clean
io:
client: 427 KiB/s rd, 22 MiB/s wr, 851 op/s rd, 647 op/s wr
The details are as follows:
root@ld3955:~# ceph health detail
HEALTH_WARN 9 daemons have recently crashed; 2 slow ops, oldest one
blocked for 347755 sec, mon.ld5505 has slow ops
RECENT_CRASH 9 daemons have recently crashed
mds.ld4464 crashed on host ld4464 at 2020-02-09 07:33:59.131171Z
mds.ld5506 crashed on host ld5506 at 2020-02-09 07:42:52.036592Z
mds.ld4257 crashed on host ld4257 at 2020-02-09 07:47:44.369505Z
mds.ld4464 crashed on host ld4464 at 2020-02-09 06:10:24.515912Z
mds.ld5507 crashed on host ld5507 at 2020-02-09 07:13:22.400268Z
mds.ld4257 crashed on host ld4257 at 2020-02-09 06:48:34.742475Z
mds.ld5506 crashed on host ld5506 at 2020-02-09 06:10:24.680648Z
mds.ld4465 crashed on host ld4465 at 2020-02-09 06:52:33.204855Z
mds.ld5506 crashed on host ld5506 at 2020-02-06 07:59:37.089007Z
SLOW_OPS 2 slow ops, oldest one blocked for 347755 sec, mon.ld5505 has
slow ops
There are no errors on the services (mgr, mon, osd).
Can you please advise how to identify the root cause of these slow ops?
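So far I only looked at ceph -s and ceph health detail. I assume the blocked
ops can be dumped from the monitor's admin socket roughly like this (untested
on my side; to be run on the host of mon.ld5505):

  # show the operations this monitor is currently tracking, including how
  # long each has been blocked and in which state it is stuck
  ceph daemon mon.ld5505 ops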
THX
Hi ceph enthusiasts,
We have a Ceph cluster with CephFS and two pools: a replicated one for metadata on SSD and an EC (4+2) one on HDD. Recently we expanded from 4 to 7 nodes and now want to change the failure domain of the erasure-coded pool from 'osd' to 'host'.
What we did was create a new CRUSH rule and change the rule of our EC pool (a rough sketch of the commands is included after the details below). It still uses the old profile. Details can be found below.
Now there are a couple of questions:
1) Is this equivalent to changing the profile? Below you can see 'crush-failure-domain=osd' in the profile, but '"op": "chooseleaf_indep", "type": "host"' in the crush rule.
2) If we do need to change the failure domain in the profile, can this be done without creating a new pool, which seems troublesome?
3) Finally, if we really need to create a new pool to do this... what is the best way? For the record: our cluster is now (after the expansion) ~40% full (400 TB / 1 PB) with 173 OSDs.
Cheers,
Max
some more details:
[root@ceph-node-a ~]# ceph osd lspools
1 ec42
2 cephfs_metadata
[root@ceph-node-a ~]# ceph osd pool get ec42 erasure_code_profile
erasure_code_profile: ec42
[root@ceph-node-a ~]# ceph osd pool get ec42 crush_rule
crush_rule: ec42_host_hdd
[root@ceph-node-a ~]# ceph osd erasure-code-profile get ec42
crush-device-class=
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8
[root@ceph-node-a ~]# ceph osd crush rule dump ec42_host_hdd
{
    "rule_id": 6,
    "rule_name": "ec42_host_hdd",
    "ruleset": 6,
    "type": 3,
    "min_size": 3,
    "max_size": 6,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -2,
            "item_name": "default~hdd"
        },
        {
            "op": "chooseleaf_indep",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
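For reference, the rule swap we did was roughly the following (a sketch
rather than the literal commands; the new profile is only used to generate
the rule, which is exactly what question 1 above is about):

  # profile with host failure domain, used only to create the new rule
  ceph osd erasure-code-profile set ec42_host k=4 m=2 \
      crush-failure-domain=host crush-device-class=hdd
  ceph osd crush rule create-erasure ec42_host_hdd ec42_host
  # point the existing EC pool at the new rule; PGs then backfill to comply
  ceph osd pool set ec42 crush_rule ec42_host_hdd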
Hi,
We would like to replace the current Seagate ST4000NM0034 HDDs in our
Ceph cluster with SSDs, and before doing that we would like to check
the typical usage of our current drives over the last years, so we can
select the best (price/performance/endurance) SSD to replace them with.
I am trying to extract this info from the fields "Blocks received from
initiator" / "Blocks sent to initiator", as these are the fields
smartctl reports for the Seagate disks. But the numbers seem strange, and
I would like to ask for feedback here.
Three nodes, all equal, 8 OSDs per node, all 4TB ST4000NM0034
(filestore) HDDs with SSD-based journals:
> root@node1:~# ceph osd crush tree
> ID CLASS WEIGHT TYPE NAME
> -1 87.35376 root default
> -2 29.11688 host node1
> 0 hdd 3.64000 osd.0
> 1 hdd 3.64000 osd.1
> 2 hdd 3.63689 osd.2
> 3 hdd 3.64000 osd.3
> 12 hdd 3.64000 osd.12
> 13 hdd 3.64000 osd.13
> 14 hdd 3.64000 osd.14
> 15 hdd 3.64000 osd.15
> -3 29.12000 host node2
> 4 hdd 3.64000 osd.4
> 5 hdd 3.64000 osd.5
> 6 hdd 3.64000 osd.6
> 7 hdd 3.64000 osd.7
> 16 hdd 3.64000 osd.16
> 17 hdd 3.64000 osd.17
> 18 hdd 3.64000 osd.18
> 19 hdd 3.64000 osd.19
> -4 29.11688 host node3
> 8 hdd 3.64000 osd.8
> 9 hdd 3.64000 osd.9
> 10 hdd 3.64000 osd.10
> 11 hdd 3.64000 osd.11
> 20 hdd 3.64000 osd.20
> 21 hdd 3.64000 osd.21
> 22 hdd 3.64000 osd.22
> 23 hdd 3.63689 osd.23
We are looking at the numbers from smartctl, basing our calculations
on this output for each individual OSD:
> Vendor (Seagate) cache information
> Blocks sent to initiator = 3783529066
> Blocks received from initiator = 3121186120
> Blocks read from cache and sent to initiator = 545427169
> Number of read and write commands whose size <= segment size = 93877358
> Number of read and write commands whose size > segment size = 2290879
I created the following spreadsheet:
> blocks sent blocks received total blocks
> to initiator from initiator calculated read% write% aka
> node1
> osd0 905060564 1900663448 2805724012 32,26% 67,74% sda
> osd1 2270442418 3756215880 6026658298 37,67% 62,33% sdb
> osd2 3531938448 3940249192 7472187640 47,27% 52,73% sdc
> osd3 2824808123 3130655416 5955463539 47,43% 52,57% sdd
> osd12 1956722491 1294854032 3251576523 60,18% 39,82% sdg
> osd13 3410188306 1265443936 4675632242 72,94% 27,06% sdh
> osd14 3765454090 3115079112 6880533202 54,73% 45,27% sdi
> osd15 2272246730 2218847264 4491093994 50,59% 49,41% sdj
>
> node2
> osd4 3974937107 740853712 4715790819 84,29% 15,71% sda
> osd5 1181377668 2109150744 3290528412 35,90% 64,10% sdb
> osd6 1903438106 608869008 2512307114 75,76% 24,24% sdc
> osd7 3511170043 724345936 4235515979 82,90% 17,10% sdd
> osd16 2642731906 3981984640 6624716546 39,89% 60,11% sdg
> osd17 3994977805 3703856288 7698834093 51,89% 48,11% sdh
> osd18 3992157229 2096991672 6089148901 65,56% 34,44% sdi
> osd19 279766405 1053039640 1332806045 20,99% 79,01% sdj
>
> node3
> osd8 3711322586 234696960 3946019546 94,05% 5,95% sda
> osd9 1203912715 3132990000 4336902715 27,76% 72,24% sdb
> osd10 912356010 1681434416 2593790426 35,17% 64,83% sdc
> osd11 810488345 2626589896 3437078241 23,58% 76,42% sdd
> osd20 1506879946 2421596680 3928476626 38,36% 61,64% sdg
> osd21 2991526593 7525120 2999051713 99,75% 0,25% sdh
> osd22 29560337 3226114552 3255674889 0,91% 99,09% sdi
> osd23 2019195656 2563506320 4582701976 44,06% 55,94% sdj
But as can be seen above, this results in some very strange numbers; for
example for node3/osd21, node2/osd19 and node3/osd8 the ratios look
implausible. So probably we're doing something wrong in our logic here.
Can someone explain what we're doing wrong? And is it possible to obtain
stats like these from Ceph directly - does Ceph keep historical stats like
the above?
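I assume that at least the per-OSD op and byte counters can also be read from
Ceph directly via the OSD admin socket, e.g. (these reset whenever the daemon
restarts, so they are not lifetime stats like the SMART counters):

  # on the OSD host: counters since the daemon last started; the interesting
  # ones are op_r / op_w (ops) and op_in_bytes / op_out_bytes (client bytes)
  ceph daemon osd.0 perf dump osd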
MJ
Hi,
the current output of ceph -s reports a warning:
9 daemons have recently crashed
root@ld3955:~# ceph -s
cluster:
id: 6b1b5117-6e08-4843-93d6-2da3cf8a6bae
health: HEALTH_WARN
9 daemons have recently crashed
2 slow ops, oldest one blocked for 347335 sec, mon.ld5505
has slow ops
services:
mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 3d)
mgr: ld5507(active, since 8m), standbys: ld5506, ld5505
mds: cephfs:2 {0=ld5507=up:active,1=ld5505=up:active} 2
up:standby-replay 3 up:standby
osd: 442 osds: 442 up (since 8d), 442 in (since 9d)
data:
pools: 7 pools, 19628 pgs
objects: 65.78M objects, 251 TiB
usage: 753 TiB used, 779 TiB / 1.5 PiB avail
pgs: 19628 active+clean
io:
client: 427 KiB/s rd, 22 MiB/s wr, 851 op/s rd, 647 op/s wr
The details are as follows:
root@ld3955:~# ceph health detail
HEALTH_WARN 9 daemons have recently crashed; 2 slow ops, oldest one
blocked for 347755 sec, mon.ld5505 has slow ops
RECENT_CRASH 9 daemons have recently crashed
mds.ld4464 crashed on host ld4464 at 2020-02-09 07:33:59.131171Z
mds.ld5506 crashed on host ld5506 at 2020-02-09 07:42:52.036592Z
mds.ld4257 crashed on host ld4257 at 2020-02-09 07:47:44.369505Z
mds.ld4464 crashed on host ld4464 at 2020-02-09 06:10:24.515912Z
mds.ld5507 crashed on host ld5507 at 2020-02-09 07:13:22.400268Z
mds.ld4257 crashed on host ld4257 at 2020-02-09 06:48:34.742475Z
mds.ld5506 crashed on host ld5506 at 2020-02-09 06:10:24.680648Z
mds.ld4465 crashed on host ld4465 at 2020-02-09 06:52:33.204855Z
mds.ld5506 crashed on host ld5506 at 2020-02-06 07:59:37.089007Z
SLOW_OPS 2 slow ops, oldest one blocked for 347755 sec, mon.ld5505 has
slow ops
However, every host that had a crash is up and running again.
Therefore I would prefer to remove these messages.
Can you please advise how to clean up these messages?
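I assume it is something along the lines of the crash module's archive
commands (untested here, and presumably only available where ceph crash
exists, i.e. Nautilus or newer):

  # list the recorded crashes, then archive them so the warning clears
  ceph crash ls
  ceph crash archive <crash-id>
  # or archive everything at once
  ceph crash archive-all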
THX
Hello All,
On one of our Ceph data nodes we see that all OSDs are at 90-100% disk
utilization, even though they are all SSD drives and the traffic is normal
compared to the other data nodes.
How can we debug this?
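We were planning to start roughly like this (the OSD id is only an example):

  # compare latency across all OSDs to spot outliers
  ceph osd perf
  # on the busy node: see which operations a specific OSD spends its time on
  ceph daemon osd.12 dump_historic_ops
  # raw device utilization and await on the node itself
  iostat -x 1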
Hi,
we sometimes lose access to our CephFS mount and get "permission denied"
if we try to cd into it. This apparently happens only on some of our HPC
CephFS client nodes (fs mounted via the kernel client) when they are busy
with computation and I/O.
When we then manually force-unmount the fs and remount it, everything works
again.
This is the dmesg output of the affected client node:
<https://pastebin.com/z5wxUgYS>
All HPC clients and ceph servers are running CentOS 7.7 with the same
kernel:
$ uname -a
Linux apollo-08.local 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4
23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
and all are running ceph version 14.2.7
$ ceph -v
ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus
(stable)
Maybe someone has an idea of what is going wrong here and how we can fix/avoid this.
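So far we only remount. I assume one could at least check whether the
affected client was evicted and blacklisted by the MDS, which would explain
the permission denied, roughly like this (untested):

  # any entries here mean a client was blacklisted after an eviction
  ceph osd blacklist ls
  # list the current client sessions on the active MDS and look for the node
  ceph tell mds.0 client ls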
Thanks
Dietmar
--
_________________________________________
D i e t m a r R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Institute of Bioinformatics
Email: dietmar.rieder(a)i-med.ac.at
Web: http://www.icbi.at
On Wed, Feb 12, 2020 at 6:08 PM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote:
>
> >
> >
> >>
> >> Say I think my cephfs is slow when I rsync to it, slower than it used
> >> to be. First of all, I do not get why it reads so much data. I assume
> >> the file attributes need to come from the mds server, so the rsync
> >> backup should mostly cause writes not?
> >>
> >
> >Are you running one or multiple MDS? I've seen cases where the
> >synchronization between the different MDS slow down rsync.
>
> One
>
> >The problem is that rsync creates and renames files a lot. When doing
> >this with small files it can be very heavy for the MDS.
> >
>
> The strange thing is that I did not have performance problems with Luminous;
> after upgrading to Nautilus and enabling snapshots on a different tree
> of the cephfs, rsync is now taking 10 hours more.
> Another possibility is degraded performance on the source,
> but that is impossible for me to verify.
> I have increased the mds_cache_memory_limit from 8GB to 16GB to see
> what that brings.
>
How many snapshots are there?
>
> >
> >> I think it started being slow, after enabling snapshots on the file
> >> system.
> >>
> >> - how can I determine if mds_cache_memory_limit = 8000000000 is still
> >> correct?
> >>
> >> - how can I test the mds performance from the command line, so I can
> >> experiment with cpu power configurations, and see if this brings a
> >> significant change?
We have been using RadosGW with Keystone integration for a couple of
years, to allow users of our OpenStack-based IaaS to create their own
credentials for our object store. This has caused us a fair amount of
performance headaches.
Last year, James Weaver (BBC) contributed a patch (PR #26095) that
changes the handling of S3 authentication when Keystone is used as a
backend for credentials. It was merged to master in March 2019. We run
Nautilus on our production clusters, which doesn't include the patch. A
few weeks ago, we decided to cherry-pick PR #26095 on top of Nautilus
(14.2.5/6/7) and deploy that in production.
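The cherry-pick itself is nothing special; roughly (a sketch with a
simplified branch/tag choice, and the PR's commits left as a placeholder):

  git clone https://github.com/ceph/ceph.git && cd ceph
  git checkout -b nautilus-pr26095 v14.2.7      # branch name is just an example
  git cherry-pick -x <commits from PR #26095>   # resolve conflicts, then build packages as usual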
So far we haven't noticed any issues. Load on our Keystone system has
decreased significantly, response times for small requests are now
consistently low, and we don't have to re-provision S3 credentials
locally anymore to fix performance emergencies. Thanks a lot!
Blog post with a few performance graphs:
https://cloudblog.switch.ch/2020/02/10/radosgw-keystone-integration-perform…
--
Simon.