Hi,
Perhaps this is a known issue and I was simply too dumb to find it, but
we are having problems with our CephFS metadata pool filling up overnight.
Our cluster has a small SSD pool of around 15TB which hosts our CephFS
metadata pool. Usually, that's more than enough. The normal size of the
pool ranges between 200 and 800GiB (which is quite a lot of fluctuation
already). Yesterday, the pool suddenly filled up entirely, and the only
way to fix it was to add more capacity. I increased the pool size to
18TB by adding more SSDs, which resolved the problem. After a
couple of hours of reshuffling, the pool size finally went back to 230GiB.
But then we had another fill-up tonight to 7.6TiB. Luckily, I had
adjusted the weights so that not all disks could fill up entirely like
last time, so it ended there.
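For anyone who wants to watch for the same symptom, a rough sketch of
the kind of commands involved (the OSD id is a placeholder):

  ceph df detail                  # pool usage, including the CephFS metadata pool
  ceph osd df tree                # per-OSD fill level, to spot single SSDs running full
  ceph osd reweight <osd-id> 0.9  # temporarily lower the weight of an OSD that is filling up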
I wasn't really able to identify the problem yesterday, but under the
more controlled conditions today, I was able to check the MDS logs at
debug_mds=10, and to me it looks like the problem is caused by snapshot
trimming. The logs contain a lot of snapshot-related messages for paths
that haven't been touched in a long time. The messages all look
something like this:
May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200
7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first cap,
joining realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2
snaps={185f=snap(185f 0x10000000000 'monthly_20221201'
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941
0x10000000000 ...
May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200
7f0e6a6ca700 10 mds.0.cache | |______ 3 rep [dir
0x100000218fe.101111101* /storage/REDACTED/| ptrwaiter=0 request=0
child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0
tempexporting=0 0x5607759d9600]
May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200
7f0e6a6ca700 10 mds.0.cache | | |____ 4 rep [dir
0x100000ff904.100111101010* /storage/REDACTED/| ptrwaiter=0 request=0
child=0 frozen=0 subtree=1 importing=0 replicated=0 waiter=0 authpin=0
tempexporting=0 0x56034ed25200]
May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200
7f0e6becd700 10 mds.0.server set_trace_dist snaprealm
snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2
snaps={185f=snap(185f 0x10000000000 'monthly_20221201'
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941
0x10000000000 'monthly_20230201'
2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x10000000000
'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24
0x10000000000 'monthly_20230401' ...) len=384
May 31 09:25:36 deltaweb055 ceph-mds[3268481]:
2023-05-31T09:25:36.076+0200 7f0e6becd700 10
mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving realm
snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2
snaps={185f=snap(185f 0x10000000000 'monthly_20221201'
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000
'monthly_20230101' ...)
The daily_*, monthly_*, etc. names are those of our regular snapshots.
I posted a larger log file snippet using ceph-post-file with the ID:
da0eb93d-f340-4457-8a3f-434e8ef37d36
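In case anyone wants to reproduce the log capture, the debug level was
raised roughly like this (sketch):

  ceph config set mds debug_mds 10   # verbose MDS logging while the problem is happening
  ceph config rm mds debug_mds       # reset to the default afterwards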
Is it possible that the MDS are trimming old snapshots without taking
care not to fill up the entire metadata pool?
Cheers
Janek
Hi all,
I wanted to call attention to some RGW issues that we've observed on a
Pacific cluster over the past several weeks. The problems relate to versioned
buckets and index entries that can be left behind after transactions complete
abnormally. The scenario is multi-faceted and we're still investigating some of
the details, but I wanted to provide a big-picture summary of what we've found
so far. It looks like most of these issues should be reproducible on versions
before and after Pacific as well. I'll enumerate the individual issues below:
1. PUT requests during reshard of versioned bucket fail with 404 and leave
behind dark data
Tracker: https://tracker.ceph.com/issues/61359
2. When bucket index ops are cancelled, they can leave behind zombie index entries
   The fix for this one was merged a few months ago and did make the v16.2.13
   release, but in our case we had billions of extra index entries by the time
   we had upgraded to the patched version.
Tracker: https://tracker.ceph.com/issues/58673
3. Issuing a delete for a key that already has a delete marker as the current
version leaves behind index entries and OLH objects
Note that the tracker's original description describes the problem a bit
differently, but I've clarified the nature of the issue in a comment.
Tracker: https://tracker.ceph.com/issues/59663
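For anyone wanting to check whether they are affected, a rough way to spot
leftover entries is to compare the bucket stats with the raw index (a sketch
only; the bucket name is a placeholder, and dumping the index of a very large
bucket produces a lot of output):

  radosgw-admin bucket stats --bucket=<bucket> | grep num_objects   # what the bucket thinks it holds
  radosgw-admin bi list --bucket=<bucket> | grep -c '"idx":'        # rough count of raw index entries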
The extra index entries and OLH objects that are left behind by these sorts
of issues are obviously annoying because they unnecessarily consume space,
but we've found that they can also cause severe performance degradation for
bucket listings, lifecycle processing, and other ops, indirectly through
higher OSD latencies.
The reason for the performance impact is that bucket listing calls must
repeatedly perform additional OSD ops until they find the requisite number
of entries to return. The OSD cls method for bucket listing also does its own
internal iteration for the same purpose. Since these entries are invalid, they
are skipped. In the case that we observed, where some of our bucket indexes were
filled with a sea of contiguous leftover entries, the process of continually
iterating over and skipping invalid entries caused enormous read amplification.
I believe that the following tracker is describing symptoms that are related to
the same issue: https://tracker.ceph.com/issues/59164.
Note that this can also cause LC processing to repeatedly fail in cases where
there are enough contiguous invalid entries, since the OSD cls code eventually
gives up and returns an error that isn't handled.
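As a quick check on the lifecycle side, the per-bucket LC status can be
listed with (sketch):

  radosgw-admin lc list    # buckets that never reach a completed status may be hitting this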
The severity of these issues likely varies greatly based upon client behavior.
If anyone has experienced similar problems, we'd love to hear about the nature
of how they've manifested for you so that we can be more confident that we've
plugged all of the holes.
Thanks,
Cory Snyder
11:11 Systems
Dear Community,
I would like to collect your feedback on this issue. This is a followup
from a discussion that started in the RGW refactoring meeting on 31-May-23
(thanks @Krunal Chheda <kchheda3(a)bloomberg.net> for bringing up this
topic!).
Currently persistent notifications are retried indefinitely.
The only limiting mechanism that exists is that all notifications to a
specific topic are stored in one RADOS object (of size 128MB).
Assuming notifications are ~1KB at most, this would give us at least 128K
notifications that can wait in the queue.
When the queue fills up (e.g. the kafka broker is down for 20 minutes while
we are sending ~100 notifications per second), we start sending "slow down"
replies to the client, and in that case the S3 operation is not performed.
This means that, for example, an outage of the kafka system would
eventually cause an outage of our service. Note that this may also be a
result of a misconfiguration of the kafka broker, or decommissioning of a
broker.
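To put rough numbers on it: at ~1KB per notification the 128MB queue object
holds on the order of 128K entries, so at ~100 notifications per second it
fills up in roughly 128,000 / 100 ≈ 1,300 seconds, i.e. a bit over 20 minutes
of broker downtime before clients start seeing "slow down" replies.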
To avoid that, we propose several options:
* use a FIFO instead of a queue. This would allow us to hold more than 128K
messages and so survive longer broker outages and higher message rates;
there should probably still be a limit set on the size of the FIFO
* define a maximum number of retries allowed for a notification
* define a maximum time a notification may stay in the queue before it is
removed
We should probably start by defining these limits as topic attributes,
reflecting our delivery guarantees for that specific destination.
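As a purely illustrative sketch of what that could look like (push-endpoint
and persistent are existing topic attributes; max-retries and
max-time-in-queue are only the proposed, currently non-existent ones, and
all endpoint values are placeholders):

  aws --endpoint-url http://<rgw-host>:8000 sns create-topic --name my-topic \
      --attributes '{"push-endpoint": "kafka://<broker>:9092", "persistent": "true", "max-retries": "5", "max-time-in-queue": "3600"}'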
Will try to capture the results of the discussion in this tracker:
https://tracker.ceph.com/issues/61532
Thanks,
Yuval
Hi,
we are currently running a ceph fs cluster at the following version:
MDS version: ceph version 16.2.10
(45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
The cluster is composed of 7 active MDSs and 1 standby MDS:
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 active icadmin012 Reqs: 73 /s 1938k 1880k 85.3k 92.8k
1 active icadmin008 Reqs: 206 /s 2375k 2375k 7081 171k
2 active icadmin007 Reqs: 91 /s 5709k 5256k 149k 299k
3 active icadmin014 Reqs: 93 /s 679k 664k 40.1k 216k
4 active icadmin013 Reqs: 86 /s 3585k 3569k 12.7k 197k
5 active icadmin011 Reqs: 72 /s 225k 221k 8611 164k
6 active icadmin015 Reqs: 87 /s 1682k 1610k 27.9k 274k
POOL TYPE USED AVAIL
cephfs_metadata metadata 8552G 22.3T
cephfs_data data 226T 22.3T
STANDBY MDS
icadmin006
When I restart one of the active MDSs, the standby MDS becomes active and
its state becomes "replay". So far, so good!
However, only one of the other "active" MDSs seems to remain truly active;
request activity drops to zero on all the others:
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 active icadmin012 Reqs: 0 /s 1938k 1881k 85.3k 9720
1 active icadmin008 Reqs: 0 /s 2375k 2375k 7080 2505
2 active icadmin007 Reqs: 2 /s 5709k 5256k 149k 26.5k
3 active icadmin014 Reqs: 0 /s 679k 664k 40.1k 3259
4 replay icadmin006 801k 801k 1279 0
5 active icadmin011 Reqs: 0 /s 225k 221k 8611 9241
6 active icadmin015 Reqs: 0 /s 1682k 1610k 27.9k 34.8k
POOL TYPE USED AVAIL
cephfs_metadata metadata 8539G 22.8T
cephfs_data data 225T 22.8T
STANDBY MDS
icadmin013
In effect, the cluster becomes almost unavailable until the newly promoted
MDS finishes rejoining the cluster.
Obviously, this defeats the purpose of having 7 MDSs.
Is this the expected behavior?
If not, what configuration items should I check to go back to "normal"
operations?
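For what it's worth, the only related knob I'm aware of is standby-replay,
which I assume would be enabled with something like (sketch, assuming the
filesystem is called "cephfs"):

  ceph fs set cephfs allow_standby_replay true

but I'm not sure whether that would do anything about the other ranks
stalling while the new MDS replays.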
Best,
Emmanuel
I'm in the process of exploring whether it is worthwhile to add RadosGW to
our existing ceph cluster. We've had a few internal requests for
exposing the S3 API for some of our business units; right now we just
use the ceph cluster for VM disk image storage via RBD.
Everything looks pretty straightforward until we hit multitenancy. The
page on multi-tenancy doesn't dive into permission delegation:
https://docs.ceph.com/en/quincy/radosgw/multitenancy/
The end goal I want is to be able to create a single user per tenant
(Business Unit) which will act as their 'administrator', where they can
then do basically whatever they want under their tenant sandbox (though
I don't think we need more advanced cases like creations of roles or
policies, just create/delete their own users, buckets, objects). I was
hopeful this would just work, so I asked on the ceph IRC channel on
OFTC and was told that once I grant a user caps="users=*", they would be
allowed to create users *outside* of their own tenant using the RGW
Admin Ops API, and that I should explore IAM roles.
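To make the scenario concrete, this is roughly the kind of setup I have in
mind (names are made up):

  # create the per-tenant 'administrator' inside its own tenant
  radosgw-admin user create --tenant bu1 --uid bu1-admin --display-name "BU1 admin"
  # grant admin caps; as I understand it, this is what also lets them create users outside the tenant
  radosgw-admin caps add --uid 'bu1$bu1-admin' --caps "users=*;buckets=*"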
I think it would make sense to add a feature, such as a flag that can be
set on a user, to ensure they stay in their "sandbox". I'd assume this
is probably a common use-case.
Anyhow, if it's possible to do today using IAM roles/policies, then
great; unfortunately this is my first time looking at this stuff and
some things are not immediately obvious.
I saw this online about AWS itself and creating a permissions boundary,
but that's for allowing creation of roles within a boundary:
https://www.qloudx.com/delegate-aws-iam-user-and-role-creation-without-givi…
I'm not sure which "Action" is associated with the RGW Admin Ops API's
create-user call for applying a boundary so that the user can only create
users under the same tenant name.
https://docs.ceph.com/en/quincy/radosgw/adminops/#create-user
Any guidance on this would be extremely helpful.
Thanks!
-Brad
Hi,
I'm in the process of upgrading my cluster from 17.2.5 to 17.2.6, but the
following problem already existed when I was still on 17.2.5 everywhere.
I had a major issue in my cluster which I was able to solve with a lot of
your help and even more trial and error. Right now it seems that most of
it is fixed, but I can't rule out that there's still some hidden problem.
The issue I'm asking about here started during that repair.
When I want to orchestrate the cluster, it logs the command but
doesn't do anything, no matter whether I use the Ceph dashboard or "ceph
orch" in a "cephadm shell". I don't get any error message when I try to
deploy new services, redeploy them, etc.; the log only says "scheduled"
and that's it. The same happens when I change placement rules. Usually I
use tags, but since they don't work anymore either, I tried explicit
hosts and unmanaged. No success. The only way I can actually start and
stop containers is via systemctl on the host itself.
When I run "ceph orch ls" or "ceph orch ps" I see services I deployed
for testing being deleted (for weeks now). Ans especially a lot of old
MDS are listed as "error" or "starting". The list doesn't match reality
at all because I had to start them by hand.
I tried "ceph mgr fail" and even a complete shutdown of the whole
cluster with all nodes including all mgs, mds even osd - everything
during a maintenance window. Didn't change anything.
Could you help me? To be honest, I'm still rather new to Ceph, and since I
didn't find anything in the logs that caught my eye, I would be thankful
for hints on how to debug this.
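In case it matters, I assume the way to get more detail out of the cephadm
module would be something like this (sketch; please correct me if there is
a better approach):

  ceph config set mgr mgr/cephadm/log_to_cluster_level debug
  ceph -W cephadm --watch-debug     # watch cephadm activity while re-running a 'ceph orch' command
  ceph log last cephadm             # show recent cephadm events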
Cheers,
Thomas
--
http://www.widhalm.or.at
GnuPG : 6265BAE6 , A84CB603
Threema: H7AV7D33
Telegram, Signal: widhalmt(a)widhalm.or.at
Dear Ceph folks,
Recently one of our clients approached us with a request for per-user encryption, i.e. using an individual encryption key for each user when encrypting their files and objects.
Does anyone know (or have experience with) how to do this with CephFS and Ceph RGW?
Any suggestions or comments are highly appreciated,
best regards,
Samuel
huxiaoyu(a)horebdata.cn
Details of this release are summarized here:
https://tracker.ceph.com/issues/61515#note-1
Release Notes - TBD
Seeking approvals/reviews for:
rados - Neha, Radek, Travis, Ernesto, Adam King (we still have to
merge https://github.com/ceph/ceph/pull/51788 for
the core)
rgw - Casey
fs - Venky
orch - Adam King
rbd - Ilya
krbd - Ilya
upgrade/octopus-x - deprecated
upgrade/pacific-x - known issues, Ilya, Laura?
upgrade/reef-p2p - N/A
clients upgrades - not run yet
powercycle - Brad
ceph-volume - in progress
Please reply to this email with approval and/or trackers of known
issues/PRs to address them.
gibba upgrade was done and will need to be done again this week.
LRC upgrade TBD
TIA
Hi Team,
I'm writing to bring to your attention an issue we have encountered with the "mtime" (modification time) behavior for directories in the Ceph filesystem.
We have noticed that when the mtime of a directory (let's say dir1) is explicitly changed in CephFS, subsequent additions of files or directories within
'dir1' fail to update the directory's mtime as expected.
This behavior appears to be specific to CephFS; we have reproduced the issue on both Quincy and Pacific. The same steps work as expected on ext4, among other filesystems.
Reproduction steps:
1. Create a directory - mkdir dir1
2. Modify mtime using the touch command - touch dir1
3. Create a file or directory inside of 'dir1' - mkdir dir1/dir2
Expected result:
mtime for dir1 should change to the time the file or directory was created in step 3
Actual result:
there was no change to the mtime for 'dir1'
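A minimal shell sketch of the reproduction (paths are illustrative):

  mkdir dir1
  stat -c '%y' dir1    # note the initial mtime
  touch dir1           # explicitly update the directory's mtime
  mkdir dir1/dir2
  stat -c '%y' dir1    # on CephFS the mtime is unchanged; on ext4 it reflects the mkdir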
Note: for more detail, please see the attached logs.
Our questions are:
1. Is this expected behavior for CephFS?
2. If so, can you explain why the directory's behavior is inconsistent depending on whether its mtime has previously been manually updated?
Best Regards,
Sandip Divekar
Component QA Lead SDET.