September 2023 - ceph-users

Rebuilding data resiliency after adding new OSD's stuck for so long at 5%

by sharathvuthpala＠gmail.com

We have a user-provisioned instance( Bare Metal Installation) of OpenShift cluster running on version 4.12 and we are using OpenShift Data Foundation as the Storage System. Earlier we had 3 disks attached to the storage system and 3 OSDs were available in the cluster. Today, while adding additional disks to the storage cluster, we increased the number of disks from 3 to 9, that is 3 per node. The addition of storage capacity was successful, resulting in 6 new OSDs in the cluster. But, after this operation, we noticed that Rebuilding Data Resiliency is stuck at 5% and not moving forward. At the same time, ceph status shows 65% of objects are misplaced and PGs are not in active+clean state. Here is more information about the ceph cluster: sh-4.4$ ceph status cluster: id: 18bf836d-4937-4925-b964-7a026c1d548d health: HEALTH_OK services: mon: 3 daemons, quorum b,u,v (age 2w) mgr: a(active, since 7w) mds: 1/1 daemons up, 1 hot standby osd: 9 osds: 9 up (since 5h), 9 in (since 5h); 191 remapped pgs rgw: 1 daemon active (1 hosts, 1 zones) data: volumes: 1/1 healthy pools: 12 pools, 305 pgs objects: 2.69M objects, 2.9 TiB usage: 8.8 TiB used, 27 TiB / 36 TiB avail pgs: 4723077/8079717 objects misplaced (58.456%) 188 active+remapped+backfill_wait 114 active+clean 3 active+remapped+backfilling io: client: 679 KiB/s rd, 11 MiB/s wr, 13 op/s rd, 622 op/s wr recovery: 20 MiB/s, 89 keys/s, 22 objects/s sh-4.4$ ceph balancer status { "active": true, "last_optimize_duration": "0:00:00.000276", "last_optimize_started": "Tue Sep 12 17:36:03 2023", "mode": "upmap", "optimize_result": "Too many objects (0.581933 > 0.050000) are misplaced; try again later", "plans": [] } One more thing we observed is that the number of misplaced objects is decreasing and also there is a drop in the percentage. What might be the reason behind Rebuilding Data Resiliency is not moving forward? Any inputs would be appreciated. Thanks

8 months

3
5
0 0

Awful new dashboard in Reef

by Nicola Mori

Dear Ceph users, I just upgraded my cluster to Reef, and with the new version came also a revamped dashboard. Unfortunately the new dashboard is really awful to me: 1) it's no longer possible to see the status of the PGs: in the old dashboard it was very easy to see e.g. how many PGs were recovering, how many scrubbing etc. by clicking on the PG Status widget. Now the interface shows just how many are Ok and how many are working, without details, and I have to go to the command line to understand what's happening (not really comfortable on mobile) 2) The new timeline graphs do not work properly: changing the time frame sometimes produce empty graphs, 3) The instant values in Cluster utilization are refreshed so slowly that I cannot properly monitor the cluster behavior in real time Is it just me or maybe my impressions are shared by someone else? Is there anything that can be done to improve the situation? Thanks, Nicola

8 months, 1 week

4
10
0 0

ceph orchestator pulls strange images from docker.io

by Boris Behrens

Hi, I currently try to adopt our stage cluster, some hosts just pull strange images. root@0cc47a6df330:/var/lib/containers/storage/overlay-images# podman ps CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES a532c37ebe42 docker.io/ceph/daemon-base:latest-master-devel -n mgr.0cc47a6df3... 2 minutes ago Up 2 minutes ago ceph-03977a23-f00f-4bb0-b9a7-de57f40ba853-mgr-0cc47a6df330-fxrfyl root@0cc47a6df330:~# ceph orch ps NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID mgr.0cc47a6df14e.vqizdz 0cc47a6df14e.f00f.gridscale.dev *:9283 running (3m) 3m ago 3m 10.8M - 16.2.11 de4b0b384ad4 00b02cd82a1c mgr.0cc47a6df330.iijety 0cc47a6df330.f00f.gridscale.dev *:9283 running (5s) 2s ago 4s 10.5M - 17.0.0-7183-g54142666 75e3d7089cea 662c6baa097e mgr.0cc47aad8ce8 0cc47aad8ce8.f00f.gridscale.dev running (65m) 8m ago 60m 553M - 17.2.6 22cd8daf4d70 8145c63fdc44 Any idea what I need to do to change that? -- Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im groÃƒ¼en Saal.

8 months, 1 week

4
8
0 0

Questions about PG auto-scaling and node addition

by Christophe BAILLON

Hello, We have a cluster with 21 nodes, each having 12 x 18TB, and 2 NVMe for db/wal. We need to add more nodes. The last time we did this, the PGs remained at 1024, so the number of PGs per OSD decreased. Currently, we are at 43 PGs per OSD. Does auto-scaling work correctly in Ceph version 17.2.5? Should we increase the number of PGs before adding nodes? Should we keep PG auto-scaling active? If we disable auto-scaling, should we increase the number of PGs to reach 100 PGs per OSD? Considering that we use this cluster with a large EC pool (8+3). Thank you for your assistance.

8 months, 1 week

2
2
0 0

osd cannot get osdmap

by Nathan Gleason

Hello, We had a network hiccup with a Ceph cluster and it made several of our osds go out/down. After the network was fixed the osds remain down. We have restarted them in numerous ways and they won’t come up. The logs for the down osds just repeat this line over and over "tick checking mon for new map”. There are osds on the same host that are up so there is connectivity between the osds and mons. Any advice on where to look for a resolution is appreciated. Thanks, Nathan Cluster was built with cephadm Ceph Quincy - 17.2.6 Docker version 23.0.2, build 569dd73 Ubuntu 20.04.6 LTS cluster: id: aa39fa2a-1510-11ee-953a-bd804ec1ea33 health: HEALTH_ERR Failed to apply 1 service(s): nfs.secstorage 1 filesystem is degraded 1 MDSs report slow metadata IOs Module 'cephadm' has failed: Command '['rados', '-n', 'mgr.cphprodc1-11.uuuhug', '-k', '/var/lib/ceph/mgr/ceph-cphprodc1-11.uuuhug/keyring', '-p', '.nfs', '--namespace', 'secstorage', 'rm', 'grace']' timed out after 10 seconds 28 osds down Reduced data availability: 36 pgs stale 2 daemons have recently crashed 1 mgr modules have recently crashed 945514 slow ops, oldest one blocked for 66804 sec, daemons [mon.cphprodc1-10,mon.cphprodc1-11,mon.cphprodc1-13] have slow ops. services: mon: 4 daemons, quorum cphprodc1-10,cphprodc1-11,cphprodc1-12,cphprodc1-13 (age 2h) mgr: cphprodc1-11.uuuhug(active, since 23h), standbys: cphprodc1-10.upwvbg mds: 1/1 daemons up, 1 standby osd: 64 osds: 19 up (since 2d), 47 in (since 23h) data: volumes: 0/1 healthy, 1 recovering pools: 5 pools, 113 pgs objects: 151.91k objects, 592 GiB usage: 840 GiB used, 81 TiB / 82 TiB avail pgs: 65 active+clean 36 stale+active+clean 7 active+clean+scrubbing 5 active+clean+scrubbing+deep osd tree: ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF -1 111.78223 root default -5 27.94556 host cphprodc1-10 1 ssd 1.74660 osd.1 down 1.00000 1.00000 5 ssd 1.74660 osd.5 down 1.00000 1.00000 8 ssd 1.74660 osd.8 down 1.00000 1.00000 12 ssd 1.74660 osd.12 down 1.00000 1.00000 14 ssd 1.74660 osd.14 down 1.00000 1.00000 18 ssd 1.74660 osd.18 down 0 1.00000 22 ssd 1.74660 osd.22 down 0 1.00000 26 ssd 1.74660 osd.26 down 0 1.00000 30 ssd 1.74660 osd.30 down 0 1.00000 34 ssd 1.74660 osd.34 down 1.00000 1.00000 37 ssd 1.74660 osd.37 down 1.00000 1.00000 41 ssd 1.74660 osd.41 down 1.00000 1.00000 45 ssd 1.74660 osd.45 up 1.00000 1.00000 48 ssd 1.74660 osd.48 up 1.00000 1.00000 52 ssd 1.74660 osd.52 up 1.00000 1.00000 56 ssd 1.74660 osd.56 up 1.00000 1.00000 -7 27.94556 host cphprodc1-11 2 ssd 1.74660 osd.2 down 0 1.00000 6 ssd 1.74660 osd.6 down 1.00000 1.00000 10 ssd 1.74660 osd.10 down 1.00000 1.00000 16 ssd 1.74660 osd.16 down 1.00000 1.00000 20 ssd 1.74660 osd.20 down 0 1.00000 24 ssd 1.74660 osd.24 down 0 1.00000 28 ssd 1.74660 osd.28 down 0 1.00000 32 ssd 1.74660 osd.32 down 0 1.00000 36 ssd 1.74660 osd.36 down 1.00000 1.00000 40 ssd 1.74660 osd.40 down 1.00000 1.00000 44 ssd 1.74660 osd.44 down 1.00000 1.00000 50 ssd 1.74660 osd.50 up 1.00000 1.00000 54 ssd 1.74660 osd.54 up 1.00000 1.00000 58 ssd 1.74660 osd.58 up 1.00000 1.00000 60 ssd 1.74660 osd.60 up 1.00000 1.00000 62 ssd 1.74660 osd.62 up 1.00000 1.00000 -3 27.94556 host cphprodc1-12 0 ssd 1.74660 osd.0 down 1.00000 1.00000 4 ssd 1.74660 osd.4 down 1.00000 1.00000 7 ssd 1.74660 osd.7 down 1.00000 1.00000 11 ssd 1.74660 osd.11 down 1.00000 1.00000 15 ssd 1.74660 osd.15 down 1.00000 1.00000 19 ssd 1.74660 osd.19 down 0 1.00000 23 ssd 1.74660 osd.23 down 0 1.00000 27 ssd 1.74660 osd.27 down 0 1.00000 31 ssd 1.74660 osd.31 down 0 1.00000 35 ssd 1.74660 osd.35 down 1.00000 1.00000 38 ssd 1.74660 osd.38 down 1.00000 1.00000 42 ssd 1.74660 osd.42 down 1.00000 1.00000 46 ssd 1.74660 osd.46 up 1.00000 1.00000 49 ssd 1.74660 osd.49 up 1.00000 1.00000 53 ssd 1.74660 osd.53 up 1.00000 1.00000 57 ssd 1.74660 osd.57 up 1.00000 1.00000 -9 27.94556 host cphprodc1-13 3 ssd 1.74660 osd.3 down 1.00000 1.00000 9 ssd 1.74660 osd.9 down 1.00000 1.00000 13 ssd 1.74660 osd.13 down 1.00000 1.00000 17 ssd 1.74660 osd.17 down 1.00000 1.00000 21 ssd 1.74660 osd.21 down 0 1.00000 25 ssd 1.74660 osd.25 down 0 1.00000 29 ssd 1.74660 osd.29 down 0 1.00000 33 ssd 1.74660 osd.33 down 0 1.00000 39 ssd 1.74660 osd.39 down 1.00000 1.00000 43 ssd 1.74660 osd.43 down 1.00000 1.00000 47 ssd 1.74660 osd.47 up 1.00000 1.00000 51 ssd 1.74660 osd.51 up 1.00000 1.00000 55 ssd 1.74660 osd.55 up 1.00000 1.00000 59 ssd 1.74660 osd.59 up 1.00000 1.00000 61 ssd 1.74660 osd.61 up 1.00000 1.00000 63 ssd 1.74660 osd.63 up 1.00000 1.00000

8 months, 1 week

2
1
0 0

RGW multisite logs (data, md, bilog) not being trimmed automatically?

by Christian Rohmann

Hey ceph-users, I am running two (now) Quincy clusters doing RGW multi-site replication with only one actually being written to by clients. The other site is intended simply as a remote copy. On the primary cluster I am observing an ever growing (objects and bytes) "sitea.rgw.log" pool, not so on the remote "siteb.rgw.log" which is only 300MiB and around 15k objects with no growth. Metrics show that the growth of pool on primary is linear for at least 6 months, so not sudden spikes or anything. Also sync status appears to be totally happy. There are also no warnings in regards to large OMAPs or anything similar. I was under the impression that RGW will trim its three logs (md, bi, data) automatically and only keep data that has not yet been replicated by the other zonegroup members? The config option "ceph config get mgr rgw_sync_log_trim_interval" is set to 1200, so 20 Minutes. So I am wondering if there might be some inconsistency or how I can best analyze what the cause for the accumulation of log data is? There are older questions on the ML, such as [1], but there was not really a solution or root cause identified. I know there is manual trimming, but I'd rather want to analyze the current situation and figure out what the cause for the lack of auto-trimming is. * Do I need to go through all buckets and count logs and look at their timestamps? Which queries do make sense here? * Is there usually any logging of the log trimming activity that I should expect? Or that might indicate why trimming does not happen? Regards Christian [1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/WZCFOAMLWV…

8 months, 1 week

1
2
0 0

Re: Not able to find a standardized restoration procedure for subvolume snapshots.

by Lokendra Rathour

Hi Team, Facing a similar situation, Any help would be appreciated. Thanks once again for the support. -Lokendra On Tue, Sep 5, 2023 at 10:51 AM Kushagr Gupta <kushagrguptasps.mun(a)gmail.com> wrote: > *Ceph-version*: Quincy > *OS*: Centos 8 stream > > *Issue*: Not able to find a standardized restoration procedure for > subvolume snapshots. > > *Description:* > Hi team, > > We are currently working in a 3-node ceph cluster. > We are currently exploring the scheduled snapshot capability of the > ceph-mgr module. > To enable/configure scheduled snapshots, we followed the following link: > > https://docs.ceph.com/en/quincy/cephfs/snap-schedule/ > > The scheduled snapshots are working as expected. But we are unable to find > any standardized restoration procedure for the same. > > We have found the following link( not official documentation): > https://www.suse.com/support/kb/doc/?id=000019627 > > We have also found a link of cloning a new subvolume from snapshots: > https://docs.ceph.com/en/reef/cephfs/fs-volumes/ > (Section: Cloning Snapshots) > > Is there a standard procedure to restore from a snapshot. > By this I mean, is there some kind of command link maybe > ceph fs subvolume snapshot restore <snapshot-name> > > Or any other procedure please let us know. > > Thanks and Regards, > Kushagra Gupta > -- ~ Lokendra skype: lokendrarathour

8 months, 1 week

1
0
0 0

Ceph services failing to start after OS upgrade

by hansen.ross＠live.com.au

Hi There, I have a ceph cluster running on my proxmox system and it all seemed to upgrade successfully however after the reboot my ceph-mon and my ceph-osd services are failing to start or are crashing by the looks of it. ``` ceph version 17.2.6 (810db68029296377607028a6c6da1ec06f5a2b27) quincy (stable) 1: /lib/x86_64-linux-gnu/libc.so.6(+0x3bfd0) [0x7f10aba5afd0] 2: gf_init_hard() 3: gf_init_easy() 4: galois_init_default_field() 5: jerasure_init() 6: __erasure_code_init() 7: (ceph::ErasureCodePluginRegistry::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::ErasureCodePlugin**, std::ostream*)+0x2b5) [0x55b04c32c605] 8: (ceph::ErasureCodePluginRegistry::preload(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::ostream*)+0x9f) [0x55b04c32cbaf] 9: (global_init_preload_erasure_code(ceph::common::CephContext const*)+0x7c2) [0x55b04bdd9f92] 10: main() 11: /lib/x86_64-linux-gnu/libc.so.6(+0x271ca) [0x7f10aba461ca] 12: __libc_start_main() 13: _start() NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. ``` I am still quite new to Ceph and would like advice on how to troubleshoot this and get the services working again. Regards Ross

8 months, 1 week

3
2
0 0

cannot create new OSDs - ceph version 17.2.6 (810db68029296377607028a6c6da1ec06f5a2b27) quincy (stable)

by Konold, Martin

Hi, I want to create a new OSD on a 4TB Samsung MZ1L23T8HBLA-00A07 enterprise nvme device in a hyper-converged proxmox 8 environment. Creating the OSD works but it cannot be initialized and therefore not started. In the log I see an entry about a failed assert. ./src/os/bluestore/fastbmap_allocator_impl.cc: 405: FAILED ceph_assert((aligned_extent.length % l0_granularity) == 0) Is this the culprit? In addition at the end of the logfile there is a failed mount and a failed osd init mentioned. 2023-09-11T16:30:04.708+0200 7f99aa28f3c0 -1 bluefs _check_allocations OP_FILE_UPDATE_INC invalid extent 1: 0x140000~10000: duplicate reference, ino 30 2023-09-11T16:30:04.708+0200 7f99aa28f3c0 -1 bluefs mount failed to replay log: (14) Bad address 2023-09-11T16:30:04.708+0200 7f99aa28f3c0 20 bluefs _stop_alloc 2023-09-11T16:30:04.708+0200 7f99aa28f3c0 -1 bluestore(/var/lib/ceph/osd/ceph-43) _open_bluefs failed bluefs mount: (14) Bad address 2023-09-11T16:30:04.708+0200 7f99aa28f3c0 10 bluefs maybe_verify_layout no memorized_layout in bluefs superblock 2023-09-11T16:30:04.708+0200 7f99aa28f3c0 -1 bluestore(/var/lib/ceph/osd/ceph-43) _open_db failed to prepare db environment: 2023-09-11T16:30:04.708+0200 7f99aa28f3c0 1 bdev(0x5565c261fc00 /var/lib/ceph/osd/ceph-43/block) close 2023-09-11T16:30:04.940+0200 7f99aa28f3c0 -1 osd.43 0 OSD:init: unable to mount object store 2023-09-11T16:30:04.940+0200 7f99aa28f3c0 -1 ** ERROR: osd init failed: (5) Input/output error I verified that the hardware of the new nvme is working fine. -- Regards, ppa. Martin Konold -- Viele Grüße ppa. Martin Konold -- Martin Konold - Prokurist, CTO KONSEC GmbH -⁠ make things real Amtsgericht Stuttgart, HRB 23690 Geschäftsführer: Andreas Mack Im Köller 3, 70794 Filderstadt, Germany

8 months, 1 week

2
3
0 0

MDS crash after Disaster Recovery

by Sasha BALLET

Hi, I'm struggling with a Ceph cluster that doesn't want to end recovery. I suspect there is multiple issues at the same time. So let's start with the most obvious, the MDS daemons are crashing and blocking me from mounting my CephFS. The cluster has been deployed on Debian Buster and later upgraded to Bullseye. There is 4 nodes of 64 OSDs of 12TB. I can't understand which Ceph version it is on because multiple services have different versions (another clue about what could be wrong). There is some 16.2.4, 16.2.5, 16.2.13 and even some 17.2.6 and I think I saw some 15.x.x on the OSDs. I have this log from the service: > journalctl -xe -u ceph-938952d4-7775-11eb-9f42-bc97e19a216a(a)mds.storage.cbi-storage-02.oaelhv.service [...] /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.4/rpm/el8/BUIL> ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f850b1ed59c] 2: /usr/lib64/ceph/libceph-common.so.2(+0x2767b6) [0x7f850b1ed7b6] 3: (MDCache::journal_cow_dentry(MutationImpl*, EMetaBlob*, CDentry*, snapid_t, CInode**, CDentry::linkage_t*)+0x1215) [0x55768a6ad6b5] 4: (MDCache::journal_dirty_inode(MutationImpl*, EMetaBlob*, CInode*, snapid_t)+0x105) [0x55768a6ad965] 5: (Locker::check_inode_max_size(CInode*, bool, unsigned long, unsigned long, utime_t)+0xc39) [0x55768a78fb09] 6: (RecoveryQueue::_recovered(CInode*, int, unsigned long, utime_t)+0x92d) [0x55768a76377d] 7: (MDSContext::complete(int)+0x56) [0x55768a8c0d46] 8: (MDSIOContextBase::complete(int)+0xa3) [0x55768a8c1073] 9: (Filer::C_Probe::finish(int)+0xb5) [0x55768a974af5] 10: (Context::complete(int)+0xd) [0x55768a5b4b6d] 11: (Finisher::finisher_thread_entry()+0x1a5) [0x7f850b28c9d5] 12: /lib64/libpthread.so.0(+0x814a) [0x7f8509f8f14a] 13: clone() debug 0> 2023-09-11T12:55:25.765+0000 7f84fc580700 -1 *** Caught signal (Aborted) ** in thread 7f84fc580700 thread_name:MR_Finisher ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable) 1: /lib64/libpthread.so.0(+0x12b20) [0x7f8509f99b20] 2: gsignal() 3: abort() 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f850b1ed5ed] 5: /usr/lib64/ceph/libceph-common.so.2(+0x2767b6) [0x7f850b1ed7b6] 6: (MDCache::journal_cow_dentry(MutationImpl*, EMetaBlob*, CDentry*, snapid_t, CInode**, CDentry::linkage_t*)+0x1215) [0x55768a6ad6b5] 7: (MDCache::journal_dirty_inode(MutationImpl*, EMetaBlob*, CInode*, snapid_t)+0x105) [0x55768a6ad965] 8: (Locker::check_inode_max_size(CInode*, bool, unsigned long, unsigned long, utime_t)+0xc39) [0x55768a78fb09] 9: (RecoveryQueue::_recovered(CInode*, int, unsigned long, utime_t)+0x92d) [0x55768a76377d] 10: (MDSContext::complete(int)+0x56) [0x55768a8c0d46] 11: (MDSIOContextBase::complete(int)+0xa3) [0x55768a8c1073] 12: (Filer::C_Probe::finish(int)+0xb5) [0x55768a974af5] 13: (Context::complete(int)+0xd) [0x55768a5b4b6d] 14: (Finisher::finisher_thread_entry()+0x1a5) [0x7f850b28c9d5] 15: /lib64/libpthread.so.0(+0x814a) [0x7f8509f8f14a] 16: clone() NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_mirror 0/ 5 rbd_replay 0/ 5 rbd_pwl 0/ 5 journaler 0/ 5 objectcacher 0/ 5 immutable_obj_cache 0/ 5 client 1/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 0 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 1 reserver 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 rgw_sync 1/10 civetweb 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle 0/ 0 refs 1/ 5 compressor 1/ 5 bluestore 1/ 5 bluefs 1/ 3 bdev 1/ 5 kstore 4/ 5 rocksdb 4/ 5 leveldb 4/ 5 memdb 1/ 5 fuse 1/ 5 mgr 1/ 5 mgrc 1/ 5 dpdk 1/ 5 eventtrace 1/ 5 prioritycache 0/ 5 test 0/ 5 cephfs_mirror 0/ 5 cephsqlite -2/-2 (syslog threshold) 99/99 (stderr threshold) --- pthread ID / name mapping for recent threads --- 140209120859904 / 140209129252608 / md_submit 140209137645312 / 140209146038016 / MR_Finisher 140209162823424 / PQ_Finisher 140209213179648 / ceph-mds 140209229965056 / safe_timer 140209246750464 / ms_dispatch 140209263535872 / io_context_pool 140209280321280 / admin_socket 140209288713984 / msgr-worker-2 140209297106688 / msgr-worker-1 140209305499392 / msgr-worker-0 140209549580160 / ceph-mds max_recent 10000 max_new 10000 log_file /var/lib/ceph/crash/2023-09-11T12:55:25.769044Z_5229ea26-9bea-4c49-960a-4501b227e545/log --- end dump of recent events --- ceph-938952d4-7775-11eb-9f42-bc97e19a216a(a)mds.storage.cbi-storage-02.oaelhv.service: Main process exited, code=exited, status=134/n/a I tried the Advanced: Metadata Repair Tools https://docs.ceph.com/en/nautilus/cephfs/disaster-recovery-experts/ <https://docs.ceph.com/en/nautilus/cephfs/disaster-recovery-experts/> It goes, "recovery successfull" in the logs, then goes "has been put in standby", then crashes a few seconds or minutes later. I don't know what to do now without breaking everything Thank you for your help Let me know if you need any other informations Have a nice day, Sasha

8 months, 1 week

2
1
0 0

2024

2023

2022

2021

2020

2019

ceph-users September 2023