Dear all,
I'm having a hard time troubleshooting a file-system failure on my 3-node cluster
(deployed with cephadm + docker). After moving some files between folders, the cluster
became laggy, the Metadata Servers started failing, and they got stuck in the rejoin
state. Of course I have already tried restarting the cluster multiple times.
The MDS units are now in a failed state because of too many restarts, and the file system
is degraded and cannot be mounted since no MDS is up. I think the data pool is fine,
because I can still retrieve files with rados.
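By "retrieve files with rados" I mean roughly the following (pool name taken from the
fs status output below, placeholder object id), and both commands complete without errors:
  rados -p cephfs.starfs.data ls | head
  rados -p cephfs.starfs.data get <object-id> /tmp/object.out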
I can trigger the standby MDS to become the active one with
ceph orch daemon rm mds.<mds-in-error-id>, or deploy a new one, but the newly active MDS
ends up in the error state again.
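To make this concrete, the commands I have been using are roughly these (daemon name
taken from the health output below; the placement count just matches my 3 nodes):
  ceph orch ps | grep mds
  ceph orch daemon rm mds.starfs.polposition.njarir
  ceph orch apply mds starfs --placement=3
  ceph fs status starfs
The standby then takes over rank 0, gets stuck in rejoin, and after a while the daemon
ends up in error state again.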
I don't find the MDS logs particularly helpful, but I have attached one for someone more
expert than me to look at.
I am hesitant to follow the guide at
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/ because of the warnings
it carries and because cephfs-journal-tool is poorly documented.
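If it does come to that, my reading of the guide is that the first, non-destructive steps
would be something like the following (syntax as I understand it from the docs, starting
with a journal backup before anything destructive), but I would really appreciate a
confirmation before touching the journal:
  cephfs-journal-tool --rank=starfs:0 journal export backup.bin
  cephfs-journal-tool --rank=starfs:0 journal inspect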
The following output might be useful:
seppia:~# ceph fs status
starfs - 0 clients
======
RANK      STATE                MDS                  ACTIVITY   DNS   INOS   DIRS   CAPS
 0     rejoin(laggy)   starfs.polposition.njarir              539    25     17     0
       POOL             TYPE      USED    AVAIL
cephfs.starfs.meta    metadata    9900M   1027G
cephfs.starfs.data      data      12.1T   1027G
MDS version: ceph version 16.2.0 (0c2054e95bcd9b30fdd908a79ac1d8bbc3394442) pacific (stable)
seppia:~# ceph health detail
HEALTH_WARN 2 failed cephadm daemon(s); 1 filesystem is degraded; insufficient standby MDS daemons available; 7 pgs not deep-scrubbed in time
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon mds.starfs.polposition.njarir on polposition.starfleet.sns.it is in error state
    daemon mds.starfs.seppia.wdwrho on seppia.starfleet.sns.it is in error state
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs starfs is degraded
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more
[WRN] PG_NOT_DEEP_SCRUBBED: 7 pgs not deep-scrubbed in time
    pg 3.a8 not deep-scrubbed since 2021-04-20T20:07:48.346677+0000
    pg 3.a2 not deep-scrubbed since 2021-04-21T08:10:55.220263+0000
    pg 3.7 not deep-scrubbed since 2021-04-21T07:24:20.073569+0000
    pg 2.0 not deep-scrubbed since 2021-04-21T05:01:18.439456+0000
    pg 9.1a not deep-scrubbed since 2021-04-21T05:18:20.171151+0000
    pg 3.1cb not deep-scrubbed since 2021-04-20T21:54:38.251349+0000
    pg 3.1ef not deep-scrubbed since 2021-04-21T07:07:18.842132+0000
Thanks for any suggestions,
Alessandro Piazza