MDS corrupt (also RADOS-level copy?) - ceph-users

31 May 2023

Dear All,

we are trying to recover from what we suspect is a corrupt MDS :(
and have been following the guide here:

<https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/>

Symptoms: MDS SSD pool (2TB) filled completely over the weekend, 
normally uses less than 400GB, resulting in MDS crash.

We added 4 x extra SSD to increase pool capacity to 3.5TB, however MDS 
did not recover

# ceph fs status
cephfs2 - 0 clients
=======
RANK   STATE     MDS     ACTIVITY   DNS    INOS   DIRS   CAPS
  0     failed
  1    resolve  wilma-s3            8065   8063   8047      0
  2    resolve  wilma-s2             901k   802k  34.4k     0
       POOL         TYPE     USED  AVAIL
     mds_ssd      metadata  2296G  3566G
primary_fs_data    data       0   3566G
     ec82pool       data    2168T  3557T
STANDBY MDS
   wilma-s1
   wilma-s4

setting "ceph mds repaired 0" causes rank 0 to restart, and then 
immediately fail.

Following the disaster-recovery-experts guide, the first step we did was 
to export the MDS journals, e.g:

# cephfs-journal-tool --rank=cephfs2:0 journal export  /root/backup.bin.0
journal is 9744716714163~658103700
wrote 658103700 bytes at offset 9744716714163 to /root/backup.bin.0

so far so good, however when we try to backup the final MDS the process 
consumes all available RAM (470GB) and needs to be killed after 14 minutes.

# cephfs-journal-tool --rank=cephfs2:2 journal export  /root/backup.bin.2

similarly, "recover_dentries summary" consumes all RAM when applied to MDS 2
# cephfs-journal-tool --rank=cephfs2:2 event recover_dentries summary

We successfully ran "cephfs-journal-tool --rank=cephfs2:0 event 
recover_dentries summary" and "cephfs-journal-tool --rank=cephfs2:1 
event recover_dentries summary"

at this point, we tried to follow the instructions and make a RADOS 
level copy of the journal data, however the link in the docs doesn't 
explain how to do this and just points to 
<http://tracker.ceph.com/issues/9902>

At this point we are tempted to reset the journal on MDS 2, but wanted 
to get a feeling from others about how dangerous this could be?

We have a backup, but as there is 1.8PB of data, it's going to take a 
few weeks to restore....

any ideas gratefully received.

Jake

-- 
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.