Forgot to say: as for your corrupt rank 0, you should check the logs
with a higher debug level. It looks like you were less lucky than we were.
Your journal position may be incorrect. This could be fixed by editing
the journal header. You might also try to tell your MDS to skip corrupt
entries. None of these operations are safe, though.
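To make that concrete, below is a rough sketch of the commands I have
in mind. This is untested against your cluster: the filesystem name
cephfs2 and rank 0 come from your "ceph fs status" output, and
<new_pos> / <start>..<end> are placeholders you would have to work out
from your own journal header and backup, not values I can give you.

```shell
# Raise the MDS debug level while diagnosing (revert it afterwards)
ceph config set mds debug_mds 20

# Inspect the journal header for rank 0 and compare write_pos /
# expire_pos / trimmed_pos with what "journal export" reported
cephfs-journal-tool --rank=cephfs2:0 header get

# DANGEROUS: rewrite a header field, e.g. move expire_pos past the
# damaged region. <new_pos> is a placeholder -- only use a value you
# have verified, and take a RADOS-level backup of the journal first.
cephfs-journal-tool --rank=cephfs2:0 header set expire_pos <new_pos>

# Alternatively (also destructive): splice the corrupt events out of
# the journal instead of skipping them at replay time.
cephfs-journal-tool --rank=cephfs2:0 event splice --range=<start>..<end> summary
```

Again: back up the journal objects at the RADOS level before touching
the header; none of these operations are reversible.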
On 31/05/2023 16:41, Janek Bevendorff wrote:
> Hi Jake,
>
> Very interesting. This sounds very much like what we have been
> experiencing the last two days. We also had a sudden fill-up of the
> metadata pool, which repeated last night. See my question here:
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7U27L27FHHP…
>
> I also noticed that I couldn't dump the current journal using the
> cephfs-journal-tool, as it would eat up all my RAM (probably not
> surprising with a journal that seems to be filling up a 16TiB pool).
>
> Note: I did NOT need to reset the journal (and you probably don't need
> to either). I did, however, have to add extra capacity and balance out
> the data. After an MDS restart, the pool quickly cleared out again.
> The first MDS restart took an hour or so and I had to increase the MDS
> lag timeout (mds_beacon_grace), otherwise the MONs kept killing the
> MDS during the resolve phase. I set it to 1600 to be on the safe side.
>
> While your MDS are recovering, you may want to set debug_mds to 10 for
> one of your MDS and check the logs. My logs were being spammed with
> snapshot-related messages, but I cannot really make sense of them.
> Still hoping for a reply on the ML.
>
> In any case, once you are recovered, I recommend you adjust the
> weights of some of your OSDs to be much lower than others as a
> temporary safeguard. This way, only some OSDs would fill up and
> trigger your FULL watermark should this thing repeat.
>
> Janek
>
>
> On 31/05/2023 16:13, Jake Grimmett wrote:
>> Dear All,
>>
>> we are trying to recover from what we suspect is a corrupt MDS :(
>> and have been following the guide here:
>>
>> <https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/>
>>
>> Symptoms: the MDS SSD pool (2TB) filled completely over the weekend
>> (it normally uses less than 400GB), resulting in an MDS crash.
>>
>> We added 4 extra SSDs to increase the pool capacity to 3.5TB;
>> however, the MDS did not recover:
>>
>> # ceph fs status
>> cephfs2 - 0 clients
>> =======
>> RANK  STATE    MDS       ACTIVITY  DNS   INOS  DIRS   CAPS
>>  0    failed
>>  1    resolve  wilma-s3            8065  8063  8047    0
>>  2    resolve  wilma-s2            901k  802k  34.4k   0
>>        POOL         TYPE     USED   AVAIL
>>       mds_ssd      metadata  2296G  3566G
>>  primary_fs_data   data         0   3566G
>>      ec82pool      data      2168T  3557T
>> STANDBY MDS
>>  wilma-s1
>>  wilma-s4
>>
>> Running "ceph mds repaired 0" causes rank 0 to restart and then
>> immediately fail.
>>
>> Following the disaster-recovery-experts guide, the first step we did
>> was to export the MDS journals, e.g:
>>
>> # cephfs-journal-tool --rank=cephfs2:0 journal export /root/backup.bin.0
>> journal is 9744716714163~658103700
>> wrote 658103700 bytes at offset 9744716714163 to /root/backup.bin.0
>>
>> So far so good; however, when we try to back up the journal for the
>> final rank, the process consumes all available RAM (470GB) and has
>> to be killed after 14 minutes:
>>
>> # cephfs-journal-tool --rank=cephfs2:2 journal export /root/backup.bin.2
>>
>> Similarly, "event recover_dentries summary" consumes all RAM when
>> applied to rank 2:
>> # cephfs-journal-tool --rank=cephfs2:2 event recover_dentries summary
>>
>> We successfully ran "cephfs-journal-tool --rank=cephfs2:0 event
>> recover_dentries summary" and "cephfs-journal-tool --rank=cephfs2:1
>> event recover_dentries summary"
>>
>> At this point we tried to follow the instructions and make a
>> RADOS-level copy of the journal data; however, the docs don't
>> explain how to do this and just point to
>> <http://tracker.ceph.com/issues/9902>
>>
>> We are now tempted to reset the journal on rank 2, but wanted to
>> get a sense from others of how dangerous this could be.
>>
>> We have a backup, but as there is 1.8PB of data, it's going to take a
>> few weeks to restore....
>>
>> any ideas gratefully received.
>>
>> Jake
>>
>>