Update: apparently, we did it!
We walked through the disaster recovery steps, one of which was to
reset the journal. I was under the impression that the documented
command 'cephfs-journal-tool [--rank=N] journal reset' would simply
reset all the journals (mdlog and purge_queue), but it seems
it doesn't. After Mykola (once again, thank you so much for your
input) pointed towards running the command for the purge_queue
specifically, the filesystem left read-only mode and was
mountable again. The exact command was:
cephfs-journal-tool --rank=cephfs:0 --journal=purge_queue journal reset
We didn't have to walk through the recovery with an empty metadata
pool, which is nice. I'd suggest adding the "journal inspect" command
to the docs for both mdlog and purge_queue, to make clear that both
journals might need a reset.
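For reference, a minimal sketch of the sequence in our case (rank and
fs name are from our cluster, adjust as needed; only reset the
journal(s) that inspect reports as DAMAGED):
cephfs-journal-tool --rank=cephfs:0 --journal=mdlog journal inspect
cephfs-journal-tool --rank=cephfs:0 --journal=purge_queue journal inspect
cephfs-journal-tool --rank=cephfs:0 --journal=purge_queue journal reset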
Thanks again, Mykola!
Eugen
Quoting Eugen Block <eblock(a)nde.ag>:
So we did walk through the advanced recovery page but didn't really
succeed. The CephFS still goes read-only because of the
purge_queue error. Is there any chance to recover from that, or
should we try the recovery with an empty metadata pool next?
I'd still appreciate any comments. ;-)
Quoting Eugen Block <eblock(a)nde.ag>:
> Some more information on the damaged CephFS; apparently the journal
> is damaged:
>
> ---snip---
> # cephfs-journal-tool --rank=storage:0 --journal=mdlog journal inspect
>
> 2023-12-08T15:35:22.922+0200 7f834d0320c0 -1 Missing object 200.000527c4
>
> 2023-12-08T15:35:22.938+0200 7f834d0320c0 -1 Bad entry start ptr
> (0x149f140067f) at 0x149f1174595
>
> 2023-12-08T15:35:22.942+0200 7f834d0320c0 -1 Bad entry start ptr
> (0x149f1400e66) at 0x149f1174d7c
>
> 2023-12-08T15:35:22.954+0200 7f834d0320c0 -1 Bad entry start ptr
> (0x149f1401642) at 0x149f1175558
>
> 2023-12-08T15:35:22.970+0200 7f834d0320c0 -1 Bad entry start ptr
> (0x149f1401e29) at 0x149f1175d3f
>
> 2023-12-08T15:35:22.974+0200 7f834d0320c0 -1 Bad entry start ptr
> (0x149f1402610) at 0x149f1176526
>
> 2023-12-08T15:35:22.978+0200 7f834d0320c0 -1 Missing object 200.000527ca
>
> 2023-12-08T15:35:22.978+0200 7f834d0320c0 -1 Missing object 200.000527cb
>
> 2023-12-08T15:35:22.994+0200 7f834d0320c0 -1 Bad entry start ptr
> (0x149f30008f4) at 0x149f2d7480a
>
> 2023-12-08T15:35:22.998+0200 7f834d0320c0 -1 Bad entry start ptr
> (0x149f3000ced) at 0x149f2d74c03
>
> Overall journal integrity: DAMAGED
> Objects missing:
> 0x527c4
> 0x527ca
> 0x527cb
> Corrupt regions:
> 0x149f0d73f16-149f1174595
> 0x149f1174595-149f1174d7c
> 0x149f1174d7c-149f1175558
> 0x149f1175558-149f1175d3f
> 0x149f1175d3f-149f1176526
> 0x149f1176526-149f2d7480a
> 0x149f2d7480a-149f2d74c03
> 0x149f2d74c03-ffffffffffffffff
>
> # cephfs-journal-tool --rank=storage:0 --journal=purge_queue journal inspect
>
> 2023-12-08T15:35:57.691+0200 7f331621e0c0 -1 Missing object 500.00000dc6
>
> Overall journal integrity: DAMAGED
> Objects missing:
> 0xdc6
> Corrupt regions:
> 0x3718522e9-ffffffffffffffff
> ---snip---
>
> A backup isn't possible:
>
> ---snip---
> # cephfs-journal-tool --rank=storage:0 journal export backup.bin
> 2023-12-08T15:42:07.643+0200 7fde6a24f0c0 -1 Missing object 200.000527c4
>
> 2023-12-08T15:42:07.659+0200 7fde6a24f0c0 -1 Bad entry start ptr
> (0x149f140067f) at 0x149f1174595
>
> 2023-12-08T15:42:07.667+0200 7fde6a24f0c0 -1 Bad entry start ptr
> (0x149f1400e66) at 0x149f1174d7c
>
> 2023-12-08T15:42:07.675+0200 7fde6a24f0c0 -1 Bad entry start ptr
> (0x149f1401642) at 0x149f1175558
>
> 2023-12-08T15:42:07.687+0200 7fde6a24f0c0 -1 Bad entry start ptr
> (0x149f1401e29) at 0x149f1175d3f
>
> 2023-12-08T15:42:07.699+0200 7fde6a24f0c0 -1 Bad entry start ptr
> (0x149f1402610) at 0x149f1176526
>
> 2023-12-08T15:42:07.699+0200 7fde6a24f0c0 -1 Missing object 200.000527ca
>
> 2023-12-08T15:42:07.699+0200 7fde6a24f0c0 -1 Missing object 200.000527cb
>
> 2023-12-08T15:42:07.707+0200 7fde6a24f0c0 -1 Bad entry start ptr
> (0x149f30008f4) at 0x149f2d7480a
>
> 2023-12-08T15:42:07.707+0200 7fde6a24f0c0 -1 Bad entry start ptr
> (0x149f3000ced) at 0x149f2d74c03
>
> 2023-12-08T15:42:07.707+0200 7fde6a24f0c0 -1 journal_export:
> Journal not readable, attempt object-by-object dump with `rados`
>
> Error ((5) Input/output error)
> ---snip---
>
> Does it make sense to continue with the advanced disaster recovery
> [3] by running (all of) these steps:
>
> cephfs-journal-tool event recover_dentries summary
> cephfs-journal-tool [--rank=N] journal reset
> cephfs-table-tool all reset session
> ceph fs reset <fs name> --yes-i-really-mean-it
> cephfs-table-tool 0 reset session
> cephfs-table-tool 0 reset snap
> cephfs-table-tool 0 reset inode
> cephfs-journal-tool --rank=0 journal reset
> cephfs-data-scan init
>
> Fortunately, I haven't had to run through this procedure very often,
> so I'd appreciate any comments on what the best approach would be here.
>
> Thanks!
> Eugen
>
> [3]
> https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-…
>
>
> Quoting Eugen Block <eblock(a)nde.ag>:
>
>> I was able to (almost) reproduce the issue in a (Pacific) test
>> cluster. I rebuilt the monmap from the OSDs, brought everything
>> back up, and started the MDS recovery as described in [1]:
>>
>> ceph fs new <fs_name> <metadata_pool> <data_pool> --force --recover
>>
>> Then I added two mds daemons which went into standby:
>>
>> ---snip---
>> Started Ceph mds.cephfs.pacific.uexvvq for
>> 1b0afda4-2221-11ee-87be-fa163eed040c.
>> Dez 08 12:51:53 pacific conmon[100493]: debug
>> 2023-12-08T11:51:53.086+0000 7ff5f589b900 0 set uid:gid to
>> 167:167 (ceph:ceph)
>> Dez 08 12:51:53 pacific conmon[100493]: debug
>> 2023-12-08T11:51:53.086+0000 7ff5f589b900 0 ceph version 16.2.14
>> (238ba602515df21ea7ffc75c88db29f9e5ef12c9) pacific (stable),
>> process ceph-md>
>> Dez 08 12:51:53 pacific conmon[100493]: debug
>> 2023-12-08T11:51:53.086+0000 7ff5f589b900 1 main not setting numa
>> affinity
>> Dez 08 12:51:53 pacific conmon[100493]: debug
>> 2023-12-08T11:51:53.086+0000 7ff5f589b900 0 pidfile_write: ignore
>> empty --pid-file
>> Dez 08 12:51:53 pacific conmon[100493]: starting
>> mds.cephfs.pacific.uexvvq at
>> Dez 08 12:51:53 pacific conmon[100493]: debug
>> 2023-12-08T11:51:53.102+0000 7ff5e37be700 1
>> mds.cephfs.pacific.uexvvq Updating MDS map to version 2 from mon.0
>> Dez 08 12:51:53 pacific conmon[100493]: debug
>> 2023-12-08T11:51:53.802+0000 7ff5e37be700 1
>> mds.cephfs.pacific.uexvvq Updating MDS map to version 3 from mon.0
>> Dez 08 12:51:53 pacific conmon[100493]: debug
>> 2023-12-08T11:51:53.802+0000 7ff5e37be700 1
>> mds.cephfs.pacific.uexvvq Monitors have assigned me to become a
>> standby.
>> ---snip---
>>
>>
>> But as soon as I ran
>>
>> pacific:~ # ceph fs set cephfs joinable true
>> cephfs marked joinable; MDS may join as newly active.
>>
>> one MDS daemon became active and the FS is available now. So
>> apparently the "Advanced" steps from [2] usually aren't
>> necessary, but are they in this case? I'm still trying to find an
>> explanation for the purge_queue errors.
>>
>> Quoting Eugen Block <eblock(a)nde.ag>:
>>
>>> Hi,
>>>
>>> following up on the previous thread (After hardware failure tried
>>> to recover ceph and followed instructions for recovery using
>>> OSDS), we were able to get ceph back into a healthy state
>>> (including the unfound object). Now the CephFS needs to be
>>> recovered and I'm having trouble fully understanding from the docs
>>> [1] what the next steps would be. We ran the following, which
>>> according to [1] sets the state to "existing but failed":
>>>
>>> ceph fs new <fs_name> <metadata_pool> <data_pool> --force --recover
>>>
>>> But how to continue from here? Should we expect an active MDS at
>>> this point or not? Because the "ceph fs status" output still
>>> shows rank 0 as failed. We then tried:
>>>
>>> ceph fs set <fs_name> joinable true
>>>
>>> But apparently it was already joinable, nothing changed. Before
>>> doing anything (destructive) from the advanced options [2] I
>>> wanted to ask the community how to proceed from here. I pasted
>>> the mds logs at the bottom; I'm not really sure if the current
>>> state is expected or not. Apparently, the journal recovers but
>>> the purge_queue does not:
>>>
>>> mds.0.41 Booting: 2: waiting for purge queue recovered
>>> mds.0.journaler.pq(ro) _finish_probe_end write_pos = 14797504512
>>> (header had 14789452521). recovered.
>>> mds.0.purge_queue operator(): open complete
>>> mds.0.purge_queue operator(): recovering write_pos
>>> monclient: get_auth_request con 0x55c280bc5c00 auth_method 0
>>> monclient: get_auth_request con 0x55c280ee0c00 auth_method 0
>>> mds.0.journaler.pq(ro) _finish_read got error -2
>>> mds.0.purge_queue _recover: Error -2 recovering write_pos
>>> mds.0.purge_queue _go_readonly: going readonly because internal
>>> IO failed: No such file or directory
>>> mds.0.journaler.pq(ro) set_readonly
>>> mds.0.41 unhandled write error (2) No such file or directory,
>>> force readonly...
>>> mds.0.cache force file system read-only
>>> force file system read-only
>>>
>>> Is this expected because the "--recover" flag prevents an active
>>> MDS or not? Before running "ceph mds rmfailed ..." and/or "ceph
>>> fs reset <file system name>" with the --yes-i-really-mean-it flag
>>> I'd like to ask for your input. In which case should we run those
>>> commands? The docs are not really clear to me. Any input is
>>> highly appreciated!
>>>
>>> Thanks!
>>> Eugen
>>>
>>> [1]
>>> https://docs.ceph.com/en/latest/cephfs/recover-fs-after-mon-store-loss/
>>> [2]
>>> https://docs.ceph.com/en/latest/cephfs/administration/#advanced-cephfs-admi…
>>>
>>> ---snip---
>>> Dec 07 15:35:48 node02 bash[692598]: debug -90>
>>> 2023-12-07T13:35:47.730+0000 7f4cd855f700 1
>>> mds.storage.node02.hemalk Updating MDS map to version 41 from mon.0
>>> Dec 07 15:35:48 node02 bash[692598]: debug -89>
>>> 2023-12-07T13:35:47.730+0000 7f4cd855f700 4 mds.0.purge_queue
>>> operator(): data pool 3 not found in OSDMap
>>> Dec 07 15:35:48 node02 bash[692598]: debug -88>
>>> 2023-12-07T13:35:47.730+0000 7f4cd855f700 5 asok(0x55c27fe86000)
>>> register_command objecter_requests hook 0x55c27fe16310
>>> Dec 07 15:35:48 node02 bash[692598]: debug -87>
>>> 2023-12-07T13:35:47.730+0000 7f4cd855f700 10 monclient: _renew_subs
>>> Dec 07 15:35:48 node02 bash[692598]: debug -86>
>>> 2023-12-07T13:35:47.730+0000 7f4cd855f700 10 monclient:
>>> _send_mon_message to mon.node02 at v2:10.40.99.12:3300/0
>>> Dec 07 15:35:48 node02 bash[692598]: debug -85>
>>> 2023-12-07T13:35:47.730+0000 7f4cd855f700 10 log_channel(cluster)
>>> update_config to_monitors: true to_syslog: false syslog_facility:
>>> prio: info to_graylog: false graylog_host: 127.0.0.1
>>> graylog_port: 12201)
>>> Dec 07 15:35:48 node02 bash[692598]: debug -84>
>>> 2023-12-07T13:35:47.730+0000 7f4cd855f700 4 mds.0.purge_queue
>>> operator(): data pool 3 not found in OSDMap
>>> Dec 07 15:35:48 node02 bash[692598]: debug -83>
>>> 2023-12-07T13:35:47.730+0000 7f4cd855f700 4 mds.0.0
>>> apply_blocklist: killed 0, blocklisted sessions (0 blocklist
>>> entries, 0)
>>> Dec 07 15:35:48 node02 bash[692598]: debug -82>
>>> 2023-12-07T13:35:47.730+0000 7f4cd855f700 1 mds.0.41
>>> handle_mds_map i am now mds.0.41
>>> Dec 07 15:35:48 node02 bash[692598]: debug -81>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 1 mds.0.41
>>> handle_mds_map state change up:standby --> up:replay
>>> Dec 07 15:35:48 node02 bash[692598]: debug -80>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 5
>>> mds.beacon.storage.node02.hemalk set_want_state: up:standby ->
>>> up:replay
>>> Dec 07 15:35:48 node02 bash[692598]: debug -79>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 1 mds.0.41 replay_start
>>> Dec 07 15:35:48 node02 bash[692598]: debug -78>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 2 mds.0.41 Booting: 0:
>>> opening inotable
>>> Dec 07 15:35:48 node02 bash[692598]: debug -77>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 10 monclient:
>>> _send_mon_message to mon.node02 at v2:10.40.99.12:3300/0
>>> Dec 07 15:35:48 node02 bash[692598]: debug -76>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 2 mds.0.41 Booting: 0:
>>> opening sessionmap
>>> Dec 07 15:35:48 node02 bash[692598]: debug -75>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 10 monclient:
>>> _send_mon_message to mon.node02 at v2:10.40.99.12:3300/0
>>> Dec 07 15:35:48 node02 bash[692598]: debug -74>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 2 mds.0.41 Booting: 0:
>>> opening mds log
>>> Dec 07 15:35:48 node02 bash[692598]: debug -73>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 5 mds.0.log open
>>> discovering log bounds
>>> Dec 07 15:35:48 node02 bash[692598]: debug -72>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 2 mds.0.41 Booting: 0:
>>> opening purge queue (async)
>>> Dec 07 15:35:48 node02 bash[692598]: debug -71>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 4 mds.0.purge_queue
>>> open: opening
>>> Dec 07 15:35:48 node02 bash[692598]: debug -70>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 1
>>> mds.0.journaler.pq(ro) recover start
>>> Dec 07 15:35:48 node02 bash[692598]: debug -69>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 1
>>> mds.0.journaler.pq(ro) read_head
>>> Dec 07 15:35:48 node02 bash[692598]: debug -68>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 10 monclient:
>>> _send_mon_message to mon.node02 at v2:10.40.99.12:3300/0
>>> Dec 07 15:35:48 node02 bash[692598]: debug -67>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 2 mds.0.41 Booting: 0:
>>> loading open file table (async)
>>> Dec 07 15:35:48 node02 bash[692598]: debug -66>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 10 monclient:
>>> _send_mon_message to mon.node02 at v2:10.40.99.12:3300/0
>>> Dec 07 15:35:48 node02 bash[692598]: debug -65>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 2 mds.0.41 Booting: 0:
>>> opening snap table
>>> Dec 07 15:35:48 node02 bash[692598]: debug -64>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 10 monclient:
>>> _send_mon_message to mon.node02 at v2:10.40.99.12:3300/0
>>> Dec 07 15:35:48 node02 bash[692598]: debug -63>
>>> 2023-12-07T13:35:47.734+0000 7f4cd1d52700 4 mds.0.journalpointer
>>> Reading journal pointer '400.00000000'
>>> Dec 07 15:35:48 node02 bash[692598]: debug -62>
>>> 2023-12-07T13:35:47.734+0000 7f4cd1d52700 10 monclient:
>>> _send_mon_message to mon.node02 at v2:10.40.99.12:3300/0
>>> Dec 07 15:35:48 node02 bash[692598]: debug -61>
>>> 2023-12-07T13:35:47.734+0000 7f4cd4557700 2 mds.0.cache Memory
>>> usage: total 316452, rss 43088, heap 198940, baseline 198940, 0
>>> / 0 inodes have caps, 0 caps, 0 caps per inode
>>> Dec 07 15:35:48 node02 bash[692598]: debug -60>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 10 monclient: _renew_subs
>>> Dec 07 15:35:48 node02 bash[692598]: debug -59>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 10 monclient:
>>> _send_mon_message to mon.node02 at v2:10.40.99.12:3300/0
>>> Dec 07 15:35:48 node02 bash[692598]: debug -58>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 10 monclient:
>>> handle_get_version_reply finishing 1 version 10835
>>> Dec 07 15:35:48 node02 bash[692598]: debug -57>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 10 monclient:
>>> handle_get_version_reply finishing 2 version 10835
>>> Dec 07 15:35:48 node02 bash[692598]: debug -56>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 10 monclient:
>>> handle_get_version_reply finishing 3 version 10835
>>> Dec 07 15:35:48 node02 bash[692598]: debug -55>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 10 monclient:
>>> handle_get_version_reply finishing 4 version 10835
>>> Dec 07 15:35:48 node02 bash[692598]: debug -54>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 10 monclient:
>>> handle_get_version_reply finishing 5 version 10835
>>> Dec 07 15:35:48 node02 bash[692598]: debug -53>
>>> 2023-12-07T13:35:47.734+0000 7f4cd855f700 10 monclient:
>>> handle_get_version_reply finishing 6 version 10835
>>> Dec 07 15:35:48 node02 bash[692598]: debug -52>
>>> 2023-12-07T13:35:47.734+0000 7f4cdb565700 10 monclient:
>>> get_auth_request con 0x55c280bc5800 auth_method 0
>>> Dec 07 15:35:48 node02 bash[692598]: debug -51>
>>> 2023-12-07T13:35:47.734+0000 7f4cdbd66700 10 monclient:
>>> get_auth_request con 0x55c280dc6800 auth_method 0
>>> Dec 07 15:35:48 node02 bash[692598]: debug -50>
>>> 2023-12-07T13:35:47.734+0000 7f4cdad64700 10 monclient:
>>> get_auth_request con 0x55c280dc7800 auth_method 0
>>> Dec 07 15:35:48 node02 bash[692598]: debug -49>
>>> 2023-12-07T13:35:47.734+0000 7f4cd3555700 1
>>> mds.0.journaler.pq(ro) _finish_read_head loghead(trim
>>> 14789115904, expire 14789452521, write 14789452521, stream_format
>>> 1). probing for end of log (from 14789452521)...
>>> Dec 07 15:35:48 node02 bash[692598]: debug -48>
>>> 2023-12-07T13:35:47.734+0000 7f4cd3555700 1
>>> mds.0.journaler.pq(ro) probing for end of the log
>>> Dec 07 15:35:48 node02 bash[692598]: debug -47>
>>> 2023-12-07T13:35:47.738+0000 7f4cd1d52700 1
>>> mds.0.journaler.mdlog(ro) recover start
>>> Dec 07 15:35:48 node02 bash[692598]: debug -46>
>>> 2023-12-07T13:35:47.738+0000 7f4cd1d52700 1
>>> mds.0.journaler.mdlog(ro) read_head
>>> Dec 07 15:35:48 node02 bash[692598]: debug -45>
>>> 2023-12-07T13:35:47.738+0000 7f4cd1d52700 4 mds.0.log Waiting
>>> for journal 0x200 to recover...
>>> Dec 07 15:35:48 node02 bash[692598]: debug -44>
>>> 2023-12-07T13:35:47.738+0000 7f4cdbd66700 10 monclient:
>>> get_auth_request con 0x55c280dc7c00 auth_method 0
>>> Dec 07 15:35:48 node02 bash[692598]: debug -43>
>>> 2023-12-07T13:35:47.738+0000 7f4cd2553700 1
>>> mds.0.journaler.mdlog(ro) _finish_read_head loghead(trim
>>> 1416940748800, expire 1416947000701, write 1417125359769,
>>> stream_format 1). probing for end of log (from 1417125359769)...
>>> Dec 07 15:35:48 node02 bash[692598]: debug -42>
>>> 2023-12-07T13:35:47.738+0000 7f4cd2553700 1
>>> mds.0.journaler.mdlog(ro) probing for end of the log
>>> Dec 07 15:35:48 node02 bash[692598]: debug -41>
>>> 2023-12-07T13:35:47.738+0000 7f4cdb565700 10 monclient:
>>> get_auth_request con 0x55c280e2fc00 auth_method 0
>>> Dec 07 15:35:48 node02 bash[692598]: debug -40>
>>> 2023-12-07T13:35:47.738+0000 7f4cdad64700 10 monclient:
>>> get_auth_request con 0x55c280ee0400 auth_method 0
>>> Dec 07 15:35:48 node02 bash[692598]: debug -39>
>>> 2023-12-07T13:35:47.738+0000 7f4cd2553700 1
>>> mds.0.journaler.mdlog(ro) _finish_probe_end write_pos =
>>> 1417129492480 (header had 1417125359769). recovered.
>>> Dec 07 15:35:48 node02 bash[692598]: debug -38>
>>> 2023-12-07T13:35:47.738+0000 7f4cd1d52700 4 mds.0.log Journal
>>> 0x200 recovered.
>>> Dec 07 15:35:48 node02 bash[692598]: debug -37>
>>> 2023-12-07T13:35:47.738+0000 7f4cd1d52700 4 mds.0.log Recovered
>>> journal 0x200 in format 1
>>> Dec 07 15:35:48 node02 bash[692598]: debug -36>
>>> 2023-12-07T13:35:47.738+0000 7f4cd1d52700 2 mds.0.41 Booting: 1:
>>> loading/discovering base inodes
>>> Dec 07 15:35:48 node02 bash[692598]: debug -35>
>>> 2023-12-07T13:35:47.738+0000 7f4cd1d52700 0 mds.0.cache creating
>>> system inode with ino:0x100
>>> Dec 07 15:35:48 node02 bash[692598]: debug -34>
>>> 2023-12-07T13:35:47.738+0000 7f4cd1d52700 0 mds.0.cache creating
>>> system inode with ino:0x1
>>> Dec 07 15:35:48 node02 bash[692598]: debug -33>
>>> 2023-12-07T13:35:47.742+0000 7f4cdbd66700 10 monclient:
>>> get_auth_request con 0x55c280dc7400 auth_method 0
>>> Dec 07 15:35:48 node02 bash[692598]: debug -32>
>>> 2023-12-07T13:35:47.742+0000 7f4cd2553700 2 mds.0.41 Booting: 2:
>>> replaying mds log
>>> Dec 07 15:35:48 node02 bash[692598]: debug -31>
>>> 2023-12-07T13:35:47.742+0000 7f4cd2553700 2 mds.0.41 Booting: 2:
>>> waiting for purge queue recovered
>>> Dec 07 15:35:48 node02 bash[692598]: debug -30>
>>> 2023-12-07T13:35:47.742+0000 7f4cd3555700 1
>>> mds.0.journaler.pq(ro) _finish_probe_end write_pos = 14797504512
>>> (header had 14789452521). recovered.
>>> Dec 07 15:35:48 node02 bash[692598]: debug -29>
>>> 2023-12-07T13:35:47.742+0000 7f4cd3555700 4 mds.0.purge_queue
>>> operator(): open complete
>>> Dec 07 15:35:48 node02 bash[692598]: debug -28>
>>> 2023-12-07T13:35:47.742+0000 7f4cd3555700 4 mds.0.purge_queue
>>> operator(): recovering write_pos
>>> Dec 07 15:35:48 node02 bash[692598]: debug -27>
>>> 2023-12-07T13:35:47.742+0000 7f4cdb565700 10 monclient:
>>> get_auth_request con 0x55c280bc5c00 auth_method 0
>>> Dec 07 15:35:48 node02 bash[692598]: debug -26>
>>> 2023-12-07T13:35:47.742+0000 7f4cdad64700 10 monclient:
>>> get_auth_request con 0x55c280ee0c00 auth_method 0
>>> Dec 07 15:35:48 node02 bash[692598]: debug -25>
>>> 2023-12-07T13:35:47.746+0000 7f4cd3555700 0
>>> mds.0.journaler.pq(ro) _finish_read got error -2
>>> Dec 07 15:35:48 node02 bash[692598]: debug -24>
>>> 2023-12-07T13:35:47.746+0000 7f4cd3555700 -1 mds.0.purge_queue
>>> _recover: Error -2 recovering write_pos
>>> Dec 07 15:35:48 node02 bash[692598]: debug -23>
>>> 2023-12-07T13:35:47.746+0000 7f4cd3555700 1 mds.0.purge_queue
>>> _go_readonly: going readonly because internal IO failed: No such
>>> file or directory
>>> Dec 07 15:35:48 node02 bash[692598]: debug -22>
>>> 2023-12-07T13:35:47.746+0000 7f4cd3555700 1
>>> mds.0.journaler.pq(ro) set_readonly
>>> Dec 07 15:35:48 node02 bash[692598]: debug -21>
>>> 2023-12-07T13:35:47.746+0000 7f4cd3555700 -1 mds.0.41 unhandled
>>> write error (2) No such file or directory, force readonly...
>>> Dec 07 15:35:48 node02 bash[692598]: debug -20>
>>> 2023-12-07T13:35:47.746+0000 7f4cd3555700 1 mds.0.cache force
>>> file system read-only
>>> Dec 07 15:35:48 node02 bash[692598]: debug -19>
>>> 2023-12-07T13:35:47.746+0000 7f4cd3555700 0 log_channel(cluster)
>>> log [WRN] : force file system read-only
>>> ---snip---