Hello,
I have a Ceph deployment using CephFS. Recently the MDS failed and cannot
start: any attempt to start an MDS for this filesystem results in a nearly
immediate segfault. Logs below.
cephfs-journal-tool reports the overall journal integrity as OK:

root@proxmox-2:/var/log/ceph# cephfs-journal-tool --rank=galaxy:all journal inspect
Overall journal integrity: OK
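Before attempting anything destructive, I was planning to take a raw backup of the journal first, along these lines (rank 0 and the output path are my assumptions; please correct me if there is a safer way):

```shell
# Back up the raw MDS journal before any repair attempt
# (rank 0 assumed for a single-active-MDS filesystem; adjust --rank
# and the destination path as needed)
cephfs-journal-tool --rank=galaxy:0 journal export /root/galaxy-journal-backup.bin
```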
Stack dump / log from MDS:
-14> 2023-05-26T15:01:09.204-0500 7f27c24b2700 1 mds.0.journaler.mdlog(ro) probing for end of the log
-13> 2023-05-26T15:01:09.208-0500 7f27c34b4700 1 mds.0.journaler.pq(ro) _finish_read_head loghead(trim 4194304, expire 4194607, write 4194607, stream_format 1). probing for end of log (from 4194607)...
-12> 2023-05-26T15:01:09.208-0500 7f27c34b4700 1 mds.0.journaler.pq(ro) probing for end of the log
-11> 2023-05-26T15:01:09.412-0500 7f27c24b2700 1 mds.0.journaler.mdlog(ro) _finish_probe_end write_pos = 2388235687 (header had 2388213543). recovered.
-10> 2023-05-26T15:01:09.412-0500 7f27c34b4700 1 mds.0.journaler.pq(ro) _finish_probe_end write_pos = 4194607 (header had 4194607). recovered.
-9> 2023-05-26T15:01:09.412-0500 7f27c34b4700 4 mds.0.purge_queue operator(): open complete
-8> 2023-05-26T15:01:09.412-0500 7f27c34b4700 1 mds.0.journaler.pq(ro) set_writeable
-7> 2023-05-26T15:01:09.412-0500 7f27c1cb1700 4 mds.0.log Journal 0x200 recovered.
-6> 2023-05-26T15:01:09.412-0500 7f27c1cb1700 4 mds.0.log Recovered journal 0x200 in format 1
-5> 2023-05-26T15:01:09.412-0500 7f27c1cb1700 2 mds.0.6403 Booting: 1: loading/discovering base inodes
-4> 2023-05-26T15:01:09.412-0500 7f27c1cb1700 0 mds.0.cache creating system inode with ino:0x100
-3> 2023-05-26T15:01:09.412-0500 7f27c1cb1700 0 mds.0.cache creating system inode with ino:0x1
-2> 2023-05-26T15:01:09.416-0500 7f27c24b2700 2 mds.0.6403 Booting: 2: replaying mds log
-1> 2023-05-26T15:01:09.416-0500 7f27c24b2700 2 mds.0.6403 Booting: 2: waiting for purge queue recovered
0> 2023-05-26T15:01:09.428-0500 7f27c0caf700 -1 *** Caught signal (Segmentation fault) ** in thread 7f27c0caf700 thread_name:md_log_replay

ceph version 17.2.6 (995dec2cdae920da21db2d455e55efbc339bde24) quincy (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) [0x7f27cd70c140]
2: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x66c2) [0x563540fc7372]
3: (EUpdate::replay(MDSRank*)+0x3c) [0x563540fc8abc]
4: (MDLog::_replay_thread()+0x7cb) [0x563540f4d0fb]
5: (MDLog::ReplayThread::entry()+0xd) [0x563540c1fbfd]
6: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f27cd700ea7]
7: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
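If more detail would help, I can reproduce the crash with MDS debug logging turned up before retrying the start (the levels below are just my guess at something useful):

```shell
# Raise MDS debug verbosity, then lower it again after capturing the crash
ceph config set mds debug_mds 20
ceph config set mds debug_journaler 20
# restart the failed daemon to reproduce (daemon name here is hypothetical)
systemctl restart ceph-mds@proxmox-2
```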
What are the safest steps to recovery at this point?
Thanks,
Al