Hello,
I have a Ceph deployment using CephFS. Recently the MDS failed and cannot
start: any attempt to start an MDS for this filesystem results in a nearly
immediate segfault. Logs below.
cephfs-journal-tool reports the overall journal integrity as OK:

root@proxmox-2:/var/log/ceph# cephfs-journal-tool --rank=galaxy:all journal inspect
Overall journal integrity: OK
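Before attempting anything destructive, I was planning to take a raw backup of the journal first, along these lines (rank 0 and the output path are my assumptions; please correct me if there is a safer way):

```shell
# Back up the raw MDS journal before any repair attempt
# (rank 0 assumed for a single-active-MDS filesystem; adjust --rank
# and the destination path as needed)
cephfs-journal-tool --rank=galaxy:0 journal export /root/galaxy-journal-backup.bin
```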
Stack dump / log from MDS:
-14> 2023-05-26T15:01:09.204-0500 7f27c24b2700 1 mds.0.journaler.mdlog(ro) probing for end of the log
-13> 2023-05-26T15:01:09.208-0500 7f27c34b4700 1 mds.0.journaler.pq(ro) _finish_read_head loghead(trim 4194304, expire 4194607, write 4194607, stream_format 1). probing for end of log (from 4194607)...
-12> 2023-05-26T15:01:09.208-0500 7f27c34b4700 1 mds.0.journaler.pq(ro) probing for end of the log
-11> 2023-05-26T15:01:09.412-0500 7f27c24b2700 1 mds.0.journaler.mdlog(ro) _finish_probe_end write_pos = 2388235687 (header had 2388213543). recovered.
-10> 2023-05-26T15:01:09.412-0500 7f27c34b4700 1 mds.0.journaler.pq(ro) _finish_probe_end write_pos = 4194607 (header had 4194607). recovered.
-9> 2023-05-26T15:01:09.412-0500 7f27c34b4700 4 mds.0.purge_queue operator(): open complete
-8> 2023-05-26T15:01:09.412-0500 7f27c34b4700 1 mds.0.journaler.pq(ro) set_writeable
-7> 2023-05-26T15:01:09.412-0500 7f27c1cb1700 4 mds.0.log Journal 0x200 recovered.
-6> 2023-05-26T15:01:09.412-0500 7f27c1cb1700 4 mds.0.log Recovered journal 0x200 in format 1
-5> 2023-05-26T15:01:09.412-0500 7f27c1cb1700 2 mds.0.6403 Booting: 1: loading/discovering base inodes
-4> 2023-05-26T15:01:09.412-0500 7f27c1cb1700 0 mds.0.cache creating system inode with ino:0x100
-3> 2023-05-26T15:01:09.412-0500 7f27c1cb1700 0 mds.0.cache creating system inode with ino:0x1
-2> 2023-05-26T15:01:09.416-0500 7f27c24b2700 2 mds.0.6403 Booting: 2: replaying mds log
-1> 2023-05-26T15:01:09.416-0500 7f27c24b2700 2 mds.0.6403 Booting: 2: waiting for purge queue recovered
0> 2023-05-26T15:01:09.428-0500 7f27c0caf700 -1 *** Caught signal (Segmentation fault) ** in thread 7f27c0caf700 thread_name:md_log_replay

ceph version 17.2.6 (995dec2cdae920da21db2d455e55efbc339bde24) quincy (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) [0x7f27cd70c140]
2: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x66c2) [0x563540fc7372]
3: (EUpdate::replay(MDSRank*)+0x3c) [0x563540fc8abc]
4: (MDLog::_replay_thread()+0x7cb) [0x563540f4d0fb]
5: (MDLog::ReplayThread::entry()+0xd) [0x563540c1fbfd]
6: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f27cd700ea7]
7: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
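If more detail would help, I can reproduce the crash with MDS debug logging turned up before retrying the start (the levels below are just my guess at something useful):

```shell
# Raise MDS debug verbosity, then lower it again after capturing the crash
ceph config set mds debug_mds 20
ceph config set mds debug_journaler 20
# restart the failed daemon to reproduce (daemon name here is hypothetical)
systemctl restart ceph-mds@proxmox-2
```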
What are the safest steps to recovery at this point?
Thanks,
Al