Hey all! I’ve run into an MDS crash on a cluster recently upgraded from Ceph 16.2.7 to
16.2.10. I’m hitting an assert nearly identical to this one gathered by the telemetry
module:
https://tracker.ceph.com/issues/54747
I have a new build compiling to test whether
https://github.com/ceph/ceph/pull/43184/
makes a difference when mds_inject_skip_replaying_inotable is set.
Relevant logs are below. Has anyone hit anything like this? Thanks in advance!
=== BEGIN LOG SNIPPET ===
-2> 2023-01-18T20:16:29.789+0000 7f6190243700 -1 log_channel(cluster) log [ERR] : journal replay alloc 0x10000000010 not in free [0x10000000011~0x3dc,0x100000003fb~0x1e8,0x100000005e5~0x2,0x100000009d4~0x2,0x1000005cc6d~0x4,0x10001c6b44e~0x4,0x10001cb91f4~0x1f4,0x10001cb93f4~0x3dd,0x10007582c15~0x279,0x10007582e90~0x1f4,0x10007583094~0xfff8a7cf6c]
-1> 2023-01-18T20:16:29.789+0000 7f6190243700 -1 /builds/66321/e7c73776/ceph/-build//WORKDIR/ceph-16.2.10/src/mds/journal.cc: In function 'void EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)' thread 7f6190243700 time 2023-01-18T20:16:29.794189+0000
/WORKDIR/ceph-16.2.10/src/mds/journal.cc: 1577: FAILED ceph_assert(inotablev == mds->inotable->get_version())
ceph version 16.2.10 (e7c73776b3136f6d18a35febeb38f5fdd41be364) pacific (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14c) [0x7f619d548645]
2: /usr/lib/ceph/libceph-common.so.2(+0x27182f) [0x7f619d54882f]
3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5815) [0x560bfd1c6935]
4: (EUpdate::replay(MDSRank*)+0x3c) [0x560bfd1c7ecc]
5: (MDLog::_replay_thread()+0xca9) [0x560bfd153de9]
6: (MDLog::ReplayThread::entry()+0xd) [0x560bfce78fdd]
7: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f619cf29fa3]
8: clone()
0> 2023-01-18T20:16:29.793+0000 7f6190243700 -1 *** Caught signal (Aborted) **
in thread 7f6190243700 thread_name:md_log_replay
ceph version 16.2.10 (e7c73776b3136f6d18a35febeb38f5fdd41be364) pacific (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7f619cf34730]
2: gsignal()
3: abort()
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19d) [0x7f619d548696]
5: /usr/lib/ceph/libceph-common.so.2(+0x27182f) [0x7f619d54882f]
6: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5815) [0x560bfd1c6935]
7: (EUpdate::replay(MDSRank*)+0x3c) [0x560bfd1c7ecc]
8: (MDLog::_replay_thread()+0xca9) [0x560bfd153de9]
9: (MDLog::ReplayThread::entry()+0xd) [0x560bfce78fdd]
10: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f619cf29fa3]
11: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
=== END LOG SNIPPET ===
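
As a quick sanity check on the first [ERR] line, here is a minimal Python sketch that parses the free list and tests whether the allocated inode number falls inside it. It assumes the list is printed in interval_set notation, i.e. comma-separated `start~length` pairs in hex; the helper names `parse_interval_set` and `contains` are made up for illustration and are not part of Ceph.

```python
# Free list copied verbatim from the "journal replay alloc ... not in free" line.
FREE = (
    "[0x10000000011~0x3dc,0x100000003fb~0x1e8,0x100000005e5~0x2,"
    "0x100000009d4~0x2,0x1000005cc6d~0x4,0x10001c6b44e~0x4,"
    "0x10001cb91f4~0x1f4,0x10001cb93f4~0x3dd,0x10007582c15~0x279,"
    "0x10007582e90~0x1f4,0x10007583094~0xfff8a7cf6c]"
)
ALLOC = 0x10000000010  # inode number the journal entry tried to allocate


def parse_interval_set(text):
    """Parse '[start~len,start~len,...]' (hex) into (start, length) tuples."""
    body = text.strip().strip("[]")
    intervals = []
    for item in body.split(","):
        start, length = item.split("~")
        intervals.append((int(start, 16), int(length, 16)))
    return intervals


def contains(intervals, ino):
    """True if ino lies in any half-open range [start, start + length)."""
    return any(start <= ino < start + length for start, length in intervals)


intervals = parse_interval_set(FREE)
print(contains(intervals, ALLOC))      # False, matching the "not in free" error
print(intervals[0][0] - ALLOC)         # 1: ALLOC sits just below the first free range
```

Under that assumption, 0x10000000010 is exactly one below the start of the first free range (0x10000000011).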