My CephFS filesystem recently went through a long recovery after losing some
PGs and OSDs. It finally came back to "HEALTH_OK" for a bit, but then the MDS
servers started crashing, and I cannot get any of the 3 MDS servers to stay up
now. Each crash leaves this error in the logs:
-313> 2019-07-11 17:42:39.820 7f612c147700 1 --
10.10.30.116:6800/543707238 --> 10.10.30.115:6801/81746 --
mgrreport(unknown.ic2mon02 +0-0 packed 1374) v6 -- 0x2ed1c00 con 0
-313> 2019-07-11 17:42:39.820 7f612b946700 -1
/build/ceph-13.2.6/src/mds/MDCache.cc: In function 'void
MDCache::journal_cow_dentry(MutationImpl*, EMetaBlob*, CDentry*, snapid_t,
CInode**, CDentry::linkage_t*)' thread 7f612b946700 time 2019-07-11
17:42:39.820872
/build/ceph-13.2.6/src/mds/MDCache.cc: 1680: FAILED assert(follows >=
realm->get_newest_seq())
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic
(stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x14e) [0x7f61367b997e]
2: (()+0x2fab07) [0x7f61367b9b07]
3: (MDCache::journal_cow_dentry(MutationImpl*, EMetaBlob*, CDentry*,
snapid_t, CInode**, CDentry::linkage_t*)+0xd3f) [0x5f821f]
4: (MDCache::journal_dirty_inode(MutationImpl*, EMetaBlob*, CInode*,
snapid_t)+0xc0) [0x5f8450]
5: (MDCache::predirty_journal_parents(boost::intrusive_ptr<MutationImpl>,
EMetaBlob*, CInode*, CDir*, int, int, snapid_t)+0x4b1) [0x5f9141]
6: (Locker::scatter_writebehind(ScatterLock*)+0x465) [0x64a615]
7: (Locker::simple_sync(SimpleLock*, bool*)+0x176) [0x64e506]
8: (Locker::scatter_nudge(ScatterLock*, MDSInternalContextBase*,
bool)+0x3dd) [0x652f6d]
9: (Locker::scatter_tick()+0x1e4) [0x6535a4]
10: (Locker::tick()+0x9) [0x6538b9]
11: (MDSRankDispatcher::tick()+0x1e9) [0x4f00d9]
12: (FunctionContext::finish(int)+0x2c) [0x4d52dc]
13: (Context::complete(int)+0x9) [0x4d31d9]
14: (SafeTimer::timer_thread()+0x18b) [0x7f61367b620b]
15: (SafeTimerThread::entry()+0xd) [0x7f61367b786d]
16: (()+0x76ba) [0x7f61360356ba]
17: (clone()+0x6d) [0x7f613585e41d]
-313> 2019-07-11 17:42:39.820 7f612b946700 -1 *** Caught signal (Aborted)
**
in thread 7f612b946700 thread_name:safe_timer
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic
(stable)
1: (()+0x11390) [0x7f613603f390]
2: (gsignal()+0x38) [0x7f613578c428]
3: (abort()+0x16a) [0x7f613578e02a]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x256) [0x7f61367b9a86]
5: (()+0x2fab07) [0x7f61367b9b07]
6: (MDCache::journal_cow_dentry(MutationImpl*, EMetaBlob*, CDentry*,
snapid_t, CInode**, CDentry::linkage_t*)+0xd3f) [0x5f821f]
7: (MDCache::journal_dirty_inode(MutationImpl*, EMetaBlob*, CInode*,
snapid_t)+0xc0) [0x5f8450]
8: (MDCache::predirty_journal_parents(boost::intrusive_ptr<MutationImpl>,
EMetaBlob*, CInode*, CDir*, int, int, snapid_t)+0x4b1) [0x5f9141]
9: (Locker::scatter_writebehind(ScatterLock*)+0x465) [0x64a615]
10: (Locker::simple_sync(SimpleLock*, bool*)+0x176) [0x64e506]
11: (Locker::scatter_nudge(ScatterLock*, MDSInternalContextBase*,
bool)+0x3dd) [0x652f6d]
12: (Locker::scatter_tick()+0x1e4) [0x6535a4]
13: (Locker::tick()+0x9) [0x6538b9]
14: (MDSRankDispatcher::tick()+0x1e9) [0x4f00d9]
15: (FunctionContext::finish(int)+0x2c) [0x4d52dc]
16: (Context::complete(int)+0x9) [0x4d31d9]
17: (SafeTimer::timer_thread()+0x18b) [0x7f61367b620b]
18: (SafeTimerThread::entry()+0xd) [0x7f61367b786d]
19: (()+0x76ba) [0x7f61360356ba]
20: (clone()+0x6d) [0x7f613585e41d]