Hello,
The MDS process crashed suddenly. After I tried to restart it, it failed to replay the journal
and kept restarting in a loop.
Just to summarize, here is what happened:
1/ The cluster is up and running with 3 nodes (mon and mds on the same nodes) and 3 OSDs.
2/ After a few days, 2 of the 3 MDS daemons (the standby-replay and the standby) crashed. No
PID is left running, and ceph status reports the daemons as down.
3/ I try to restart them:
- Sometimes the restart fails with a segmentation fault. Here is the relevant part of the
ceph-mds.log file:
-20> 2020-07-17T13:50:27.888+0000 7fc8c6c51700 10 monclient: _renew_subs
-19> 2020-07-17T13:50:27.888+0000 7fc8c6c51700 10 monclient: _send_mon_message to
mon.2 at v1:172.31.36.98:6789/0
-18> 2020-07-17T13:50:27.888+0000 7fc8c6c51700 10 monclient: handle_get_version_reply
finishing 0x559dcf9530c0 version 269
-17> 2020-07-17T13:50:27.888+0000 7fc8c6c51700 10 monclient: handle_get_version_reply
finishing 0x559dcfa87520 version 269
-16> 2020-07-17T13:50:27.888+0000 7fc8c6c51700 10 monclient: handle_get_version_reply
finishing 0x559dcfa875c0 version 269
-15> 2020-07-17T13:50:27.888+0000 7fc8c6c51700 10 monclient: handle_get_version_reply
finishing 0x559dcfa871c0 version 269
-14> 2020-07-17T13:50:27.888+0000 7fc8c8c55700 10 monclient: get_auth_request con
0x559dcfada000 auth_method 0
-13> 2020-07-17T13:50:27.888+0000 7fc8c9456700 10 monclient: get_auth_request con
0x559dcfada800 auth_method 0
-12> 2020-07-17T13:50:27.892+0000 7fc8bfc43700 1 mds.282966.journaler.mdlog(ro)
recover start
-11> 2020-07-17T13:50:27.892+0000 7fc8bfc43700 1 mds.282966.journaler.mdlog(ro)
read_head
-10> 2020-07-17T13:50:27.892+0000 7fc8bfc43700 4 mds.0.log Waiting for journal 0x200
to recover...
-9> 2020-07-17T13:50:27.893+0000 7fc8c0444700 1 mds.282966.journaler.mdlog(ro)
_finish_read_head loghead(trim 4194304, expire 4231216, write 4329405, stream_format 1).
probing for end of log (from 4329405)...
-8> 2020-07-17T13:50:27.893+0000 7fc8c0444700 1 mds.282966.journaler.mdlog(ro) probing
for end of the log
-7> 2020-07-17T13:50:27.893+0000 7fc8c0444700 1 mds.282966.journaler.mdlog(ro)
_finish_probe_end write_pos = 4329949 (header had 4329405). recovered.
-6> 2020-07-17T13:50:27.893+0000 7fc8bfc43700 4 mds.0.log Journal 0x200 recovered.
-5> 2020-07-17T13:50:27.893+0000 7fc8bfc43700 4 mds.0.log Recovered journal 0x200 in
format 1
-4> 2020-07-17T13:50:27.893+0000 7fc8bfc43700 2 mds.0.0 Booting: 1:
loading/discovering base inodes
-3> 2020-07-17T13:50:27.893+0000 7fc8bfc43700 0 mds.0.cache creating system inode with
ino:0x100
-2> 2020-07-17T13:50:27.894+0000 7fc8bfc43700 0 mds.0.cache creating system inode with
ino:0x1
-1> 2020-07-17T13:50:27.894+0000 7fc8c0444700 2 mds.0.0 Booting: 2: replaying mds log
0> 2020-07-17T13:50:27.896+0000 7fc8bec41700 -1 *** Caught signal (Segmentation fault)
**
in thread 7fc8bec41700 thread_name:md_log_replay
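To get more detail on this segfault, I plan to pull the full crash report through the crash
module (assuming the crashed daemon managed to post one) and to raise the MDS debug level
before the next restart, roughly like this (<crash-id> being whatever id shows up in the
listing):
# ceph crash ls
# ceph crash info <crash-id>
# ceph config set mds debug_mds 20
# ceph config set mds debug_journaler 20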
- Sometimes the restart works, but journal replay still fails even after I reset the journal
(# cephfs-journal-tool --rank=cephfs:0 journal reset) on the failed nodes (the fuller recovery
sequence I intend to try next is sketched after the status output below). The cluster status
looks like this:
# ceph status -w
cluster:
id: acd73aa2-8cdd-41a3-9941-fb397aa1d79e
health: HEALTH_WARN
1 daemons have recently crashed
services:
mon: 3 daemons, quorum 2,0,1 (age 3w)
mgr: mgr.0(active, since 11w), standbys: mgr.2, mgr.1
mds: cephfs:1 {0=node1=up:active} 1 up:standby-replay 1 up:standby
osd: 3 osds: 3 up (since 33h), 3 in (since 11w)
task status:
scrub status:
mds.node1: idle
data:
pools: 3 pools, 49 pgs
objects: 165 objects, 157 MiB
usage: 3.5 GiB used, 41 TiB / 41 TiB avail
pgs: 49 active+clean
io:
client: 1.8 MiB/s rd, 4 op/s rd, 0 op/s wr
2020-10-05T13:32:03.798231+0000 mds.node0 [ERR] failure replaying journal (EMetaBlob)
2020-10-05T13:32:03.851986+0000 mon.2 [INF] daemon mds.node0 restarted
2020-10-05T13:32:04.605163+0000 mds.node0 [ERR] failure replaying journal (EMetaBlob)
2020-10-05T13:32:08.652989+0000 mon.2 [INF] daemon mds.node0 restarted
2020-10-05T13:32:08.916347+0000 mds.node0 [ERR] failure replaying journal (EMetaBlob)
2020-10-05T13:32:12.961902+0000 mon.2 [INF] daemon mds.node0 restarted
2020-10-05T13:32:13.974410+0000 mds.node0 [ERR] failure replaying journal (EMetaBlob)
2020-10-05T13:32:14.023126+0000 mon.2 [INF] daemon mds.node0 restarted
2020-10-05T13:32:14.610039+0000 mds.node0 [ERR] failure replaying journal (EMetaBlob)
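As mentioned above, the only recovery step I have run so far is the plain journal reset. If I
understand the disaster-recovery documentation correctly, the full sequence is roughly the
following (please correct me if this is wrong, I have not applied it yet):
# cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
# cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
# cephfs-journal-tool --rank=cephfs:0 journal reset
# cephfs-table-tool all reset session
The export is only there to keep a backup of the journal before touching it.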
Questions:
- Why do 2 of the 3 MDS daemons sometimes crash? I suspect the client (kernel 4.20) on which
a CephFS in-tree provisioner (not CSI) for Kubernetes is running. How can I confirm that this
client is the cause?
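To try to pin this on that client, I was thinking of dumping the MDS client sessions and
checking the client metadata (kernel version, hostname), roughly:
# ceph tell mds.node1 client ls
or, on the MDS host itself:
# ceph daemon mds.node1 session ls
Is checking which sessions/caps that 4.20 client held at crash time the right way to correlate
the crash with a specific client, or is there a better approach?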
Thanks for your support