Hi list,
Had a power outage that killed the whole cluster. CephFS will not start at all, but RBD works
just fine.
I did have 4 unfound objects that I eventually had to roll back or delete, which I don't
really understand, as I should have had a copy of those objects on the other drives?
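For reference, dealing with them came down to commands along these lines (the PG id is just a placeholder taken from ceph health detail, and list_unfound is from memory):
# ceph health detail                      # lists the PGs with unfound objects
# ceph pg 2.4 list_unfound                # show the unfound objects in that PG
# ceph pg 2.4 mark_unfound_lost revert    # or "delete" if no earlier version exists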
2 of the 3 mons and mgrs are damaged, but without any errors.
I have loads stored on CephFS, so I would very much like to get that running as a first
priority.
Thanks!
Alex
Info about the home cluster:
I run 23 OSDs on 3 hosts. 6 of these are an SSD cache tier for the spinning rust, and they
also hold the metadata pool for CephFS, which in retrospect might have to be put back on the
spinning rust.
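If I do end up moving the metadata pool back to the HDDs, my understanding is that it is mostly a matter of assigning the pool a different CRUSH rule, roughly like this (pool and rule names below are placeholders, not necessarily what I actually have):
# ceph osd pool get cephfs_metadata crush_rule
# ceph osd pool set cephfs_metadata crush_rule replicated_hdd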
# ceph -v
ceph version 14.2.4 (65249672c6e6d843510e7e01f8a4b976dcac3db1) nautilus (stable)
# head ceph-mgr.pve21.log.7
2019-10-04 00:00:00.397 7fee56df3700 -1 received signal: Hangup from pkill -1 -x
ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw (PID: 193052) UID: 0
2019-10-04 00:00:00.573 7fee44af1700 0 ms_deliver_dispatch: unhandled message
0x55855f6b7500 mgrreport(mds.pve21 +110-0 packed 1366) v7 from mds.0
v2:192.168.1.21:6800/3783320901
2019-10-04 00:00:00.573 7fee545ee700 1 mgr finish mon failed to return metadata for
mds.pve21: (2) No such file or directory
2019-10-04 00:00:01.553 7fee43aef700 0 log_channel(cluster) log [DBG] : pgmap v2680: 1088
pgs: 1 active+clean+inconsistent, 4 active+recovery_unfound+undersized+degraded+remapped,
1083 active+clean; 4.2 TiB data, 13 TiB used, 15 TiB / 28 TiB avail; 5.7 KiB/s rd, 38
KiB/s wr, 4 op/s; 12/3843345 objects degraded (0.000%); 4/1281115 objects unfound
(0.000%)
2019-10-04 00:00:01.573 7fee44af1700 0 ms_deliver_dispatch: unhandled message
0x55855e486380 mgrreport(mds.pve21 +110-0 packed 1366) v7 from mds.0
v2:192.168.1.21:6800/3783320901
2019-10-04 00:00:01.573 7fee545ee700 1 mgr finish mon failed to return metadata for
mds.pve21: (2) No such file or directory
2019-10-04 00:00:02.573 7fee44af1700 0 ms_deliver_dispatch: unhandled message
0x55855e4b5500 mgrreport(mds.pve21 +110-0 packed 1366) v7 from mds.0
v2:192.168.1.21:6800/3783320901
2019-10-04 00:00:02.573 7fee545ee700 1 mgr finish mon failed to return metadata for
mds.pve21: (2) No such file or directory
2019-10-04 00:00:03.553 7fee43aef700 0 log_channel(cluster) log [DBG] : pgmap v2681: 1088
pgs: 1 active+clean+inconsistent, 4 active+recovery_unfound+undersized+degraded+remapped,
1083 active+clean; 4.2 TiB data, 13 TiB used, 15 TiB / 28 TiB avail; 4.7 KiB/s rd, 33
KiB/s wr, 2 op/s; 12/3843345 objects degraded (0.000%); 4/1281115 objects unfound
(0.000%)
2019-10-04 00:00:03.573 7fee44af1700 0 ms_deliver_dispatch: unhandled message
0x55855e3b0380 mgrreport(mds.pve21 +110-0 packed 1366) v7 from mds.0
v2:192.168.1.21:6800/3783320901
# head ceph-mon.pve21.log.7
2019-10-04 00:00:00.389 7f7c25b52700 -1 received signal: Hangup from killall -q -1
ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw (PID: 193051) UID: 0
2019-10-04 00:00:00.397 7f7c25b52700 -1 received signal: Hangup from pkill -1 -x
ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw (PID: 193052) UID: 0
2019-10-04 00:00:00.573 7f7c1f345700 0 mon.pve21@0(leader) e20 handle_command
mon_command({"prefix": "mds metadata", "who":
"pve21"} v 0) v1
2019-10-04 00:00:00.573 7f7c1f345700 0 log_channel(audit) log [DBG] :
from='mgr.137464844 192.168.1.21:0/2201' entity='mgr.pve21'
cmd=[{"prefix": "mds metadata", "who": "pve21"}]:
dispatch
2019-10-04 00:00:01.573 7f7c1f345700 0 mon.pve21@0(leader) e20 handle_command
mon_command({"prefix": "mds metadata", "who":
"pve21"} v 0) v1
2019-10-04 00:00:01.573 7f7c1f345700 0 log_channel(audit) log [DBG] :
from='mgr.137464844 192.168.1.21:0/2201' entity='mgr.pve21'
cmd=[{"prefix": "mds metadata", "who": "pve21"}]:
dispatch
2019-10-04 00:00:02.573 7f7c1f345700 0 mon.pve21@0(leader) e20 handle_command
mon_command({"prefix": "mds metadata", "who":
"pve21"} v 0) v1
2019-10-04 00:00:02.573 7f7c1f345700 0 log_channel(audit) log [DBG] :
from='mgr.137464844 192.168.1.21:0/2201' entity='mgr.pve21'
cmd=[{"prefix": "mds metadata", "who": "pve21"}]:
dispatch
2019-10-04 00:00:03.573 7f7c1f345700 0 mon.pve21@0(leader) e20 handle_command
mon_command({"prefix": "mds metadata", "who":
"pve21"} v 0) v1
2019-10-04 00:00:03.573 7f7c1f345700 0 log_channel(audit) log [DBG] :
from='mgr.137464844 192.168.1.21:0/2201' entity='mgr.pve21'
cmd=[{"prefix": "mds metadata", "who": "pve21"}]:
dispatch
# head ceph-mds.pve21.log.7
2019-10-04 00:00:00.389 7f1b2f1b5700 -1 received signal: Hangup from killall -q -1
ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw (PID: 193051) UID: 0
2019-10-04 00:00:00.397 7f1b2f1b5700 -1 received signal: Hangup from (PID: 193052) UID:
0
2019-10-04 00:00:04.881 7f1b319ba700 0 --1-
[v2:192.168.1.21:6800/3783320901,v1:192.168.1.21:6801/3783320901] >>
v1:192.168.1.23:0/2770609702 conn(0x556f839bb200 0x556f838d4000 :6801 s=OPENED pgs=5 cs=3
l=0).fault server, going to standby
2019-10-04 00:00:06.157 7f1b321bb700 0 --1-
[v2:192.168.1.21:6800/3783320901,v1:192.168.1.21:6801/3783320901] >>
v1:192.168.1.23:0/2770609702 conn(0x556f839e0000 0x556f83807800 :6801
s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept
replacing existing (lossy) channel (new one lossy=0)
2019-10-04 00:00:06.157 7f1b321bb700 0 --1-
[v2:192.168.1.21:6800/3783320901,v1:192.168.1.21:6801/3783320901] >>
v1:192.168.1.23:0/2770609702 conn(0x556f839e0000 0x556f83807800 :6801
s=READ_FOOTER_AND_DISPATCH pgs=6 cs=4 l=0).handle_message_footer missed message? skipped
from seq 0 to 2
2019-10-04 00:01:19.167 7f1b311b9700 0 --1-
[v2:192.168.1.21:6800/3783320901,v1:192.168.1.21:6801/3783320901] >>
v1:192.168.1.21:0/3200878088 conn(0x556f839c2900 0x556f837f8000 :6801 s=OPENED pgs=2 cs=1
l=0).fault server, going to standby
2019-10-04 00:01:23.555 7f1b311b9700 0 --1-
[v2:192.168.1.21:6800/3783320901,v1:192.168.1.21:6801/3783320901] >>
v1:192.168.1.22:0/2875552603 conn(0x556f839bda80 0x556f837f9000 :6801 s=OPENED pgs=2 cs=1
l=0).fault server, going to standby
2019-10-04 00:02:08.768 7f1b311b9700 0 --1-
[v2:192.168.1.21:6800/3783320901,v1:192.168.1.21:6801/3783320901] >>
v1:192.168.1.23:0/2427365808 conn(0x556f839bd180 0x556f83e9d800 :6801 s=OPENED pgs=2 cs=1
l=0).fault server, going to standby
2019-10-04 00:02:20.140 7f1b311b9700 0 --1-
[v2:192.168.1.21:6800/3783320901,v1:192.168.1.21:6801/3783320901] >>
v1:192.168.1.21:0/3200878088 conn(0x556f839c2900 0x556f837f8000 :6801 s=OPENED pgs=5 cs=3
l=0).fault server, going to standby
2019-10-04 00:02:21.420 7f1b319ba700 0 --1-
[v2:192.168.1.21:6800/3783320901,v1:192.168.1.21:6801/3783320901] >>
v1:192.168.1.21:0/3200878088 conn(0x556f839e0480 0x556f83d7f000 :6801
s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept
replacing existing (lossy) channel (new one lossy=0)
Hi,
I am still having issues accessing my CephFS, but I have managed to pull out some more
interesting logs. I have also raised the log level to 20/20 and intend to upload those logs
as soon as my Ceph tracker account gets accepted.
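For the record, the log levels were bumped roughly like this (quoting from memory, so take the exact option names with a grain of salt; 20/20 is very verbose and needs turning back down afterwards):
# ceph config set mds.pve21 debug_mds 20/20
# ceph config set mds.pve21 debug_ms 1
# ceph config rm mds.pve21 debug_mds      # revert when done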
Oct 17 16:35:22 pve21 kernel: libceph: read_partial_message 000000008ae0e636 signature
check failed
Oct 17 16:35:22 pve21 kernel: libceph: mds0 192.168.1.22:6801 bad crc/signature
Oct 17 16:49:14 pve23 pvestatd[3150]: mount error: exit code 5
Oct 17 16:49:19 pve23 ceph-mon[2373]: 2019-10-17 16:49:19.559 7ff2f21d0700 -1
mon.pve23@2(electing) e20 failed to get devid for : fallback method has serial
'' but no model
[ 39.843048] libceph: read_partial_message 0000000010ae5ee0 signature check failed
[ 39.843062] libceph: mds0 192.168.1.22:6801 bad crc/signature
Oct 17 16:54:18 pve21 ceph-mon[2215]: 2019-10-17 16:54:18.163 7f3ccd47d700 -1
log_channel(cluster) log [ERR] : Health check failed: 1 filesystem is offline
(MDS_ALL_DOWN)
Thanks!
A
Final update.
I switched the settings below from false to true, and everything magically started working!
cephx_require_signatures = true
cephx_cluster_require_signatures = true
cephx_sign_messages = true
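For anyone else hitting the same bad crc/signature errors: these are ordinary ceph.conf options (in the [global] section), and on Nautilus I believe they can also be set through the mon config store, something like the below; clients and daemons may need a restart or remount to pick the change up.
# ceph config set global cephx_require_signatures true
# ceph config set global cephx_cluster_require_signatures true
# ceph config set global cephx_sign_messages true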
On 10/17/19 7:48 PM, Alex L wrote:
> Final update.
> I switched the settings below from false to true, and everything magically started working!
> cephx_require_signatures = true
> cephx_cluster_require_signatures = true
> cephx_sign_messages = true
Are you sure the time is in sync in your cluster after the power outage?
Wido
Hi Wido,
It was one of the first things I checked, yes, and it was synced properly. I have the full
logs, but since everything works now, I am unsure whether I should still upload them to the tracker?
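For reference, the checks were roughly these (Proxmox nodes here, so timedatectl on each host plus the monitors' own view of the skew):
# ceph time-sync-status
# ceph health detail | grep -i clock
# timedatectl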
Thanks,
A