Hi list,
Had a power outage that killed the whole cluster. CephFS will not start at all, but RBD works
just fine.
I did have 4 unfound objects that I eventually had to roll back or delete, which I don't
really understand, as I should have had a copy of those objects on the other drives?
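For reference, dealing with them came down to commands along these lines (the PG id is just a placeholder taken from ceph health detail, and list_unfound is from memory):
# ceph health detail                      # lists the PGs with unfound objects
# ceph pg 2.4 list_unfound                # show the unfound objects in that PG
# ceph pg 2.4 mark_unfound_lost revert    # or "delete" if no earlier version exists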
2 of the 3 mons and mgrs are damaged, but without any errors.
I have loads stored on CephFS, so I would very much like to get that running as a first
priority.
Thanks!
Alex
Info about the home cluster:
I run 23 OSDs on 3 hosts. 6 of these are an SSD cache tier for the spinning rust, and they
also hold the metadata pool for CephFS, which in retrospect might have to be put back on the
spinning rust.
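If I do end up moving the metadata pool back to the HDDs, my understanding is that it is mostly a matter of assigning the pool a different CRUSH rule, roughly like this (pool and rule names below are placeholders, not necessarily what I actually have):
# ceph osd pool get cephfs_metadata crush_rule
# ceph osd pool set cephfs_metadata crush_rule replicated_hdd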
# ceph -v
ceph version 14.2.4 (65249672c6e6d843510e7e01f8a4b976dcac3db1) nautilus (stable)
# head ceph-mgr.pve21.log.7
2019-10-04 00:00:00.397 7fee56df3700 -1 received signal: Hangup from pkill -1 -x
ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw (PID: 193052) UID: 0
2019-10-04 00:00:00.573 7fee44af1700 0 ms_deliver_dispatch: unhandled message
0x55855f6b7500 mgrreport(mds.pve21 +110-0 packed 1366) v7 from mds.0
v2:192.168.1.21:6800/3783320901
2019-10-04 00:00:00.573 7fee545ee700 1 mgr finish mon failed to return metadata for
mds.pve21: (2) No such file or directory
2019-10-04 00:00:01.553 7fee43aef700 0 log_channel(cluster) log [DBG] : pgmap v2680: 1088
pgs: 1 active+clean+inconsistent, 4 active+recovery_unfound+undersized+degraded+remapped,
1083 active+clean; 4.2 TiB data, 13 TiB used, 15 TiB / 28 TiB avail; 5.7 KiB/s rd, 38
KiB/s wr, 4 op/s; 12/3843345 objects degraded (0.000%); 4/1281115 objects unfound
(0.000%)
2019-10-04 00:00:01.573 7fee44af1700 0 ms_deliver_dispatch: unhandled message
0x55855e486380 mgrreport(mds.pve21 +110-0 packed 1366) v7 from mds.0
v2:192.168.1.21:6800/3783320901
2019-10-04 00:00:01.573 7fee545ee700 1 mgr finish mon failed to return metadata for
mds.pve21: (2) No such file or directory
2019-10-04 00:00:02.573 7fee44af1700 0 ms_deliver_dispatch: unhandled message
0x55855e4b5500 mgrreport(mds.pve21 +110-0 packed 1366) v7 from mds.0
v2:192.168.1.21:6800/3783320901
2019-10-04 00:00:02.573 7fee545ee700 1 mgr finish mon failed to return metadata for
mds.pve21: (2) No such file or directory
2019-10-04 00:00:03.553 7fee43aef700 0 log_channel(cluster) log [DBG] : pgmap v2681: 1088
pgs: 1 active+clean+inconsistent, 4 active+recovery_unfound+undersized+degraded+remapped,
1083 active+clean; 4.2 TiB data, 13 TiB used, 15 TiB / 28 TiB avail; 4.7 KiB/s rd, 33
KiB/s wr, 2 op/s; 12/3843345 objects degraded (0.000%); 4/1281115 objects unfound
(0.000%)
2019-10-04 00:00:03.573 7fee44af1700 0 ms_deliver_dispatch: unhandled message
0x55855e3b0380 mgrreport(mds.pve21 +110-0 packed 1366) v7 from mds.0
v2:192.168.1.21:6800/3783320901
# head ceph-mon.pve21.log.7
2019-10-04 00:00:00.389 7f7c25b52700 -1 received signal: Hangup from killall -q -1
ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw (PID: 193051) UID: 0
2019-10-04 00:00:00.397 7f7c25b52700 -1 received signal: Hangup from pkill -1 -x
ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw (PID: 193052) UID: 0
2019-10-04 00:00:00.573 7f7c1f345700 0 mon.pve21@0(leader) e20 handle_command
mon_command({"prefix": "mds metadata", "who":
"pve21"} v 0) v1
2019-10-04 00:00:00.573 7f7c1f345700 0 log_channel(audit) log [DBG] :
from='mgr.137464844 192.168.1.21:0/2201' entity='mgr.pve21'
cmd=[{"prefix": "mds metadata", "who": "pve21"}]:
dispatch
2019-10-04 00:00:01.573 7f7c1f345700 0 mon.pve21@0(leader) e20 handle_command
mon_command({"prefix": "mds metadata", "who":
"pve21"} v 0) v1
2019-10-04 00:00:01.573 7f7c1f345700 0 log_channel(audit) log [DBG] :
from='mgr.137464844 192.168.1.21:0/2201' entity='mgr.pve21'
cmd=[{"prefix": "mds metadata", "who": "pve21"}]:
dispatch
2019-10-04 00:00:02.573 7f7c1f345700 0 mon.pve21@0(leader) e20 handle_command
mon_command({"prefix": "mds metadata", "who":
"pve21"} v 0) v1
2019-10-04 00:00:02.573 7f7c1f345700 0 log_channel(audit) log [DBG] :
from='mgr.137464844 192.168.1.21:0/2201' entity='mgr.pve21'
cmd=[{"prefix": "mds metadata", "who": "pve21"}]:
dispatch
2019-10-04 00:00:03.573 7f7c1f345700 0 mon.pve21@0(leader) e20 handle_command
mon_command({"prefix": "mds metadata", "who":
"pve21"} v 0) v1
2019-10-04 00:00:03.573 7f7c1f345700 0 log_channel(audit) log [DBG] :
from='mgr.137464844 192.168.1.21:0/2201' entity='mgr.pve21'
cmd=[{"prefix": "mds metadata", "who": "pve21"}]:
dispatch
# head ceph-mds.pve21.log.7
2019-10-04 00:00:00.389 7f1b2f1b5700 -1 received signal: Hangup from killall -q -1
ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw (PID: 193051) UID: 0
2019-10-04 00:00:00.397 7f1b2f1b5700 -1 received signal: Hangup from (PID: 193052) UID:
0
2019-10-04 00:00:04.881 7f1b319ba700 0 --1-
[v2:192.168.1.21:6800/3783320901,v1:192.168.1.21:6801/3783320901] >>
v1:192.168.1.23:0/2770609702 conn(0x556f839bb200 0x556f838d4000 :6801 s=OPENED pgs=5 cs=3
l=0).fault server, going to standby
2019-10-04 00:00:06.157 7f1b321bb700 0 --1-
[v2:192.168.1.21:6800/3783320901,v1:192.168.1.21:6801/3783320901] >>
v1:192.168.1.23:0/2770609702 conn(0x556f839e0000 0x556f83807800 :6801
s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept
replacing existing (lossy) channel (new one lossy=0)
2019-10-04 00:00:06.157 7f1b321bb700 0 --1-
[v2:192.168.1.21:6800/3783320901,v1:192.168.1.21:6801/3783320901] >>
v1:192.168.1.23:0/2770609702 conn(0x556f839e0000 0x556f83807800 :6801
s=READ_FOOTER_AND_DISPATCH pgs=6 cs=4 l=0).handle_message_footer missed message? skipped
from seq 0 to 2
2019-10-04 00:01:19.167 7f1b311b9700 0 --1-
[v2:192.168.1.21:6800/3783320901,v1:192.168.1.21:6801/3783320901] >>
v1:192.168.1.21:0/3200878088 conn(0x556f839c2900 0x556f837f8000 :6801 s=OPENED pgs=2 cs=1
l=0).fault server, going to standby
2019-10-04 00:01:23.555 7f1b311b9700 0 --1-
[v2:192.168.1.21:6800/3783320901,v1:192.168.1.21:6801/3783320901] >>
v1:192.168.1.22:0/2875552603 conn(0x556f839bda80 0x556f837f9000 :6801 s=OPENED pgs=2 cs=1
l=0).fault server, going to standby
2019-10-04 00:02:08.768 7f1b311b9700 0 --1-
[v2:192.168.1.21:6800/3783320901,v1:192.168.1.21:6801/3783320901] >>
v1:192.168.1.23:0/2427365808 conn(0x556f839bd180 0x556f83e9d800 :6801 s=OPENED pgs=2 cs=1
l=0).fault server, going to standby
2019-10-04 00:02:20.140 7f1b311b9700 0 --1-
[v2:192.168.1.21:6800/3783320901,v1:192.168.1.21:6801/3783320901] >>
v1:192.168.1.21:0/3200878088 conn(0x556f839c2900 0x556f837f8000 :6801 s=OPENED pgs=5 cs=3
l=0).fault server, going to standby
2019-10-04 00:02:21.420 7f1b319ba700 0 --1-
[v2:192.168.1.21:6800/3783320901,v1:192.168.1.21:6801/3783320901] >>
v1:192.168.1.21:0/3200878088 conn(0x556f839e0480 0x556f83d7f000 :6801
s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept
replacing existing (lossy) channel (new one lossy=0)
Hi,
I am still having issues accessing my CephFS, but I have managed to pull out some more
interesting logs. I have also raised the log level to 20/20 and intend to upload those logs
as soon as my Ceph tracker account gets accepted.
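For the record, the log levels were bumped roughly like this (quoting from memory, so take the exact option names with a grain of salt; 20/20 is very verbose and needs turning back down afterwards):
# ceph config set mds.pve21 debug_mds 20/20
# ceph config set mds.pve21 debug_ms 1
# ceph config rm mds.pve21 debug_mds      # revert when done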
Oct 17 16:35:22 pve21 kernel: libceph: read_partial_message 000000008ae0e636 signature
check failed
Oct 17 16:35:22 pve21 kernel: libceph: mds0 192.168.1.22:6801 bad crc/signature
Oct 17 16:49:14 pve23 pvestatd[3150]: mount error: exit code 5
Oct 17 16:49:19 pve23 ceph-mon[2373]: 2019-10-17 16:49:19.559 7ff2f21d0700 -1
mon.pve23@2(electing) e20 failed to get devid for : fallback method has serial
'' but no model
[ 39.843048] libceph: read_partial_message 0000000010ae5ee0 signature check failed
[ 39.843062] libceph: mds0 192.168.1.22:6801 bad crc/signature
Oct 17 16:54:18 pve21 ceph-mon[2215]: 2019-10-17 16:54:18.163 7f3ccd47d700 -1
log_channel(cluster) log [ERR] : Health check failed: 1 filesystem is offline
(MDS_ALL_DOWN)
Thanks!
A
Final update.
I switched the settings below from false to true, and everything magically started working!
cephx_require_signatures = true
cephx_cluster_require_signatures = true
cephx_sign_messages = true
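For anyone else hitting the same bad crc/signature errors: these are ordinary ceph.conf options (in the [global] section), and on Nautilus I believe they can also be set through the mon config store, something like the below; clients and daemons may need a restart or remount to pick the change up.
# ceph config set global cephx_require_signatures true
# ceph config set global cephx_cluster_require_signatures true
# ceph config set global cephx_sign_messages true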
On 10/17/19 7:48 PM, Alex L wrote:
> Final update.
> I switched the settings below from false to true, and everything magically started working!
> cephx_require_signatures = true
> cephx_cluster_require_signatures = true
> cephx_sign_messages = true
Are you sure the time is in sync in your cluster after the power outage?
Wido
Hi Wido,
It was one of the first things I checked, yes, and it was synced properly. I have the full
logs, but since everything works now, I am unsure whether I should still upload them to the tracker?
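For reference, the checks were roughly these (Proxmox nodes here, so timedatectl on each host plus the monitors' own view of the skew):
# ceph time-sync-status
# ceph health detail | grep -i clock
# timedatectl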
Thanks,
A