So, this is following on from a discussion in the #ceph IRC channel, where we seem to have
reached the limit of what we can do.
I have a ~15-node, 311-OSD cluster (20 OSDs per node).
The cluster is Nautilus - the 3 MONs and the first 8 OSD hosts were installed as Mimic and
upgraded to Nautilus with ceph-ansible; the remaining OSD hosts were installed directly as
Nautilus, since they were only added a few weeks ago.
Yesterday, suddenly, about half of the OSDs (~140) were marked Down, and a number of slow
operations were detected.
Initially, examining the logs (and with a bit of help from IRC), I noticed that the
ansible roles used to build the newer OSDs had configured chrony incorrectly, and their
clocks were drifting.
(There were BADAUTHORIZER errors in OSD logs, too.)
I fixed the chrony configuration... and we (including people in IRC) expected everything
to just... stabilise.
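For what it's worth, this is roughly how I verified the clocks after fixing chrony. The 0.05 s threshold mirrors the default mon_clock_drift_allowed; the awk field position assumes chrony's usual "System time : ... seconds fast/slow of NTP time" line, so treat it as a sketch rather than a robust check:

```shell
# On each host: report the local offset from the NTP source.
# "System time     : 0.000019514 seconds fast of NTP time" -> field 4 is the offset.
chronyc tracking | awk '/^System time/ {print (($4 > 0.05) ? "DRIFTING" : "ok")}'

# From any mon: the monitors' own view of inter-mon clock skew.
ceph time-sync-status
```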
Things have not stabilised, which leads me to suspect that there are other issues at
play.
After noticing a number of issues with mgrs deadlocking in Nautilus - e.g.
https://tracker.ceph.com/issues/17170 https://tracker.ceph.com/issues/43048 - I tried
stopping all mgrs and mons, and then slowly bringing them back up.
This has not helped.
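In case the exact sequence matters, this is roughly what I did (assuming systemd-managed daemons; unit names vary with how the cluster was deployed, so adjust for yours):

```shell
# Stop all mgrs first, then the mons, on each relevant host:
systemctl stop ceph-mgr.target
systemctl stop ceph-mon.target

# Bring the mons back one host at a time, confirming quorum before the next:
systemctl start ceph-mon.target
ceph quorum_status -f json | grep -o '"quorum_names":\[[^]]*\]'

# Only once quorum is stable, restart the mgrs:
systemctl start ceph-mgr.target
```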
Interestingly, the OSDs with slow ops (some of which are marked down) report ops_in_flight
in the "wait for new map" state, whilst the lead mon believes those same ops have
timed out.
(I can, of course, telnet to every OSD, even the down ones, from other OSDs, including ones
which report issues talking to them on the same port; and from the lead mon.)
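In case it helps anyone reproduce what I'm seeing, this is roughly how I'm pulling the stuck ops off an affected OSD (osd.12 is a placeholder id; the admin socket command has to run on that OSD's host, and the exact flag-point wording may differ slightly by release):

```shell
# Dump in-flight ops via the OSD's admin socket (run on the OSD's own host).
ceph daemon osd.12 dump_ops_in_flight > /tmp/osd12-ops.json

# Count the ops parked waiting for a new OSDMap.
grep -c 'wait for new map' /tmp/osd12-ops.json
```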
I am wondering if this is an example of:
https://tracker.ceph.com/issues/44184 as we did
create a new pool shortly after adding the new OSD host nodes... but it isn't clear
from that ticket [or the discussion on this list] how to fix this, other than removing the
pool - which I can't do: we need this pool to exist, and the pool it replaces needs
to be decommissioned.
Can anyone advise what I should do next? At present, obviously, the cluster is unusable.