Host crash undetected by ceph health check - ceph-users

7 May 2021

Dear cephers,

today it seems I observed an impossible event for the first time: an OSD host crashed, but
the ceph health monitoring did not recognise the crash. Not a single OSD was marked down
and IO simply stopped, waiting for the crashed OSDs to respond. All that was reported was
slow ops, slow meta data IO, MDS behind on trimming, but no OSD fail. I have rebooted
these machines a lot of times and have never seen the health check fail to recognise that
instantly. The only difference I see is that these were clean shut-downs, not crashes (I
believe the OSDs mark themselves as down).

For debugging this problem, can anyone provide me with a pointer when this could be the
result of a misconfiguration?

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14