osds processes shutdown during outage - ceph-users

16 Feb 2021

Hi,

(sorry if this gets posted twice. I forgot a subject in the first mail)

We experienced an outage this morning on a jewel cluster with 1559 osds.
It appeared that a switch uplink in a rack misbehaved, and once we shut that
interface down, ceph health was restored quickly. I have some questions,
though, about osd behaviour that I hope someone can answer.

1 - In a lot of osd logs I saw that neighbours reported the osd down
(while the process was still running and obviously logging). Then, after a
while, the logs show

  * Got signal Interrupt
  * prepare_to_stop starting shutdown

and the osd process stops.

Why does the osd process stop? Is it instructed to do so by the monitor
because neighbours reported it down and ceph wants to avoid flapping?

2 - The osds reported a lot of

  * heartbeat_check: no reply from #ip:#port

When I telnet to the ip and port I get a connection just fine. Is there a
way to run a heartbeat_check from the command line so that we can try to
capture the traffic and determine why it fails?
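
For reference, the telnet test can be scripted so that it runs repeatedly
while a packet capture is active. This is only a rough sketch: it checks
plain TCP reachability and does not speak the ceph messenger protocol, so it
is not a real heartbeat_check. The function names (probe, probe_repeatedly)
and the default timeout and interval values are my own placeholders, not
anything from ceph.

```python
import socket
import time


def probe(host, port, timeout=2.0):
    """Attempt one TCP connect; return (ok, seconds_elapsed)."""
    start = time.monotonic()
    try:
        # Open and immediately close a TCP connection, like a telnet test.
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        ok = False
    return ok, time.monotonic() - start


def probe_repeatedly(host, port, count=10, interval=1.0, timeout=2.0):
    """Run probe() count times, sleeping interval seconds between attempts."""
    results = []
    for attempt in range(count):
        results.append(probe(host, port, timeout))
        if attempt < count - 1:
            time.sleep(interval)
    return results
```

Calling probe_repeatedly with the ip and port from the heartbeat_check log
line returns a list of (ok, seconds) pairs, which would at least show whether
connects fail or slow down intermittently rather than all the time.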

Thanks

Marcel