Hi,
(sorry if this gets posted twice. I forgot a subject in the first mail)
We experienced an outage this morning on a jewel cluster with 1559 osds.
It appeared that a switch uplink in a rack misbehaved, and once we shut that
interface down, ceph health was restored quickly. I have some questions, though,
about osd behaviour that I hope someone can answer.
1 - In a lot of osd logs I saw that neighbours reported the osd down
(while the process was still running and obviously logging). Then, after a
while, the logs show
* Got signal Interrupt
* prepare_to_stop starting shutdown
and the osd process stops.
Why does the osd process stop? Is it instructed to do so by the monitor
because neighbours reported it down and ceph wants to avoid flapping?
2 - The osds reported a lot of
* heartbeat_check: no reply from #ip:#port
When I telnet to the ip and port I get a connection just fine. Is there a
way to run a heartbeat_check from the command line so that we can try to
capture the traffic and determine why it fails?
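To narrow down which peers to capture traffic for, the heartbeat_check lines
could be tallied per address. This is a minimal Python sketch, not anything
shipped with ceph: it assumes each log line carries a literal IPv4 address and
port right after "no reply from", and the name count_missed_heartbeats is made
up for illustration.

```python
import re
from collections import Counter

# Assumed line shape: "... heartbeat_check: no reply from <ipv4>:<port> ..."
# The real osd log format may differ; adjust the pattern to match it.
PATTERN = re.compile(
    r"heartbeat_check: no reply from (\d{1,3}(?:\.\d{1,3}){3}):(\d+)"
)

def count_missed_heartbeats(lines):
    """Count 'no reply' heartbeat messages per (ip, port) peer."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts
```

Passing it the lines of an osd log (for example an open file object) and
looking at counts.most_common() would show whether the missed replies cluster
on peers behind one switch.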
Thanks
Marcel