Hi,
you might be suffering from the same bug we did:
https://tracker.ceph.com/issues/53729
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/KG35GRTN4ZI…
In short, there is a bug that prevents PGlog items from being removed, so
they accumulate on the OSDs. You need to upgrade to Pacific for the fix.
There is also a very easy way to check whether you MIGHT be affected:
https://tracker.ceph.com/issues/53729#note-65
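The check linked above comes down to looking at how much memory the OSDs'
PG logs are consuming. As a rough sketch (the JSON layout follows what
`ceph tell osd.N dump_mempools` prints on recent releases, and the
threshold here is purely illustrative, not from the tracker; follow
note-65 for the authoritative procedure), one could parse the mempool
dump like this:

```python
import json

def pglog_mempool(dump_mempools_json: str) -> dict:
    """Extract the osd_pglog mempool stats from `ceph tell osd.N dump_mempools` output."""
    data = json.loads(dump_mempools_json)
    return data["mempool"]["by_pool"]["osd_pglog"]

def looks_affected(stats: dict, item_threshold: int = 10_000_000) -> bool:
    # Hypothetical threshold: an enormous pglog item count on an OSD
    # suggests log entries are not being trimmed.
    return stats["items"] > item_threshold

# Illustrative dump from one OSD (values made up):
sample = '{"mempool": {"by_pool": {"osd_pglog": {"items": 42000000, "bytes": 9000000000}}}}'
stats = pglog_mempool(sample)
print(stats["items"], looks_affected(stats))
```

Running this per OSD and flagging outliers gives a quick first pass
before digging into the per-PG dup counts described in the tracker.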
On Thu, 30 March 2023 at 17:02, <petersun(a)raksmart.com> wrote:
We experienced a Ceph failure that made the system unresponsive, with no
IOPS or throughput, due to a problematic OSD process on one node. This
caused slow operations and zero IOPS on all other OSDs in the cluster.
The incident timeline was as follows:
- Alert triggered for an OSD problem.
- 6 of the 12 OSDs on the node were down.
- A soft restart was attempted, but a stuck smartmontools process hung
  the server shutdown.
- A hard restart was performed, after which service resumed as usual.
Our Ceph cluster has 19 nodes and 218 OSDs and is running version
15.2.17 (Octopus, stable).
Questions:
1. What is Ceph's detection mechanism? Why couldn't Ceph detect the faulty
node and automatically abandon its resources?
2. Did we miss any patches or bug fixes?
3. Suggestions for improvements to quickly detect and avoid similar issues
in the future?
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io
--
The "UTF-8 problems" self-help group will, as an exception, meet in the
large hall this time.