Hi,
you might be suffering from the same bug we did:
https://tracker.ceph.com/issues/53729
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/KG35GRTN4ZI…
In short, there is a bug that prevents PGlog items from being removed, so
they accumulate on the OSDs. You need to upgrade to Pacific for the fix.
There is also a very easy way to check whether you MIGHT be affected:
https://tracker.ceph.com/issues/53729#note-65
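The check linked above comes down to looking at how much memory the OSDs'
PG logs are consuming. As a rough sketch (the JSON layout follows what
`ceph tell osd.N dump_mempools` prints on recent releases, and the
threshold here is purely illustrative, not from the tracker; follow
note-65 for the authoritative procedure), one could parse the mempool
dump like this:

```python
import json

def pglog_mempool(dump_mempools_json: str) -> dict:
    """Extract the osd_pglog mempool stats from `ceph tell osd.N dump_mempools` output."""
    data = json.loads(dump_mempools_json)
    return data["mempool"]["by_pool"]["osd_pglog"]

def looks_affected(stats: dict, item_threshold: int = 10_000_000) -> bool:
    # Hypothetical threshold: an enormous pglog item count on an OSD
    # suggests log entries are not being trimmed.
    return stats["items"] > item_threshold

# Illustrative dump from one OSD (values made up):
sample = '{"mempool": {"by_pool": {"osd_pglog": {"items": 42000000, "bytes": 9000000000}}}}'
stats = pglog_mempool(sample)
print(stats["items"], looks_affected(stats))
```

Running this per OSD and flagging outliers gives a quick first pass
before digging into the per-PG dup counts described in the tracker.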
On Thu, 30 March 2023 at 17:02, <petersun(a)raksmart.com> wrote:
We experienced a Ceph failure that made the system unresponsive, with no
IOPS or throughput, due to a problematic OSD process on one node. This
caused slow operations and zero IOPS on all other OSDs in the cluster.
The incident timeline was as follows:
- Alert triggered for an OSD problem.
- 6 of the 12 OSDs on the node were down.
- A soft restart was attempted, but a stuck smartmontools process hung
  the server shutdown.
- A hard restart was performed, after which service resumed as usual.
Our Ceph cluster has 19 nodes and 218 OSDs and is running version
15.2.17 (Octopus, stable).
Questions:
1. What is Ceph's detection mechanism? Why couldn't Ceph detect the faulty
node and automatically abandon its resources?
2. Did we miss any patches or bug fixes?
3. Suggestions for improvements to quickly detect and avoid similar issues
in the future?
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io
--
The "UTF-8 problems" self-help group will, as an exception, meet in the
large hall this time.