[ceph-users] Re: Ceph Failure and OSD Node Stuck Incident

31 Mar 2023

Hi Peter, from my experience I would recommend replacing the Samsung EVO
SSDs with datacenter SSDs.
Regards, Joachim
________________________________________
Clyso GmbH - Ceph Foundation Member

<petersun(a)raksmart.com> wrote on Thu, 30 March 2023, 16:37:

...
 We encountered a Ceph failure in which the cluster became unresponsive,
 with no IOPS or throughput, after one node failed. Upon investigation, it
 appears that the OSD process on one of the Ceph storage nodes was stuck,
 although the node still responded to ping. During the failure, Ceph did
 not recognize the problematic node, so all other OSDs in the cluster
 experienced slow operations and the cluster served no IOPS at all.

 Here's the timeline of the incident:

 - At 10:40, an alert is triggered, indicating a problem with the OSD.
 - After the alert, Ceph becomes unresponsive with no IOPS or throughput.
 - At 11:26, an engineer discovers that there is a gradual OSD failure,
 with 6 out of 12 OSDs on the node being down.
 - At 11:46, the Ceph engineer is unable to SSH into the faulty node and
 attempts a soft restart, but the "smartmontools" process is stuck while
 shutting down the server. Ping works during this time.
 - After waiting for about one or two minutes, a hard restart is attempted
 for the server.
 - At 11:57, after the Ceph node starts normally, service resumes as usual,
 indicating that the issue has been resolved.

 Here is some basic information about our services:

 - `Mon: 5 daemons, quorum host001, host002, host003, host004, host005 (age
 4w)`
 - `Mgr: host005 (active, since 4w), standbys: host001, host002, host003,
 host004`
 - `Osd: 218 osds: 218 up (since 22h), 218 in (since 22h)`

 We have a cluster of 19 nodes: 15 SSD nodes and 4 HDD nodes, with 218 OSDs
 in total. Each SSD node has 11 OSDs on Samsung 870 EVO SSDs, with the
 DB/WAL for each drive on a 1.6 TB NVMe drive. We are using Ceph version
 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable).

 Here is the health check detail:
 [root@node21 ~]#  ceph health detail
 HEALTH_WARN 1 osds down; Reduced data availability: 12 pgs inactive, 12
 pgs peering; Degraded data redundancy: 272273/43967625 objects degraded
 (0.619%), 88 pgs degraded, 5 pgs undersized; 18192 slow ops, oldest one
 blocked for 3730 sec, daemons
 [osd.0,osd.1,osd.101,osd.103,osd.107,osd.108,osd.109,osd.11,osd.111,osd.112]...
 have slow ops.
 [WRN] OSD_DOWN: 1 osds down
         osd.174 (root=default,host=hkhost031) is down
 [WRN] PG_AVAILABILITY: Reduced data availability: 12 pgs inactive, 12 pgs
 peering
         pg 2.dc is stuck peering for 49m, current state peering, last
 acting [87,95,172]
         pg 2.e2 is stuck peering for 15m, current state peering, last
 acting [51,177,97]

 ......
         pg 2.f7e is active+undersized+degraded, acting [10,214]
         pg 2.f84 is active+undersized+degraded, acting [91,52]
 [WRN] SLOW_OPS: 18192 slow ops, oldest one blocked for 3730 sec, daemons
 [osd.0,osd.1,osd.101,osd.103,osd.107,osd.108,osd.109,osd.11,osd.111,osd.112]...
 have slow ops.
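
 To make the health output above easier to act on, here is a minimal Python
 sketch that pulls the headline numbers (down OSDs, slow-op count, oldest
 blocked time, PGs stuck peering) out of the plain-text output. The function
 name `summarize_health` and the regular expressions are illustrative
 assumptions written against the excerpt shown here, not part of Ceph. For
 real monitoring, structured output such as `ceph health detail --format
 json` is likely a sturdier input than scraping text.

```python
import re


def summarize_health(text: str) -> dict:
    """Extract a few headline values from `ceph health detail` text output.

    Hypothetical helper: the patterns below are assumptions based only on
    the excerpt in this message and may not cover other health checks.
    """
    # Mail clients hard-wrap long lines, so collapse all whitespace first
    # to let the patterns match across the wrapped line breaks.
    flat = " ".join(text.split())

    summary = {
        "down_osds": re.findall(r"(osd\.\d+) \([^)]*\) is down", flat),
        "stuck_peering_pgs": re.findall(r"pg (\S+) is stuck peering", flat),
        "slow_ops": None,
        "oldest_blocked_sec": None,
    }

    match = re.search(
        r"SLOW_OPS: (\d+) slow ops, oldest one blocked for (\d+) sec", flat
    )
    if match:
        summary["slow_ops"] = int(match.group(1))
        summary["oldest_blocked_sec"] = int(match.group(2))
    return summary


# Excerpt copied from the health detail output in this message.
sample = """
[WRN] OSD_DOWN: 1 osds down
        osd.174 (root=default,host=hkhost031) is down
[WRN] PG_AVAILABILITY: Reduced data availability: 12 pgs inactive, 12 pgs
peering
        pg 2.dc is stuck peering for 49m, current state peering, last
acting [87,95,172]
        pg 2.e2 is stuck peering for 15m, current state peering, last
acting [51,177,97]
[WRN] SLOW_OPS: 18192 slow ops, oldest one blocked for 3730 sec, daemons
[osd.0,osd.1,osd.101,osd.103,osd.107,osd.108,osd.109,osd.11,osd.111,osd.112]...
have slow ops.
"""

summary = summarize_health(sample)
print(summary)
```

 A check like this could feed an alert when `slow_ops` is non-zero while
 only one or a few OSDs are reported down, which is the pattern seen in
 this incident.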

 I have the following questions:

 1. Why couldn't Ceph detect the faulty node and automatically abandon its
 resources? Can anyone provide more troubleshooting guidance for this case?
 2. What is Ceph's detection mechanism and where can I find related
 information? All of our production cloud machines were affected and
 suspended. If RBD is unstable, we cannot continue to use Ceph technology
 for our RBD source.
 3. Did we miss any patches or bug fixes?
 4. Is there anyone who can suggest improvements and how we can quickly
 detect and avoid similar issues in the future?
 _______________________________________________
 ceph-users mailing list -- ceph-users(a)ceph.io
 To unsubscribe send an email to ceph-users-leave(a)ceph.io
