Hi Frank,
Glad to hear the testing went well and the Kingston SSDs behaved! Fingers crossed your
issue was just a corner case...
Cheers,
A.
Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
From: Frank Schilder <frans@dtu.dk>
Sent: 13 May 2021 10:15
To: Andrew Walker-Brown <andrew_jbrown@hotmail.com>; ceph-users@ceph.io
Subject: Re: OSD lost: firmware bug in Kingston SSDs?
Hi Andrew,
I did a few power-out tests by pulling the power cord of a server several times. This
server contains a mix of disks, including the Kingston SSDs (also the one that failed
before). Every time, all OSDs recovered and an initiated deep scrub did not find silent
corruptions either. The test was done under production load.
It looks like the OSD crash I observed was caused by unusual and hopefully rare
circumstances.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans@dtu.dk>
Sent: 06 May 2021 15:27:14
To: Andrew Walker-Brown; ceph-users@ceph.io
Subject: [ceph-users] Re: OSD lost: firmware bug in Kingston SSDs?
Hi Andrew,
Thanks, that is reassuring. To be sure, I plan to do a few power-out tests with this
server. I have never had any issues with that so far; it's the first time I have seen a
corrupted OSD.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Andrew Walker-Brown <andrew_jbrown@hotmail.com>
Sent: 06 May 2021 15:23:30
To: Frank Schilder; ceph-users@ceph.io
Subject: RE: OSD lost: firmware bug in Kingston SSDs?
Hi Frank,
I'm running the same SSDs (approx. 20) in Dell servers on HBA330s. I haven't had any
issues and have suffered at least one power outage. I just checked the wcache setting and
it shows as enabled.
Running Octopus 15.1.9 and docker containers. Originally part of a Proxmox cluster but
now standalone Ceph.
Cheers,
A
From: Frank Schilder <frans@dtu.dk>
Sent: 06 May 2021 10:11
To: ceph-users@ceph.io
Subject: [ceph-users] OSD lost: firmware bug in Kingston SSDs?
Hi all,
I lost 2 OSDs deployed on a single Kingston SSD in a rather strange way and am wondering
if anyone has made similar observations or is aware of a firmware bug with these disks.
Disk model: KINGSTON SEDC500M3840G (it ought to be a DC-grade model with
supercapacitors).
Smartctl does not report any drive errors.
Performance per TB is as expected; the OSDs are "ceph-volume lvm batch" bluestore
deployed, everything colocated.
Short version: I disable the volatile write cache on all OSD disks, but the Kingston disks
seem to behave as if this cache is *not* disabled, even though smartctl and hdparm report
wcache=off. The OSD loss looks like the result of an unflushed write cache during a power
loss. I'm afraid now that our cluster might be vulnerable to power loss.
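For context, this is roughly how the write cache state can be queried and disabled with the two tools mentioned above (a sketch: /dev/sdX is a placeholder, both tools need root, and on some drives the setting is not persistent across power cycles):

```shell
# Query the volatile write cache state. Expected output for a disabled
# cache: "Write cache is: Disabled" (smartctl) or
# "write-caching = 0 (off)" (hdparm).
smartctl -g wcache /dev/sdX
hdparm -W /dev/sdX

# Disable the volatile write cache. May need to be reapplied after a
# reboot or power cycle if the drive does not store it persistently.
smartctl -s wcache,off /dev/sdX
hdparm -W0 /dev/sdX
```

If a drive's firmware ignores this setting while still reporting wcache=off, the symptoms would look exactly like the ones described here.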
Long version:
Our disks are on Dell HBA330 Mini controllers and are in state "non-raid". The
controller itself has no cache and is HBA-mode only.
Log entry:
The iDRAC log shows that the disk was removed from a drive group:
---
PDR5 Disk 6 in Backplane 2 of Integrated Storage Controller 1 is removed.
Detailed Description: A physical disk has been removed from the disk group. This alert can
also be caused by loose or defective cables or by problems with the enclosure.
---
The iDRAC reported the disk neither as failed nor as "removed from drive bay". I reseated
the disk and it came back as healthy. I assume it was a connectivity problem with the
backplane (chassis). If I now try to start the OSDs on this disk, I get this error:
starting osd.581 at - osd_data /var/lib/ceph/osd/ceph-581
/var/lib/ceph/osd/ceph-581/journal
starting osd.580 at - osd_data /var/lib/ceph/osd/ceph-580
/var/lib/ceph/osd/ceph-580/journal
2021-05-06 09:23:47.160 7fead5a1fb80 -1 bluefs mount failed to replay log: (5)
Input/output error
2021-05-06 09:23:47.160 7fead5a1fb80 -1 bluestore(/var/lib/ceph/osd/ceph-581) _open_db
failed bluefs mount: (5) Input/output error
2021-05-06 09:23:47.630 7fead5a1fb80 -1 osd.581 0 OSD:init: unable to mount object store
2021-05-06 09:23:47.630 7fead5a1fb80 -1 ** ERROR: osd init failed: (5) Input/output
error
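For anyone hitting the same error: before writing off such an OSD it may be worth running a consistency check with ceph-bluestore-tool, which replays the same bluefs log as OSD startup and should therefore reproduce the failure (a sketch: the path is osd.581 from the log above; run with the OSD stopped, and exact behaviour depends on the Ceph release):

```shell
# Consistency check of the bluestore metadata; this replays the bluefs
# log that failed during OSD startup, so it should show the same error.
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-581

# Attempt to repair inconsistencies found by fsck. If the bluefs log
# itself cannot be replayed, this may not help and the OSD would need
# to be redeployed and backfilled.
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-581
```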
I have removed disks of active OSDs before without any bluestore corruption. While it is
quite possible that this particular "disconnect" event can lead to a broken OSD, there is
another observation where the Kingston disks stick out compared with the other SSD OSDs,
which makes me suspect a firmware problem with the disk cache:
The I/O indicator LED lights up at a significantly lower frequency than on all other SSD
types in the same pool, even though we have 2 OSDs instead of 1 deployed on the Kingstons
(the other disks are 2 TB Micron Pros). While this could be due to a wiring difference,
I'm starting to suspect that it indicates volatile caching.
Does anyone using Kingston DC-M-SSDs have similar or contradicting experience?
How did these disks handle power outages?
Any recommendations?
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io