Hi Frank,
Glad to hear the testing went well and the Kingston SSDs behaved! Fingers crossed your
issue was just a corner case...
Cheers,
A.
Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
From: Frank Schilder <frans@dtu.dk>
Sent: 13 May 2021 10:15
To: Andrew Walker-Brown <andrew_jbrown@hotmail.com>; ceph-users@ceph.io
Subject: Re: OSD lost: firmware bug in Kingston SSDs?
Hi Andrew,
I did a few power-out tests by pulling the power cord of a server several times. This
server contains a mix of disks, including the Kingston SSDs (also the one that failed
before). Every time, all OSDs recovered and an initiated deep scrub did not find silent
corruptions either. The test was done under production load.
It looks like the OSD crash I observed was caused by unusual and hopefully rare
circumstances.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans@dtu.dk>
Sent: 06 May 2021 15:27:14
To: Andrew Walker-Brown; ceph-users@ceph.io
Subject: [ceph-users] Re: OSD lost: firmware bug in Kingston SSDs?
Hi Andrew,
Thanks, that is reassuring. To be sure, I plan to do a few power-out tests with this
server. I have never had any issues with that so far; it's the first time I have seen a
corrupted OSD.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Andrew Walker-Brown <andrew_jbrown@hotmail.com>
Sent: 06 May 2021 15:23:30
To: Frank Schilder; ceph-users@ceph.io
Subject: RE: OSD lost: firmware bug in Kingston SSDs?
Hi Frank,
I'm running the same SSDs (approx. 20) in Dell servers on HBA330s. I haven't had any
issues and have suffered at least one power outage. I just checked the wcache setting and
it shows as enabled.
Running Octopus 15.1.9 and docker containers. Originally part of a Proxmox cluster but
now standalone Ceph.
Cheers,
A
From: Frank Schilder <frans@dtu.dk>
Sent: 06 May 2021 10:11
To: ceph-users@ceph.io
Subject: [ceph-users] OSD lost: firmware bug in Kingston SSDs?
Hi all,
I lost 2 OSDs deployed on a single Kingston SSD in a rather strange way and am wondering
if anyone has made similar observations or is aware of a firmware bug with these disks.
Disk model: KINGSTON SEDC500M3840G (it ought to be a DC-grade model with
supercapacitors).
Smartctl does not report any drive errors.
Performance per TB is as expected; the OSDs are "ceph-volume lvm batch" bluestore
deployed, everything colocated.
Short version: I disable the volatile write cache on all OSD disks, but the Kingston disks
seem to behave as if this cache is *not* disabled, even though smartctl and hdparm report
wcache=off. The OSD loss looks like the result of an unflushed write cache during a power
loss. I'm afraid now that our cluster might be vulnerable to power loss.
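For context, this is roughly how the write cache state can be queried and disabled with the two tools mentioned above (a sketch: /dev/sdX is a placeholder, both tools need root, and on some drives the setting is not persistent across power cycles):

```shell
# Query the volatile write cache state. Expected output for a disabled
# cache: "Write cache is: Disabled" (smartctl) or
# "write-caching = 0 (off)" (hdparm).
smartctl -g wcache /dev/sdX
hdparm -W /dev/sdX

# Disable the volatile write cache. May need to be reapplied after a
# reboot or power cycle if the drive does not store it persistently.
smartctl -s wcache,off /dev/sdX
hdparm -W0 /dev/sdX
```

If a drive's firmware ignores this setting while still reporting wcache=off, the symptoms would look exactly like the ones described here.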
Long version:
Our disks are on Dell HBA330 Mini controllers and are in state "non-raid". The
controller itself has no cache and is HBA-mode only.
Log entry:
The iDRAC log shows that the disk was removed from a drive group:
---
PDR5 Disk 6 in Backplane 2 of Integrated Storage Controller 1 is removed.
Detailed Description: A physical disk has been removed from the disk group. This alert can
also be caused by loose or defective cables or by problems with the enclosure.
---
The iDRAC reported the disk neither as failed nor as "removed from drive bay". I reseated
the disk and it came back as healthy. I assume it was a connectivity problem with the
backplane (chassis). If I now try to start the OSDs on this disk, I get this error:
starting osd.581 at - osd_data /var/lib/ceph/osd/ceph-581
/var/lib/ceph/osd/ceph-581/journal
starting osd.580 at - osd_data /var/lib/ceph/osd/ceph-580
/var/lib/ceph/osd/ceph-580/journal
2021-05-06 09:23:47.160 7fead5a1fb80 -1 bluefs mount failed to replay log: (5)
Input/output error
2021-05-06 09:23:47.160 7fead5a1fb80 -1 bluestore(/var/lib/ceph/osd/ceph-581) _open_db
failed bluefs mount: (5) Input/output error
2021-05-06 09:23:47.630 7fead5a1fb80 -1 osd.581 0 OSD:init: unable to mount object store
2021-05-06 09:23:47.630 7fead5a1fb80 -1 ** ERROR: osd init failed: (5) Input/output
error
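For anyone hitting the same error: before writing off such an OSD it may be worth running a consistency check with ceph-bluestore-tool, which replays the same bluefs log as OSD startup and should therefore reproduce the failure (a sketch: the path is osd.581 from the log above; run with the OSD stopped, and exact behaviour depends on the Ceph release):

```shell
# Consistency check of the bluestore metadata; this replays the bluefs
# log that failed during OSD startup, so it should show the same error.
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-581

# Attempt to repair inconsistencies found by fsck. If the bluefs log
# itself cannot be replayed, this may not help and the OSD would need
# to be redeployed and backfilled.
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-581
```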
I have removed disks of active OSDs before without any bluestore corruption. While it is
quite possible that this particular "disconnect" event can lead to a broken OSD, there is
another observation where the Kingston disks stick out compared with the other SSD OSDs,
which makes me suspect a firmware problem with the disk cache:
The I/O indicator LED lights up at a significantly lower frequency than on all other SSD
types in the same pool, even though we have 2 OSDs instead of 1 deployed on the Kingstons
(the other disks are 2 TB Micron Pros). While this could be due to a wiring difference,
I'm starting to suspect that it indicates volatile caching.
Does anyone using Kingston DC-M-SSDs have similar or contradicting experience?
How did these disks handle power outages?
Any recommendations?
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io