We ship all of this (and much more) to our centralized monitoring system, and have
dashboards, proactive monitoring, and alerting across 100+ PiB of Ceph. If you're running
Ceph in production, I believe host-level monitoring is critical, above and beyond
Ceph-level monitoring. Things like inlet/outlet temperature, the hardware state of
various components, and other such details are probably best served by monitoring
external to Ceph itself.
I took a quick glance and didn't see this data (OSD read/write errors) exposed in
the Pacific version of Ceph's Prometheus-style exporter, but I may have overlooked it.
If it doesn't exist, it would be nice to have as well.
We collect drive counters at the host level, and alert at thresholds below the point of
general impact. Even before a failing drive starts returning errors, it can cause
frustrating latency spikes (correctable errors); the OSD will not see these other than
as longer latency on operations. Watching for a change in the SMART counters, either at
a high rate or above thresholds you define, is most certainly something I would suggest
covering in whatever host-level monitoring you're already performing for production
use.
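The approach above can be sketched in a few lines. This is a minimal illustration,
not our actual tooling: the counter names, thresholds, and sampling interval are all
hypothetical placeholders you would tune for your own hardware.

```python
# Sketch: alert when a SMART counter grows faster than an allowed rate per
# sampling interval, or crosses an absolute ceiling. All names and limits
# below are illustrative, not recommendations.

# Two successive samples of per-drive raw counters (e.g. scraped hourly).
previous = {"sda": {"UDMA_CRC_Error_Count": 2, "Reported_Uncorrect": 0}}
current  = {"sda": {"UDMA_CRC_Error_Count": 9, "Reported_Uncorrect": 1}}

MAX_DELTA_PER_INTERVAL = 5   # hypothetical rate-of-change threshold
ABSOLUTE_CEILING = 50        # hypothetical absolute threshold

def check(prev, curr):
    """Return alert strings for counters rising too fast or too high."""
    alerts = []
    for dev, counters in curr.items():
        for name, value in counters.items():
            delta = value - prev.get(dev, {}).get(name, 0)
            if delta > MAX_DELTA_PER_INTERVAL:
                alerts.append(f"{dev}: {name} rose by {delta} in one interval")
            if value > ABSOLUTE_CEILING:
                alerts.append(f"{dev}: {name} at {value}, above ceiling")
    return alerts

for alert in check(previous, current):
    print(alert)
```

The point is simply that both the rate of change and the absolute level matter;
a slow trickle of CRC errors over months tells a different story than seven in
an hour.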
David
On Mon, Jan 9, 2023, at 17:46, Erik Lindahl wrote:
Hi,
Good points; however, given that Ceph already collects all these statistics, isn't
there a way to set reasonable thresholds and actually have Ceph detect the number of
read errors and suggest that a given drive should be replaced?
It seems a bit strange that we should all have to wait for a PG read error, then log
into each node to check the number of read errors for each device and keep track of it.
Of course it's possible to write scripts for everything, but there must be numerous Ceph
sites with hundreds of OSD nodes, so I'm a bit surprised this isn't more
automated...
Cheers,
Erik
--
Erik Lindahl <erik.lindahl(a)gmail.com>
On 10 Jan 2023 at 00:09 +0100, Anthony D'Atri <aad(a)dreamsnake.net>, wrote:
On Jan 9, 2023, at 17:46, David Orman
<ormandj(a)corenode.com> wrote:
It's important to note we do not suggest using the SMART "OK" indicator as proof
that the drive is healthy. We monitor correctable/uncorrectable error counts, since you
can see a dramatic rise when a drive starts to fail. SMART health will report 'OK'
long after the drive is throwing many uncorrectable errors and needs replacement.
You have to look at the actual counters themselves.
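As a rough illustration of the distinction being drawn here, recent smartmontools can
emit JSON (`smartctl --json`), where the overall status and the raw attribute counters
live in separate fields. The JSON shape below is a trimmed approximation for an ATA
drive; verify the exact field names against your smartmontools version.

```python
import json

# Trimmed, approximate example of `smartctl -a --json /dev/sdX` output for an
# ATA drive. Note the overall status can read "passed" while raw error
# counters are already climbing.
sample = json.loads("""
{
  "smart_status": {"passed": true},
  "ata_smart_attributes": {
    "table": [
      {"id": 187, "name": "Reported_Uncorrect", "raw": {"value": 312}},
      {"id": 199, "name": "UDMA_CRC_Error_Count", "raw": {"value": 4}}
    ]
  }
}
""")

# Counters worth watching; which attributes exist varies by drive model.
WATCHED = {"Reported_Uncorrect", "UDMA_CRC_Error_Count"}

def raw_counters(report):
    """Extract raw values for watched attributes from a smartctl JSON report."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    return {a["name"]: a["raw"]["value"] for a in table if a["name"] in WATCHED}

print("SMART says:", "PASSED" if sample["smart_status"]["passed"] else "FAILED")
print(raw_counters(sample))
```

Here SMART still says "PASSED" even though the drive has reported hundreds of
uncorrectable errors, which is exactly why the raw counters are what you alert on.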
I strongly agree, especially given personal experience with SSD firmware design flaws.
Also, examining UDMA / CRC error rates led to the discovery that certain aftermarket
drive carriers had lower tolerances than those from the chassis vendor, resulting in
drives that were silently slow. Reseating in most cases restored performance.
— aad
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io