Hi,
On 10 Jan 2023, at 07:10, David Orman <ormandj(a)corenode.com> wrote:
> We ship all of this to our centralized monitoring system (and a lot more) and have
> dashboards/proactive monitoring/alerting with 100PiB+ of Ceph. If you're running Ceph
> in production, I believe host-level monitoring is critical, above and beyond Ceph level.
> Things like inlet/outlet temperature, hardware state of various components, and various
> other details are probably best served by monitoring external to Ceph itself.

I agree with David's suggestions.

> I did a quick glance and didn't see this data (OSD errors re: reads/writes) exposed
> in the Pacific version of Ceph's Prometheus-style exporter, but I may have overlooked
> it. This would be nice to have, as well, if it does not exist.
> We collect drive counters at the host level, and alert at levels prior to general impact.
> A failing drive can cause frustrating latency spikes even before it starts returning
> errors: while the errors are still correctable, the OSD sees nothing other than longer
> latency on operations. Seeing a change in the SMART counters either at a high rate or
> above thresholds you define is most certainly something I would suggest ensuring is
> covered in whatever host-level monitoring you're already performing for production
> usage.
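
To make the "high rate or above thresholds you define" check concrete, here is a minimal
sketch in Python. It is only an illustration: the names (Sample, check_counter) and the
limits are hypothetical, and in practice the same logic would more likely be written as
alert rules in your monitoring system than as custom code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One reading of a drive counter (hypothetical structure)."""
    timestamp: float  # seconds since epoch
    value: int        # raw counter value, e.g. reallocated sectors


def check_counter(samples, max_value, max_rate_per_hour):
    """Return alert reasons for one counter: absolute level and growth rate."""
    alerts = []
    if not samples:
        return alerts

    # Compare the oldest and newest readings in the window.
    ordered = sorted(samples, key=lambda s: s.timestamp)
    first, last = ordered[0], ordered[-1]

    # Absolute threshold: the counter is already higher than we tolerate.
    if last.value > max_value:
        alerts.append("above_threshold")

    # Rate threshold: the counter is growing quickly, even if still small.
    elapsed_hours = (last.timestamp - first.timestamp) / 3600.0
    if elapsed_hours > 0:
        rate = (last.value - first.value) / elapsed_hours
        if rate > max_rate_per_hour:
            alerts.append("high_rate")

    return alerts
```

The rate check is what catches a drive early: a counter that is still well under the
absolute limit but climbing fast is a reason to look at the drive before the OSD is affected.
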
Seems to me that there is no need to reinvent the wheel and create even more GIL problems
for ceph-mgr. A production-ready exporter for smartctl data, with NVMe support, was
released last year [1].
It is written in Go, has CI, and is tested in production with Ceph, so it is ready to go 🙂
[1] https://github.com/prometheus-community/smartctl_exporter