With older releases, Michael Kidd’s log parser scripts were invaluable,
notably map_reporters_to_buckets.sh:
https://github.com/linuxkidd/ceph-log-parsers
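As the name suggests, it maps the OSDs reporting slow requests onto CRUSH
buckets so a hot node or rack stands out. Invocation is roughly as below,
though I’m going from memory and the repo’s README is authoritative:

    # Hypothetical invocation; check the repo for the actual arguments.
    ./map_reporters_to_buckets.sh /var/log/ceph/ceph.log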
With newer releases, at least, one can send `dump_blocked_ops` to an OSD's
admin socket. I collect these via Prometheus / node_exporter, and it's
straightforward to visualize them in Grafana with per-OSD and per-node
queries. The built-in metrics might offer this data too.
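For example (osd.12 is a placeholder; dump_blocked_ops returns JSON with an
"ops" array):

    # Ask a local OSD for its currently blocked ops via the admin socket:
    ceph daemon osd.12 dump_blocked_ops

    # A minimal sketch of exporting a per-OSD count through node_exporter's
    # textfile collector (assumes --collector.textfile.directory is set to
    # /var/lib/node_exporter and jq is installed); run from cron on each OSD
    # host. The metric name ceph_osd_blocked_ops is my own, not a built-in.
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        id=${sock##*/ceph-osd.}; id=${id%.asok}
        n=$(ceph --admin-daemon "$sock" dump_blocked_ops | jq '.ops | length')
        echo "ceph_osd_blocked_ops{osd=\"$id\"} $n"
    done > /var/lib/node_exporter/blocked_ops.prom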
Often there’s a pattern — a given node/rack/OSD is the outlier for blocked ops, with a
cohort of others affected via replication.
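With the hypothetical ceph_osd_blocked_ops metric sketched above, a per-node
rollup makes that kind of outlier pop out:

    # Blocked ops summed per node; a single misbehaving host stands out.
    sum by (instance) (ceph_osd_blocked_ops)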
Other things to look for, also reported by node_exporter: network packet
drops or retransmits, CRC/framing errors on the switch side, a drop in
MemAvailable, high load average, etc. OSD lifetimes, mon op latency, and
large OSD tcmalloc heap freelists are also worth watching (via the admin
socket).
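A couple of illustrative queries for those (metric names here assume a
pre-0.16 node_exporter, to match the disk query below; newer releases append
_total / _bytes):

    # Receive-side packet drops per NIC, per node.
    rate(node_network_receive_drop{device!="lo"}[5m])

    # Available memory as a fraction of total; watch for sustained dips.
    node_memory_MemAvailable / node_memory_MemTotal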
I’m a big fan of Prometheus and Grafana. It’s really straightforward to add
one’s own stats too. Drive write latency can be tracked with something like

    clamp_min(
      delta(node_disk_write_time_ms{ceph_role="osd",device=~"sd.*"}[5m])
        / delta(node_disk_writes_completed{ceph_role="osd",device=~"sd.*"}[5m]),
      0)

i.e. milliseconds of write time per completed write over a 5m window, with
ceph_role being a label we attach at scrape time. This can help identify
outlier drives and firmware issues.
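To surface the worst offenders across the fleet at a glance:

    # Top five drives by average write latency, same expression as above.
    topk(5, clamp_min(
      delta(node_disk_write_time_ms{ceph_role="osd",device=~"sd.*"}[5m])
        / delta(node_disk_writes_completed{ceph_role="osd",device=~"sd.*"}[5m]),
      0))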
Also worth tracking via SMART: drive e2e / UDMA / CRC errors, reallocated
blocks (absolute and rate), and lifetime remaining. SMART is not as
uniformly implemented as one would like, though, so some interpretation and
abstraction is warranted.
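A minimal sketch of scraping a couple of SMART attributes into
textfile-collector format; attribute names and raw-value semantics vary by
vendor and drive (hence the caveat above), and the metric names are my own:

    for dev in /dev/sd?; do
        smartctl -A "$dev" | awk -v dev="$dev" '
            $2 == "Reallocated_Sector_Ct" { printf("smart_reallocated_sectors{device=\"%s\"} %s\n", dev, $10) }
            $2 == "UDMA_CRC_Error_Count"  { printf("smart_udma_crc_errors{device=\"%s\"} %s\n", dev, $10) }'
    done > /var/lib/node_exporter/smart.prom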
— ymmv aad
I am curious, though, how one might have pinpointed a troublesome
host/OSD prior to this. Looking back at some of the detail when
attempting to diagnose, we do see some ops taking longer in
sub_op_committed, but not really a lot else. We'd get an occasional
"slow operation on OSD" warning, but the OSDs were spread across various
Ceph nodes, not just the one with issues, which I'm assuming is due to EC.
There was no real clarity on where the 'jam' was happening, at least
in anything we looked at. I'm wondering if there's a better way to see
what, specifically, is "slow" on a cluster. Even looking at the OSD
perf output wasn't helpful, because all of that was fine; it was
likely due to EC and write operations to OSDs on that specific node in
question. Is there some way to look at a cluster and see which hosts
are problematic/leading to slowness in an EC-based setup?