With older releases, Michael Kidd’s log parser scripts were invaluable,
notably map_reporters_to_buckets.sh:
https://github.com/linuxkidd/ceph-log-parsers
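As the name suggests, it maps the OSDs reporting slow requests onto CRUSH
buckets so a hot node or rack stands out. Invocation is roughly as below,
though I’m going from memory and the repo’s README is authoritative:

    # Hypothetical invocation; check the repo for the actual arguments.
    ./map_reporters_to_buckets.sh /var/log/ceph/ceph.log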
With newer releases, at least, one can send `dump_blocked_ops` to an OSD's
admin socket. I collect these via Prometheus / node_exporter, and it's
straightforward to visualize them in Grafana with per-OSD and per-node
queries. The built-in metrics might offer this data too.
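For example (osd.12 is a placeholder; dump_blocked_ops returns JSON with an
"ops" array):

    # Ask a local OSD for its currently blocked ops via the admin socket:
    ceph daemon osd.12 dump_blocked_ops

    # A minimal sketch of exporting a per-OSD count through node_exporter's
    # textfile collector (assumes --collector.textfile.directory is set to
    # /var/lib/node_exporter and jq is installed); run from cron on each OSD
    # host. The metric name ceph_osd_blocked_ops is my own, not a built-in.
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        id=${sock##*/ceph-osd.}; id=${id%.asok}
        n=$(ceph --admin-daemon "$sock" dump_blocked_ops | jq '.ops | length')
        echo "ceph_osd_blocked_ops{osd=\"$id\"} $n"
    done > /var/lib/node_exporter/blocked_ops.prom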
Often there’s a pattern — a given node/rack/OSD is the outlier for blocked ops, with a
cohort of others affected via replication.
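With the hypothetical ceph_osd_blocked_ops metric sketched above, a per-node
rollup makes that kind of outlier pop out:

    # Blocked ops summed per node; a single misbehaving host stands out.
    sum by (instance) (ceph_osd_blocked_ops)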
Other things to look for, also reported by node_exporter: network packet
drops or retransmits, CRC/framing errors on the switch side, a drop in
MemAvailable, high load average, etc. OSD lifetimes, mon op latency, and
large OSD tcmalloc heap freelists are also worth watching (via the admin
socket).
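A couple of illustrative queries for those (metric names here assume a
pre-0.16 node_exporter, to match the disk query below; newer releases append
_total / _bytes):

    # Receive-side packet drops per NIC, per node.
    rate(node_network_receive_drop{device!="lo"}[5m])

    # Available memory as a fraction of total; watch for sustained dips.
    node_memory_MemAvailable / node_memory_MemTotal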
I’m a big fan of Prometheus and Grafana. It’s really straightforward to add
one’s own stats too. Drive write latency can be tracked with something like

    clamp_min(
      delta(node_disk_write_time_ms{ceph_role="osd",device=~"sd.*"}[5m])
        / delta(node_disk_writes_completed{ceph_role="osd",device=~"sd.*"}[5m]),
      0)

i.e. milliseconds of write time per completed write over a 5m window, with
ceph_role being a label we attach at scrape time. This can help identify
outlier drives and firmware issues.
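To surface the worst offenders across the fleet at a glance:

    # Top five drives by average write latency, same expression as above.
    topk(5, clamp_min(
      delta(node_disk_write_time_ms{ceph_role="osd",device=~"sd.*"}[5m])
        / delta(node_disk_writes_completed{ceph_role="osd",device=~"sd.*"}[5m]),
      0))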
Also worth tracking via SMART: drive e2e / UDMA / CRC errors, reallocated
blocks (absolute and rate), and lifetime remaining. SMART is not as
uniformly implemented as one would like, though, so some interpretation and
abstraction is warranted.
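A minimal sketch of scraping a couple of SMART attributes into
textfile-collector format; attribute names and raw-value semantics vary by
vendor and drive (hence the caveat above), and the metric names are my own:

    for dev in /dev/sd?; do
        smartctl -A "$dev" | awk -v dev="$dev" '
            $2 == "Reallocated_Sector_Ct" { printf("smart_reallocated_sectors{device=\"%s\"} %s\n", dev, $10) }
            $2 == "UDMA_CRC_Error_Count"  { printf("smart_udma_crc_errors{device=\"%s\"} %s\n", dev, $10) }'
    done > /var/lib/node_exporter/smart.prom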
— ymmv aad
I am curious, though, how one might have pinpointed a troublesome
host/OSD prior to this. Looking back at some of the detail when
attempting to diagnose, we do see some ops taking longer in
sub_op_committed, but not really a lot else. We'd get an occasional
"slow operation on OSD" warning, but the OSDs were spread across various
Ceph nodes, not just the one with issues, which I'm assuming is due to EC.
There was no real clarity on where the 'jam' was happening, at least
in anything we looked at. I'm wondering if there's a better way to see
what, specifically, is "slow" on a cluster. Even looking at the OSD
perf output wasn't helpful, because all of that was fine; it was
likely due to EC and write operations to OSDs on that specific node in
question. Is there some way to look at a cluster and see which hosts
are problematic/leading to slowness in an EC-based setup?