Hi,
we have a very similar situation. We upgraded from Nautilus to Pacific (16.2.11) and,
after a few hours, saw a rapid increase in commit_latency and op_w_latency (>10s on some
OSDs). Our workload is also almost exclusively RBD.
After deleting old snapshots we saw an improvement, and after recreating snapshots the
numbers went up again. Without snapshots the latencies still creep up, but not as fast
as they did with snapshots present. We also use SAS connected NVMe-SSDs.
Changing bluefs_buffered_io made no difference. We compacted the RocksDB on a single OSD
yesterday, and funnily enough this is now the OSD with the highest op_w_latency. I
generated a perf graph for this single OSD and can generate more, but I'm not sure how
best to share this data with you?
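
In case it helps with comparing numbers between OSDs: the latency counters in
`ceph daemon osd.<id> perf dump` are, as far as I understand, cumulative since the OSD
started, so a single reading hides a recent increase. Below is a minimal sketch that
takes two dumps and computes the average op_w_latency for the interval between them. It
assumes the counter sits under the "osd" section with "avgcount" and "sum" (seconds)
fields; the function name and the sample values are made up for illustration.

```python
def interval_latency(before, after, counter="op_w_latency"):
    """Average latency in seconds for the ops completed between two perf dumps.

    Assumes each dump looks like {"osd": {counter: {"avgcount": N, "sum": S}}},
    where both fields accumulate since the OSD started.
    """
    b = before["osd"][counter]
    a = after["osd"][counter]
    ops = a["avgcount"] - b["avgcount"]
    if ops <= 0:
        # No new ops, or the OSD restarted and its counters were reset.
        return None
    return (a["sum"] - b["sum"]) / ops


# Made-up sample values, not measurements from our cluster.
before = {"osd": {"op_w_latency": {"avgcount": 1000, "sum": 5.0}}}
after = {"osd": {"op_w_latency": {"avgcount": 1500, "sum": 30.0}}}
print(interval_latency(before, after))
```

In practice the two dictionaries would come from parsing the JSON output of two perf
dumps taken some minutes apart on the same OSD.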
I saw in the thread that Boris redeployed all OSDs. Could that be a more permanent
solution, or is it also just temporary (like deleting the snapshots)?
Greetings,
Jan