Hi,
Quoting Stefan Kooman (stefan(a)bit.nl):
Hi,
After the upgrade to 13.2.8 deep-scrub has a big impact on client IO:
loads of SLOW_OPS and high latency. We hardly ever had SLOW_OPS, but
since the upgrade the impact is so big that we even have OSDs marking
each other out (OSD op thread timeout) multiple times during the scrub
window. Plenty of CPU / RAM / IOPS left, hardly any load on these OSD
servers. Has anything changed in this release that can explain
this behaviour?
Besides this, the impact of rebalancing is very severe as well. With only
the balancer remapping a couple of PGs at a time there are loads of
(MDS_)SLOW_OPS. This morning the cephfs metadata pool got rebalanced ...
and that triggered a lot of SLOW_OPS. One particular OSD was pegged at
1000% CPU for more than half an hour (not doing that much IO): that's 10
cores going full throttle! After a restart this issue was gone.
We can now also trigger SLOW_OPS on a bunch of OSDs when we run
"rbd du -p $POOL", something that has never been an issue. The images in
the rbd pools have the following features enabled: layering,
exclusive-lock, object-map, fast-diff, deep-flatten.
Has anything changed in 13.2.8 that affects these kinds of
operations?
Gr. Stefan
--
| BIT BV  https://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6  +31 318 648 688 / info(a)bit.nl