[ceph-users] Re: deep-scrub / backfilling: large amount of SLOW_OPS after upgrade to 13.2.8

9 Jan 2020

Hi,

Quoting Stefan Kooman (stefan(a)bit.nl):
...
  Hi,

 After the upgrade to 13.2.8 deep-scrub has a big impact on client IO:
 loads of SLOW_OPS and high latency. We hardly ever had SLOW_OPS, but
 since the upgrade the impact is so big that we even have OSDs marking
 each other out (OSD op thread timeout) multiple times during the scrub
 window. Plenty of CPU / RAM / IOPS left, hardly any load on these OSD
 servers. Has anything changed in this release that could explain
 this behaviour?

 Besides this, the impact of rebalancing is very severe as well. With only
 the balancer remapping a couple of PGs at a time there are loads of
 (MDS_)SLOW_OPS. This morning the cephfs metadata pool got rebalanced ...
 and that triggered a lot of SLOW_OPS. One particular OSD was pegged at
 1000% CPU for more than half an hour (not doing that much IO): that's 10
 cores going full throttle! After a restart this issue was gone. 
We can now also trigger SLOW_OPS on a bunch of OSDs when we do a "rbd du
-p $POOL", something that was never an issue before. The images in
the rbd pools have the following features enabled: layering,
exclusive-lock, object-map, fast-diff, deep-flatten.
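
For anyone who wants to check whether they see the same thing, here is a minimal sketch of the test described above. It times "rbd du -p $POOL" and then looks for the SLOW_OPS health check in the output of "ceph health detail". The helper names (has_slow_ops, time_rbd_du) are made up for illustration, and the substring match is an assumption about how you want to detect the warning, not something Ceph itself provides.

```python
import subprocess
import sys
import time


def has_slow_ops(health_text: str) -> bool:
    """Return True if the health report text mentions the SLOW_OPS check.

    Hypothetical helper: a plain substring match, which also matches
    prefixed variants such as MDS_SLOW_OPS.
    """
    return "SLOW_OPS" in health_text


def time_rbd_du(pool: str) -> float:
    """Run `rbd du -p <pool>` and return the wall-clock duration in seconds."""
    start = time.monotonic()
    subprocess.run(["rbd", "du", "-p", pool], check=True,
                   stdout=subprocess.DEVNULL)
    return time.monotonic() - start


def main(pool: str) -> None:
    elapsed = time_rbd_du(pool)
    # Fetch the cluster health report once the command has finished.
    health = subprocess.run(["ceph", "health", "detail"], check=True,
                            capture_output=True, text=True).stdout
    print(f"rbd du on pool {pool!r} took {elapsed:.1f}s; "
          f"SLOW_OPS reported afterwards: {has_slow_ops(health)}")


# Only touch the cluster when a pool name is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Run it with the pool name as the only argument on a node that has a working client keyring. It only reports whether the warning is present after the command has finished, so slow ops that clear up before the health check runs will be missed.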

Has anything changed in 13.2.8 that affects these kinds of
operations?

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info(a)bit.nl
