[ceph-users] Re: Slow ops on OSDs

6 Oct 2020

On 2020-10-06 13:05, Igor Fedotov wrote:
...

 On 10/6/2020 1:04 PM, Kristof Coucke wrote:
  Another strange thing is going on:

 No client software is using the system any longer, so we would expect
 that all IOs are related to the recovery (fixing of the degraded PG).
 However, the disks that are reaching high IO are not members of the
 PGs that are being fixed.

 So, something is heavily using the disks, but I can't find the process
 right away. I've read that old client processes can keep connecting
 to an OSD to retrieve data for a specific PG even though that PG is
 no longer available on that disk.

 I bet it's rather PG removal happening in the background...
^^ This, and probably the RocksDB housekeeping that goes with it, as
removing PGs alone shouldn't be too big a deal. Especially with very
small files (and a lot of them) you probably have a lot of OMAP / META
data (ceph osd df will tell you).

If that's indeed the case, then there is a (way) quicker option to get
out of this situation: offline compaction of the OSDs. Compaction runs
orders of magnitude faster offline than while the OSDs are still online.

To check whether this hypothesis is true: are the OSD servers where the
PGs were previously located under CPU stress (and not the new hosts)?
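
One rough way to answer that question on a given host is to rank the ceph-osd processes by CPU usage. This is only a sketch: the helper name busiest_osds is made up for illustration, and it assumes the OSD daemons show up in ps with the command name ceph-osd.

```shell
# Hypothetical helper (not part of Ceph): reads "pid pcpu comm" lines on
# stdin and prints "pid pcpu" for each ceph-osd process, busiest first.
busiest_osds() {
    awk '$3 == "ceph-osd" { print $1, $2 }' | sort -k2,2nr
}
```

Feed it with ps -eo pid,pcpu,comm | busiest_osds on an old host and on a new one. If only the hosts that used to hold the PGs show busy OSD processes, that would fit the PG removal / RocksDB explanation.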

Offline compaction per host:

systemctl stop ceph-osd.target

for osd in /var/lib/ceph/osd/*; do
    ceph-kvstore-tool bluestore-kv "$osd" compact &
done
wait  # block until every background compaction has finished
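
For anyone who wants to preview what will run before touching a host, the same loop can be wrapped in a small function with a dry-run switch. This is a sketch only: the function name compact_host and the DRY_RUN variable are made up here and are not part of Ceph. The ceph-kvstore-tool invocation is the one given above.

```shell
# Hypothetical wrapper (not part of Ceph): compact every OSD directory
# under the given root in parallel, then wait for all of them.
# With DRY_RUN set to a non-empty value, the commands are only printed.
compact_host() {
    root="${1:-/var/lib/ceph/osd}"
    for dir in "$root"/*; do
        [ -d "$dir" ] || continue   # skip anything that is not a directory
        ${DRY_RUN:+echo} ceph-kvstore-tool bluestore-kv "$dir" compact &
    done
    wait
}
```

With the OSDs stopped, DRY_RUN=1 compact_host prints one command per OSD directory and compact_host runs them. The OSDs still have to be started again afterwards, presumably with systemctl start ceph-osd.target.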

Gr. Stefan
