[ceph-users] Re: Very slow snaptrim operations blocking client I/O

30 Jan 2023

Hi,

Josh already suggested this, but I will repeat it once more. We saw similar 
behaviour when upgrading from Nautilus to Pacific. In our case compacting the 
OSDs did the trick.

For us there was no performance impact from running the compaction (ceph 
daemon osd.0 compact), although we ran it in batches rather than on all 
OSDs at once, just in case. Also, there is no need to restart OSDs for this 
operation.
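
A minimal sketch of the batching idea, not the exact procedure used on our cluster. It assumes the OSD ids are known in advance and uses the `ceph tell osd.N compact` form, which can be issued from any admin node (the admin-socket form has to run on the host that owns the OSD). The helper names and the batch size are hypothetical. By default it only prints the commands.

```python
import subprocess
from typing import Iterator, List, Sequence


def batches(osd_ids: Sequence[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive groups of at most `size` OSD ids."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(osd_ids), size):
        yield list(osd_ids[start:start + size])


def compact_command(osd_id: int) -> List[str]:
    """Build the compaction command for one OSD."""
    return ["ceph", "tell", f"osd.{osd_id}", "compact"]


def compact_in_batches(osd_ids: Sequence[int], size: int,
                       dry_run: bool = True) -> List[List[List[str]]]:
    """Compact one batch at a time and return the commands, grouped per batch.

    With dry_run=True (the default) the commands are only printed.
    Otherwise every OSD in a batch is compacted concurrently, and the
    next batch starts only after the whole batch has finished.
    """
    plan = []
    for group in batches(osd_ids, size):
        commands = [compact_command(osd_id) for osd_id in group]
        if dry_run:
            for cmd in commands:
                print(" ".join(cmd))
        else:
            procs = [subprocess.Popen(cmd) for cmd in commands]
            failed = [p.args for p in procs if p.wait() != 0]
            if failed:
                raise RuntimeError(f"compaction failed for: {failed}")
        plan.append(commands)
    return plan


if __name__ == "__main__":
    # Hypothetical OSD ids; replace with the ids of your own cluster.
    compact_in_batches([0, 1, 2, 3, 6, 7], size=2)
```

Waiting for a whole batch before starting the next keeps the number of OSDs compacting at any moment bounded by the batch size.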

https://www.spinics.net/lists/ceph-users/msg74774.html

Regards,

Ana

On 30-01-2023 10:23, Victor Rodriguez wrote:
> I'm reading that thread on spinics. Thanks for pointing that out.
>
> The cluster was upgraded from Nautilus to Octopus first week of 
> December. No previous version has been used in this cluster. I just 
> don't remember if we used snapshots after the upgrade. 
> bluestore_fsck_quick_fix_on_mount is at default of "true". Did not 
> notice anything unusual during or after the upgrade.
>
> Is there any way to check if the OMAP conversion is done for an OSD? 
> Maybe it tries to do the conversion every time I restart an OSD and 
> fails? (given they take nearly a minute to start). I still have 13 PGs 
> pending snaptrims and client I/O is severely affected even while doing 
> one OSD at a time with osd_snap_trim_sleep_ssd=5 :\
>
>
> On 1/30/23 09:23, Frank Schilder wrote:
>> Hi Victor,
>>
>> out of curiosity, did you upgrade the cluster recently to octopus? We 
>> and others observed this behaviour when following one out of two 
>> routes to upgrade OSDs. There was a thread "Octopus OSDs extremely 
>> slow during upgrade from mimic", which seems to have been lost with 
>> the recent mail list outage. If it is relevant, I could copy pieces I 
>> have into here.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Victor Rodriguez <vrodriguez(a)soltecsis.com>
>> Sent: 29 January 2023 22:40:46
>> To: ceph-users(a)ceph.io
>> Subject: [ceph-users] Re: Very slow snaptrim operations blocking 
>> client I/O
>>
>> Looks like this is going to take a few days. I hope to manage the
>> available performance for VMs with osd_snap_trim_sleep_ssd.
>>
>> I'm wondering whether, after that long snaptrim process you went through,
>> your cluster was stable again and snapshots/snaptrims worked properly?
>>
>>
>> On 1/29/23 16:01, Matt Vandermeulen wrote:
>>> I should have explicitly stated that during the recovery, it was still
>>> quite bumpy for customers.  Some snaptrims were very quick, some took
>>> what felt like a really long time.  This was however a cluster with a
>>> very large number of volumes and a long, long history of snapshots.
>>> I'm not sure what the difference will be from our case versus a single
>>> large volume with a big snapshot.
>>>
>>>
>>>
>>> On 2023-01-28 20:45, Victor Rodriguez wrote:
>>>> On 1/29/23 00:50, Matt Vandermeulen wrote:
>>>>> I've observed a similar horror when upgrading a cluster from
>>>>> Luminous to Nautilus, which had the same effect of an overwhelming
>>>>> amount of snaptrim making the cluster unusable.
>>>>>
>>>>> In our case, we held its hand by setting all OSDs to have zero max
>>>>> trimming PGs, unsetting nosnaptrim, and then slowly enabling
>>>>> snaptrim a few OSDs at a time.  It was painful to babysit but it
>>>>> allowed the cluster to catch up without falling over.
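
The staged approach described above could be sketched roughly as follows. This is an illustration under assumptions, not the commands that were actually run: it assumes the throttle in question is the `osd_max_trimming_pgs` option, that it is changed with `ceph config set`, and that the change takes effect without an OSD restart (worth verifying on your release). The function name, the stage size and the per-OSD value of 1 are hypothetical. The code only builds the command lists; an operator would run each stage by hand and wait for the cluster to settle before the next one.

```python
from typing import List, Sequence

Command = List[str]


def staged_snaptrim_commands(osd_ids: Sequence[int], step: int,
                             max_trimming_pgs: int = 1) -> List[List[Command]]:
    """Return the commands grouped into stages.

    Stage 0 stops trimming on every OSD and clears the nosnaptrim flag.
    Each later stage re-enables trimming on the next `step` OSDs.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    stages: List[List[Command]] = [[
        # Assumption: a value of 0 keeps all OSDs from trimming any PG.
        ["ceph", "config", "set", "osd", "osd_max_trimming_pgs", "0"],
        ["ceph", "osd", "unset", "nosnaptrim"],
    ]]
    for start in range(0, len(osd_ids), step):
        stages.append([
            ["ceph", "config", "set", f"osd.{osd_id}",
             "osd_max_trimming_pgs", str(max_trimming_pgs)]
            for osd_id in osd_ids[start:start + step]
        ])
    return stages


if __name__ == "__main__":
    # Hypothetical OSD ids, enabled two at a time.
    for number, stage in enumerate(staged_snaptrim_commands([0, 1, 2, 3], 2)):
        print(f"# stage {number}")
        for cmd in stage:
            print(" ".join(cmd))
```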
>>>>
>>>> That's an interesting approach! Thanks!
>>>>
>>>> In preliminary tests, it seems that just running snaptrim on a single PG
>>>> of a single OSD still makes the cluster barely usable. I have to
>>>> increase osd_snap_trim_sleep_ssd to ~1 so the cluster remains usable,
>>>> at about a third of its performance. After a while, a few PGs got
>>>> trimmed, and it feels like some of them are harder to trim than others,
>>>> as some need a higher osd_snap_trim_sleep_ssd value to let the
>>>> cluster perform.
>>>>
>>>> I don't know how long this is going to take... Maybe recreating the
>>>> OSDs and dealing with the rebalance is a better option?
>>>>
>>>> There's something ugly going on here... I would really like to put
>>>> my finger on it.
>>>>
>>>>
>>>>> On 2023-01-28 19:43, Victor Rodriguez wrote:
>>>>>> After some investigation this is what I'm seeing:
>>>>>>
>>>>>> - OSD processes get stuck at 100% CPU or higher when I run ceph osd
>>>>>> unset nosnaptrim. They stay at 100% CPU even if I then run ceph osd
>>>>>> set nosnaptrim. They stayed like that for at least 26 hours. Some
>>>>>> quick benchmarks don't show a reduction in the performance of the
>>>>>> cluster.
>>>>>>
>>>>>> - Restarting an OSD lowers its CPU usage to typical levels, as
>>>>>> expected, but it also usually brings some other OSD on a different
>>>>>> host back to typical levels.
>>>>>>
>>>>>> - All OSDs in this cluster take quite a while to start: between 35
>>>>>> and 70 seconds depending on the OSD. Clearly much longer than any
>>>>>> other OSD in any of my clusters.
>>>>>>
>>>>>> - I believe that the size of the rocksdb database is dumped in the
>>>>>> OSD log when an automatic compact operation is triggered. The "sum"
>>>>>> sizes of these OSDs range between 2.5 and 5.1 GB. That's way bigger
>>>>>> than those in any other cluster I have.
>>>>>>
>>>>>> - ceph daemon osd.* calc_objectstore_db_histogram is giving values
>>>>>> for num_pgmeta_omap (I don't know what it is) way bigger than those
>>>>>> on any other of my clusters for some OSDs. Also, values are not
>>>>>> similar among the OSDs which hold the same PGs.
>>>>>>
>>>>>> osd.0:    "num_pgmeta_omap": 17526766,
>>>>>> osd.1:    "num_pgmeta_omap": 2653379,
>>>>>> osd.2:    "num_pgmeta_omap": 12358703,
>>>>>> osd.3:    "num_pgmeta_omap": 6404975,
>>>>>> osd.6:    "num_pgmeta_omap": 19845318,
>>>>>> osd.7:    "num_pgmeta_omap": 6043083,
>>>>>> osd.12:   "num_pgmeta_omap": 18666776,
>>>>>> osd.13:   "num_pgmeta_omap": 615846,
>>>>>> osd.14:   "num_pgmeta_omap": 13190188,
>>>>>>
>>>>>> - Compacting the OSD barely reduces rocksdb size and does not
>>>>>> reduce num_pgmeta_omap at all.
>>>>>>
>>>>>> - This is the only cluster I have where there are some RBD images
>>>>>> that I mount directly from some clients, that is, they are not
>>>>>> disks for QEMU/Proxmox VMs. Maybe I have something misconfigured
>>>>>> related to this?  This cluster is at least two and a half years
>>>>>> old and never had this issue with snaptrims.
>>>>>>
>>>>>> Thanks in advance!
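
To make the spread in the num_pgmeta_omap values listed above easier to see, here is a small sketch that parses the lines exactly as they were pasted in this message and ranks the OSDs. The `osd.N: "num_pgmeta_omap": value` layout is the format used in the message, not the raw output of calc_objectstore_db_histogram, and the function names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Values copied from the message above.
SAMPLE = """\
osd.0:    "num_pgmeta_omap": 17526766,
osd.1:    "num_pgmeta_omap": 2653379,
osd.2:    "num_pgmeta_omap": 12358703,
osd.3:    "num_pgmeta_omap": 6404975,
osd.6:    "num_pgmeta_omap": 19845318,
osd.7:    "num_pgmeta_omap": 6043083,
osd.12:   "num_pgmeta_omap": 18666776,
osd.13:   "num_pgmeta_omap": 615846,
osd.14:   "num_pgmeta_omap": 13190188,
"""

LINE = re.compile(r'(osd\.\d+):\s*"num_pgmeta_omap":\s*(\d+)')


def parse_pgmeta_omap(text: str) -> Dict[str, int]:
    """Map each OSD name to its num_pgmeta_omap value."""
    return {m.group(1): int(m.group(2)) for m in LINE.finditer(text)}


def rank_descending(values: Dict[str, int]) -> List[Tuple[str, int]]:
    """Sort OSDs from the largest num_pgmeta_omap to the smallest."""
    return sorted(values.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    ranked = rank_descending(parse_pgmeta_omap(SAMPLE))
    for name, count in ranked:
        print(f"{name:8s} {count:>12,d}")
    # Ratio between the largest and smallest value, as a rough spread measure.
    print(f"max/min ratio: {ranked[0][1] / ranked[-1][1]:.1f}")
```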
>>>> _______________________________________________
>>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
