I should have explicitly stated that during the recovery it was still
quite bumpy for customers. Some snaptrims were very quick, some took
what felt like a really long time. This was, however, a cluster with a
very large number of volumes and a long, long history of snapshots, so
I'm not sure how our case compares to a single large volume with a big
snapshot.
On 2023-01-28 20:45, Victor Rodriguez wrote:
> On 1/29/23 00:50, Matt Vandermeulen wrote:
>> I've observed a similar horror when upgrading a cluster from Luminous
>> to Nautilus, which had the same effect of an overwhelming amount of
>> snaptrim making the cluster unusable.
>>
>> In our case, we held its hand by setting all OSDs to have zero max
>> trimming PGs, unsetting nosnaptrim, and then slowly enabling snaptrim
>> a few OSDs at a time. It was painful to babysit but it allowed the
>> cluster to catch up without falling over.
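>>
>> Roughly, from memory, the sequence looked something like this (a
>> sketch only; we used osd_max_trimming_pgs as the throttle, and the
>> exact commands may differ on your release):
>>
>>   # keep every OSD from trimming before re-enabling snaptrim
>>   ceph config set osd osd_max_trimming_pgs 0
>>   ceph osd unset nosnaptrim
>>
>>   # then let a few OSDs at a time catch up, watching client latency
>>   ceph config set osd.0 osd_max_trimming_pgs 2
>>   ceph config set osd.1 osd_max_trimming_pgs 2
>>   # ...and so on, a few OSDs at a time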
>
>
> That's an interesting approach! Thanks!
>
> On preliminary tests, it seems that just running snaptrim on a single
> PG of a single OSD still makes the cluster barely usable. I have to
> increase osd_snap_trim_sleep_ssd to ~1 for the cluster to remain
> usable, at about a third of its normal performance. After a while a
> few PGs got trimmed, and it feels like some are harder to trim than
> others, as some need a higher osd_snap_trim_sleep_ssd value to keep
> the cluster performing.
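>
> For reference, the throttling I'm testing is simply something along
> the lines of (exact value still being tuned):
>
>   ceph config set osd osd_snap_trim_sleep_ssd 1.0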
>
> I don't know how long this is going to take... Maybe recreating the
> OSDs and dealing with the rebalance is a better option?
>
> There's something ugly going on here... I would really like to put my
> finger on it.
>
>
>> On 2023-01-28 19:43, Victor Rodriguez wrote:
>>> After some investigation this is what I'm seeing:
>>>
>>> - OSD processes get stuck at 100% CPU (or more) if I run ceph osd
>>> unset nosnaptrim. They stay at 100% CPU even if I then run ceph osd
>>> set nosnaptrim, and have stayed like that for at least 26 hours.
>>> Some quick benchmarks don't show any reduction in cluster
>>> performance.
>>>
>>> - Restarting an OSD lowers its CPU usage to typical levels, as
>>> expected, but it also usually brings some other OSD on a different
>>> host back to typical levels.
>>>
>>> - All OSDs in this cluster take quite a while to start: between 35
>>> and 70 seconds depending on the OSD, clearly much longer than any
>>> OSD in any of my other clusters.
>>>
>>> - I believe the size of the rocksdb database is dumped in the OSD
>>> log when an automatic compaction is triggered. The "sum" sizes for
>>> these OSDs range between 2.5 and 5.1 GB. That's way bigger than in
>>> any other cluster I have.
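>>>
>>> (A quicker way to watch those sizes, I think, is "ceph osd df",
>>> which reports per-OSD OMAP and META usage, or something like:
>>>
>>>   ceph daemon osd.N perf dump bluefs
>>>
>>> on the OSD's host, looking at db_used_bytes.)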
>>>
>>> - ceph daemon osd.* calc_objectstore_db_histogram is giving values
>>> for num_pgmeta_omap (I don't know what it is) that are, for some
>>> OSDs, way bigger than on any of my other clusters. Also, the values
>>> are not similar among the OSDs which hold the same PGs.
>>>
>>> osd.0: "num_pgmeta_omap": 17526766,
>>> osd.1: "num_pgmeta_omap": 2653379,
>>> osd.2: "num_pgmeta_omap": 12358703,
>>> osd.3: "num_pgmeta_omap": 6404975,
>>> osd.6: "num_pgmeta_omap": 19845318,
>>> osd.7: "num_pgmeta_omap": 6043083,
>>> osd.12: "num_pgmeta_omap": 18666776,
>>> osd.13: "num_pgmeta_omap": 615846,
>>> osd.14: "num_pgmeta_omap": 13190188,
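>>>
>>> (Those numbers came from running the admin socket command on each
>>> OSD's host, roughly:
>>>
>>>   for i in 0 1 2 3 6 7 12 13 14; do
>>>     ceph daemon osd.$i calc_objectstore_db_histogram | grep num_pgmeta_omap
>>>   done
>>>
>>> adjusted to whichever OSDs actually live on each host.)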
>>>
>>> - Compacting the OSDs barely reduces the rocksdb size and does not
>>> reduce num_pgmeta_omap at all.
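>>>
>>> (By compacting I mean something like an online
>>> "ceph daemon osd.N compact" on each OSD's host.)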
>>>
>>> - This is the only cluster I have where there are some RBD images
>>> that I mount directly from some clients, that is, they are not disks
>>> for QEMU/Proxmox VMs. Maybe I have something misconfigured related
>>> to this? This cluster is at least two and a half years old and never
>>> had this issue with snaptrims.
>>>
>>> Thanks in advance!