I've observed a similar horror when upgrading a cluster from Luminous to
Nautilus, which had the same effect of an overwhelming amount of
snaptrim making the cluster unusable.
In our case, we held its hand by setting osd_max_trimming_pgs to zero on
all OSDs, unsetting nosnaptrim, and then slowly re-enabling snaptrim on a
few OSDs at a time. It was painful to babysit, but it allowed the
cluster to catch up without falling over.
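
For reference, the sequence looked roughly like the following. This is a
sketch rather than an exact transcript; the batch size and the
osd_max_trimming_pgs values to use will depend on the cluster:

```shell
# 1. Stop all trimming: zero concurrent trimming PGs on every OSD.
ceph tell 'osd.*' injectargs '--osd_max_trimming_pgs 0'

# 2. Clear the cluster-wide flag; nothing trims yet because of step 1.
ceph osd unset nosnaptrim

# 3. Re-enable trimming on a few OSDs at a time, watching CPU and
#    client I/O before moving on to the next batch.
for osd in osd.0 osd.1 osd.2; do
    ceph tell "$osd" injectargs '--osd_max_trimming_pgs 1'
done
```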
On 2023-01-28 19:43, Victor Rodriguez wrote:
> After some investigation this is what I'm seeing:
>
> - OSD processes get stuck at at least 100% CPU if I ceph osd unset
> nosnaptrim. They stay at 100% CPU even if I then ceph osd set
> nosnaptrim, and remained like that for at least 26 hours. Some quick
> benchmarks don't show a reduction in cluster performance.
>
> - Restarting an OSD lowers its CPU usage to typical levels, as
> expected, but it also usually brings some other OSD on a different host
> back to typical levels.
>
> - All OSDs in this cluster take quite a while to start: between 35 and
> 70 seconds depending on the OSD. That's clearly much longer than any
> OSD in any of my other clusters.
>
> - I believe that the size of the RocksDB database is dumped in the OSD
> log when an automatic compaction is triggered. The "sum" sizes for
> these OSDs range between 2.5 and 5.1 GB. That's way bigger than those
> in any other cluster I have.
>
> - ceph daemon osd.* calc_objectstore_db_histogram is giving values for
> num_pgmeta_omap (I don't know what it is) that are way bigger on some
> OSDs than in any other of my clusters. Also, the values are not similar
> among the OSDs which hold the same PGs.
>
> osd.0: "num_pgmeta_omap": 17526766,
> osd.1: "num_pgmeta_omap": 2653379,
> osd.2: "num_pgmeta_omap": 12358703,
> osd.3: "num_pgmeta_omap": 6404975,
> osd.6: "num_pgmeta_omap": 19845318,
> osd.7: "num_pgmeta_omap": 6043083,
> osd.12: "num_pgmeta_omap": 18666776,
> osd.13: "num_pgmeta_omap": 615846,
> osd.14: "num_pgmeta_omap": 13190188,
>
> - Compacting the OSDs barely reduces the RocksDB size and does not
> reduce num_pgmeta_omap at all.
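>
> For reference, these values come from the OSD admin sockets; on each
> host, something like the following dumps them for every local OSD
> (socket paths assume the default /var/run/ceph location):
>
> ```shell
> # Dump num_pgmeta_omap for every OSD admin socket on this host.
> for sock in /var/run/ceph/ceph-osd.*.asok; do
>     echo "$sock:"
>     ceph daemon "$sock" calc_objectstore_db_histogram | grep num_pgmeta_omap
> done
> ```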
>
> - This is the only cluster I have where there are some RBD images that
> I mount directly from some clients, that is, they are not disks for
> QEMU/Proxmox VMs. Maybe I have something misconfigured related to
> this? This cluster is at least two and a half years old and never had
> this issue with snaptrims.
>
> Thanks in advance!
>
>
> On 1/27/23 17:29, Victor Rodriguez wrote:
>> Ah yes, checked that too. Monitors and OSDs report via ceph config
>> show-with-defaults that bluefs_buffered_io is set to true as the
>> default setting (it isn't overridden somewhere).
>>
>>
>> On 1/27/23 17:15, Wesley Dillingham wrote:
>>> I hit this issue once on a Nautilus cluster and changed the OSD
>>> parameter bluefs_buffered_io = true (it was set to false). I believe
>>> the default of this parameter was switched from false to true in
>>> release 14.2.20; however, perhaps you could still check what your
>>> OSDs are configured with in regard to this config item.
>>>
>>> Respectfully,
>>>
>>> *Wes Dillingham*
>>> wes(a)wesdillingham.com
>>> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
>>>
>>>
>>> On Fri, Jan 27, 2023 at 8:52 AM Victor Rodriguez
>>> <vrodriguez(a)soltecsis.com> wrote:
>>>
>>> Hello,
>>>
>>> Asking for help with an issue. Maybe someone has a clue about
>>> what's going on.
>>>
>>> Using ceph 15.2.17 on Proxmox 7.3. A big VM had a snapshot and I
>>> removed it. A bit later, nearly half of the PGs of the pool entered
>>> snaptrim and snaptrim_wait state, as expected. The problem is that
>>> such operations ran extremely slow and client I/O was nearly
>>> nothing, so all VMs in the cluster got stuck as they could not do
>>> I/O to the storage. Taking and removing big snapshots is a normal
>>> operation that we do often, and this is the first time I see this
>>> issue in any of my clusters.
>>>
>>> Disks are all Samsung PM1733 and the network is 25G. It gives us
>>> plenty of performance for the use case, and we never had an issue
>>> with the hardware.
>>>
>>> Both disk I/O and network I/O were very low. Still, client I/O
>>> seemed to get queued forever. Disabling snaptrim (ceph osd set
>>> nosnaptrim) stops any active snaptrim operation and client I/O
>>> resumes back to normal. Enabling snaptrim again makes client I/O
>>> almost halt again.
>>>
>>> I've been playing with some settings:
>>>
>>> ceph tell 'osd.*' injectargs '--osd-max-trimming-pgs 1'
>>> ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep 30'
>>> ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep-ssd 30'
>>> ceph tell 'osd.*' injectargs '--osd-pg-max-concurrent-snap-trims 1'
>>>
>>> None really seemed to help. Also tried restarting OSD services.
>>>
>>> This cluster was upgraded from 14.2.x to 15.2.17 a couple of months
>>> ago. Is there any setting that must be changed which may cause this
>>> problem?
>>>
>>> I have scheduled a maintenance window; what should I look for to
>>> diagnose this problem?
>>>
>>> Any help is very appreciated. Thanks in advance.
>>>
>>> Victor
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>>