Thanks for the response. No, I have not yet told the OSDs participating in that PG to
compact. It was something I had thought about, but was somewhat concerned about what that
might do, or what performance impact that might have (or if the OSD would come out alive
on the other side). I think we may have found a less impactful way to trim these bilog
entries by using `--start-marker` and `--end-marker` and simply looping and incrementing
those marker values by 1000 each time. This is far less impactful than running the
commands without those flags: without them, each pass took ~45 seconds just to enumerate
the bilog entries to trim, during which the lead OSD was nearly unresponsive. It took
diving into the
source code and the help of a few colleagues (as well as some trial and error on
non-production systems) to figure out what values those arguments actually wanted.
Thankfully I was able to get a listing of all OMAP keys for that object a couple weeks
ago. I’m still not sure how comfortable I would be doing this to a bucket that was
actually mission critical (this one contains non-critical data), but I think we may have a
way forward to dislodge this large OMAP by trimming. Thanks again!
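In case it helps anyone searching the archives later, the loop described above can be sketched roughly like this. The bucket name, the 11-digit zero-padded marker format, the batch size, and the 360M upper bound are all assumptions/placeholders here; check `radosgw-admin bilog list` output on your own cluster for the real marker format before attempting anything similar:

```shell
#!/bin/sh
# Rough sketch of the incremental bilog trim loop. The marker format is an
# assumption (zero-padded numeric markers) -- verify against real
# `radosgw-admin bilog list` output first.

BUCKET="legacy-bucket"   # hypothetical bucket name
STEP=1000                # entries to trim per batch

# Build a marker string from a plain integer (assumed zero-padded to 11
# digits; confirm the width in your environment).
make_marker() {
    printf '%011d' "$1"
}

# Only attempt the trims if the CLI is actually present.
if command -v radosgw-admin >/dev/null 2>&1; then
    start=0
    while [ "$start" -lt 360000000 ]; do
        end=$((start + STEP))
        radosgw-admin bilog trim \
            --bucket="$BUCKET" \
            --start-marker="$(make_marker "$start")" \
            --end-marker="$(make_marker "$end")"
        start=$end
        sleep 2   # brief pause between batches to limit load on the lead OSD
    done
fi
```

The pause between batches is the point of the exercise: small bounded trims with breathing room, rather than one enumeration that pins the lead OSD.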
From: Dan van der Ster <dan.vanderster(a)clyso.com>
Date: Wednesday, April 26, 2023 at 11:11 AM
To: Ben.Zieglmeier <Ben.Zieglmeier(a)target.com>
Cc: ceph-users(a)ceph.io <ceph-users(a)ceph.io>
Subject: [EXTERNAL] Re: [ceph-users] Massive OMAP remediation
Are you compacting the relevant osds periodically? `ceph tell osd.x
compact` (for the three osds holding the bilog) would help reshape the
rocksdb levels so they at least perform better for a little while, until
the next round of bilog trims.
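A minimal sketch of that suggestion; the OSD IDs here are hypothetical (take the real ones from `ceph pg map <pgid>` for the index PG), and the commands are printed first so they can be reviewed before being piped to a shell:

```shell
#!/bin/sh
# Print the compact commands for review; OSD IDs are hypothetical --
# substitute the three OSDs that actually hold the bilog's PG.
OSDS="12 37 81"
for osd in $OSDS; do
    echo "ceph tell osd.$osd compact"
done
# When satisfied, re-run this script piped to sh to execute them.
```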
Otherwise, I have experience deleting ~50M object indices in one step
in the past, probably back in the luminous days IIRC. It will likely
lock up the relevant osds for a while as the omap is removed. If you
dare take that step, it might help to set nodown; that might prevent
other osds from flapping and creating more work.
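If someone did take the one-shot path, the nodown guard would wrap the heavy step roughly like this. The flag commands are the standard ceph CLI ones; the removal command shown is only a placeholder example, and the whole sequence is emitted for review rather than executed directly:

```shell
#!/bin/sh
# Sketch: emit the nodown-guarded sequence so it can be reviewed (and
# piped to sh when ready). The heavy operation is supplied by the caller
# and is a placeholder here.
nodown_wrapped() {
    echo "ceph osd set nodown"     # prevent flapping OSDs being marked down
    echo "$1"                      # the heavy operation goes here
    echo "ceph osd unset nodown"   # restore normal down-marking afterwards
}

# Hypothetical example of the heavy step -- NOT a recommendation:
nodown_wrapped "radosgw-admin bucket rm --bucket=legacy-bucket --purge-objects"
```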
Clyso GmbH |
On Tue, Apr 25, 2023 at 2:45 PM Ben.Zieglmeier
We have a RGW cluster running Luminous (12.2.11) that has one object with an extremely
large OMAP database in the index pool. Listomapkeys on the object returned 390 Million
keys to start. Through bilog trim commands, we’ve whittled that down to about 360 Million.
This is a bucket index for a regrettably unsharded bucket. There are only about 37K
objects actually in the bucket, but through years of neglect, the bilog has grown completely
out of control. We’ve hit some major problems trying to deal with this particular OMAP
object. We just crashed 4 OSDs when a bilog trim caused enough churn to knock one of the
OSDs housing this PG out of the cluster temporarily. The OSD disks are 6.4TB NVMe, but are
split into 4 partitions, each housing its own OSD daemon (collocated journal).
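For reference, the key count mentioned above can be measured with something along these lines. The pool name and object name are placeholders (the index object is typically named `.dir.<bucket marker>`, with the marker taken from the bucket metadata); this is a sketch, not our exact command:

```shell
#!/bin/sh
# Sketch: count omap keys on a bucket index object. Pool and object
# names are placeholders -- guarded so it only runs where rados exists.
if command -v rados >/dev/null 2>&1; then
    rados -p default.rgw.buckets.index listomapkeys ".dir.MARKER_ID" | wc -l
fi
```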
We want to be rid of this large OMAP object, but are running out of options to deal with
it. Resharding outright does not seem like a viable option, as we believe the deletion
would deadlock OSDs and could cause impact. Continuing to run `bilog trim` 1000 records
at a time has been our approach, but this also seems to be impacting
performance/stability. We are seeking options to remove this problematic object without
creating additional problems. It is quite likely this bucket is abandoned, so we could
remove the data, but I fear even the deletion of such a large OMAP could bring OSDs down
and cause potential for metadata loss (the other bucket indexes on that same PG).
Any insight available would be highly appreciated.
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io