RGWReshardLock::lock failed to acquire lock ret=-16 - ceph-users

31 Oct 2019

Hi,

Currently running Mimic 13.2.5.

We had reports this morning of timeouts and failures with PUT and GET
requests to our Ceph RGW cluster. I found these messages in the RGW
log:
RGWReshardLock::lock failed to acquire lock on
bucket_name:bucket_instance ret=-16
NOTICE: resharding operation on bucket index detected, blocking
block_while_resharding ERROR: bucket is still resharding, please retry

Which were preceded by many of these, which I think are normal/expected.
check_bucket_shards: resharding needed: stats.num_objects=6415879
shard max_objects=6400000

Our RGW cluster sits behind haproxy which notified me approx 90
seconds after the first 'resharding needed' message that no backends
were available. It appears this dynamic reshard process caused the
RGWs to lock up for a period of time. Roughly 2 minutes later the
reshard error messages stop and operation returns to normal.

Looking back through previous RGW logs, I see a similar event from
about a week ago, on the same bucket. We have several buckets with
shard counts exceeding 1k (this one only has 128), and much larger
object counts, so clearly this isn't the first time dynamic sharding
has been invoked on this cluster.

Has anyone seen this? I expect it will come up again, and can turn up
debugging if that'll help. Thanks for any assistance!
Josh