Hi,
Currently running Mimic 13.2.5.
We had reports this morning of timeouts and failures with PUT and GET
requests to our Ceph RGW cluster. I found these messages in the RGW
log:
RGWReshardLock::lock failed to acquire lock on
bucket_name:bucket_instance ret=-16
NOTICE: resharding operation on bucket index detected, blocking
block_while_resharding ERROR: bucket is still resharding, please retry
These were preceded by many of the following messages, which I think are normal/expected:
check_bucket_shards: resharding needed: stats.num_objects=6415879
shard max_objects=6400000
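If I'm reading the docs right, that max_objects figure is just the bucket's current shard count times rgw_max_objs_per_shard (default 100000), so this bucket had crossed its per-shard limit. For anyone else digging into this, these standard radosgw-admin commands should show which buckets are near the limit and what the reshard queue looks like (the bucket name below is a placeholder):

  # per-bucket object counts, shard counts and fill status vs the limit
  radosgw-admin bucket limit check
  # buckets currently queued for dynamic resharding
  radosgw-admin reshard list
  # progress of an in-flight reshard on one bucket
  radosgw-admin reshard status --bucket=<bucket_name>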
Our RGW cluster sits behind haproxy, which notified me approximately 90
seconds after the first 'resharding needed' message that no backends
were available. It appears the dynamic reshard process caused the
RGWs to lock up for a period of time. Roughly 2 minutes later the
reshard error messages stopped and operation returned to normal.
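If this keeps happening, one fallback I'm considering (untested, so treat it as a sketch; the config section name, bucket name and shard count below are placeholders/examples) would be to disable dynamic resharding on the RGWs and reshard the large buckets manually during a quiet window:

  # ceph.conf on the RGW hosts, then restart radosgw
  [client.rgw.<instance>]
  rgw_dynamic_resharding = false

  # manual reshard during a maintenance window
  radosgw-admin bucket reshard --bucket=<bucket_name> --num-shards=128

but I'd much rather understand why the automatic reshard blocks requests for this long.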
Looking back through previous RGW logs, I see a similar event from
about a week ago, on the same bucket. We have several buckets with
shard counts exceeding 1k (this one only has 128), and much larger
object counts, so clearly this isn't the first time dynamic resharding
has been invoked on this cluster.
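(For reference, a bucket's current shard count can be read from its instance metadata; the bucket name is a placeholder and the bucket_id comes from the first command's output, with num_shards appearing under bucket_info:

  radosgw-admin metadata get bucket:<bucket_name>
  radosgw-admin metadata get bucket.instance:<bucket_name>:<bucket_id>

)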
Has anyone seen this? I expect it will come up again, and can turn up
debugging if that'll help. Thanks for any assistance!
Josh
Any thoughts on this? We just experienced it again last night. Our 3
RGW servers had trouble servicing requests for approximately 7 minutes
while this reshard happened. Our users received 5xx errors from haproxy,
which fronts the RGW instances. Haproxy is configured with a backend
server timeout of 60 seconds and logged a couple thousand connections
with termination code 'sH--', indicating the RGWs did not return
response headers within that time.
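For context, the relevant bit of the haproxy backend is just a plain server timeout (paraphrased; server names, addresses and the port are placeholders, 7480 being the RGW default):

  backend rgw
      timeout server 60s
      server rgw1 <rgw1-addr>:7480 check
      server rgw2 <rgw2-addr>:7480 check
      server rgw3 <rgw3-addr>:7480 check

so any request the RGWs sit on for more than 60 seconds gets cut off with a 5xx.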
This is especially concerning because the failures affect requests to
many buckets, not just the one currently being resharded.
I am testing Nautilus on our dev cluster; are there any known fixes
for this issue included in that release?
Regards,
Josh