I'd like to say that it was something smart, but it was a bit of luck.
I logged in on a hypervisor (we run OSDs and OpenStack hypervisors on the
same hosts) to deal with another issue, and while checking the system I
noticed that one of the OSDs was using a lot more CPU than the others. It
made me think that the increased IOPS could be putting a strain on some of the
OSDs without impacting the whole cluster, so I decided to increase pg_num to
spread the operations across more OSDs, and it did the trick. The qlen metric
went back to something similar to what we had before the problems started.
We're going to look into adding CPU/RAM monitoring for all the OSDs next.
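For anyone curious, a quick host-side check like the one below is enough to spot a busy OSD (a generic sketch, not our exact tooling):

```shell
# List ceph-osd processes on this host sorted by CPU usage, busiest first.
# The [c] in the grep pattern keeps grep from matching itself.
ps -eo pcpu,pid,args --sort=-pcpu | grep '[c]eph-osd' | head -n 5
```

On a host that also runs hypervisor workloads, comparing the %CPU column across OSDs is what made the outlier obvious.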
Gauvain
On Fri, Dec 22, 2023 at 2:58 PM Drew Weaver <drew.weaver(a)thenap.com> wrote:
Can you say how you determined that this was a
problem?
-----Original Message-----
From: Gauvain Pocentek <gauvainpocentek(a)gmail.com>
Sent: Friday, December 22, 2023 8:09 AM
To: ceph-users(a)ceph.io
Subject: [ceph-users] Re: RGW requests piling up
Hi again,
It turns out that our rados cluster wasn't that happy after all: the rgw index
pool wasn't able to handle the load. Scaling the PG number helped (256 to
512), and the RGW is back to normal behaviour.
There is still a huge number of read IOPS on the index, and we'll try to
figure out what's happening there.
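For the record, the scaling itself is a one-liner; a sketch assuming the default zone's index pool name (adjust to yours):

```shell
# Assumed pool name -- check yours with `ceph osd pool ls`.
POOL=default.rgw.buckets.index

# Raise placement groups on the index pool (what we did: 256 -> 512).
ceph osd pool set "$POOL" pg_num 512
# If the PG autoscaler isn't managing this pool, bump pgp_num as well
# so data actually rebalances onto the new PGs.
ceph osd pool set "$POOL" pgp_num 512

# Watch per-pool client IO to keep an eye on the index read IOPS.
ceph osd pool stats "$POOL"
```

These run against a live cluster, so treat them as an ops fragment rather than something to copy blindly.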
Gauvain
On Thu, Dec 21, 2023 at 1:40 PM Gauvain Pocentek <
gauvainpocentek(a)gmail.com>
wrote:
Hello Ceph users,
We've been having an issue with RGW for a couple of days, and we would
appreciate some help, ideas, or guidance to figure out the issue.
We run a multi-site setup which has been working pretty well so far.
We don't actually have data replication enabled yet, only metadata
replication. On the master region we've started to see requests piling
up in the rgw process, leading to very slow operations and failures
all over the place (clients time out before getting responses from
rgw). The workaround for now is to restart the rgw containers regularly.
We made a mistake and forcefully deleted a bucket on a secondary
zone; this might be the trigger, but we are not sure.
Other symptoms include:
* Increased memory usage of the RGW processes (we bumped the container
limits from 4G to 48G to cater for that)
* Lots of read IOPS on the index pool (4 or 5 times more compared to
what we were seeing before)
* The prometheus ceph_rgw_qlen and ceph_rgw_qactive metrics (number of
active requests) seem to show that the number of concurrent requests
increases with time, although we don't see more requests coming in on
the load-balancer side.
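If you want to eyeball the same gauges without going through Prometheus, something like this works (hypothetical mgr hostname; 9283 is the default port of the ceph-mgr prometheus module):

```shell
# Scrape the ceph-mgr prometheus endpoint and pull the RGW queue gauges.
curl -s http://ceph-mgr.example:9283/metrics | grep -E '^ceph_rgw_(qlen|qactive)'
```

A steadily climbing value per RGW daemon, with flat ingress on the load balancer, is what pointed us at requests not completing.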
Our current thinking is that the RGW process doesn't close requests
properly, or that some requests just hang. After a restart of the
process things look OK, but the situation turns bad fairly quickly
(after about an hour we start to see many timeouts).
The rados cluster seems completely healthy; it is also used for rbd
volumes, and we haven't seen any degradation there.
Has anyone experienced that kind of issue? Anything we should be
looking at?
Thanks for your help!
Gauvain
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io