I'd like to say that it was something smart, but it was a bit of luck.
I logged in on a hypervisor (we run OSDs and OpenStack hypervisors on the
same hosts) to deal with another issue, and while checking the system I
noticed that one of the OSDs was using a lot more CPU than the others. It
made me think that the increased IOPS could be putting a strain on some of the
OSDs without impacting the whole cluster, so I decided to increase pg_num to
spread the operations across more OSDs, and it did the trick. The qlen metric
went back to something similar to what we had before the problems started.
We're going to look into adding CPU/RAM monitoring for all the OSDs next.
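For anyone curious, a quick host-side check like the one below is enough to spot a busy OSD (a generic sketch, not our exact tooling):

```shell
# List ceph-osd processes on this host sorted by CPU usage, busiest first.
# The [c] in the grep pattern keeps grep from matching itself.
ps -eo pcpu,pid,args --sort=-pcpu | grep '[c]eph-osd' | head -n 5
```

On a host that also runs hypervisor workloads, comparing the %CPU column across OSDs is what made the outlier obvious.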
Gauvain
On Fri, Dec 22, 2023 at 2:58 PM Drew Weaver <drew.weaver(a)thenap.com> wrote:
Can you say how you determined that this was a
problem?
-----Original Message-----
From: Gauvain Pocentek <gauvainpocentek(a)gmail.com>
Sent: Friday, December 22, 2023 8:09 AM
To: ceph-users(a)ceph.io
Subject: [ceph-users] Re: RGW requests piling up
Hi again,
It turns out that our rados cluster wasn't that happy after all: the rgw index
pool wasn't able to handle the load. Scaling the PG number helped (256 to
512), and the RGW is back to normal behaviour.
There is still a huge number of read IOPS on the index, and we'll try to
figure out what's happening there.
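For the record, the scaling itself is a one-liner; a sketch assuming the default zone's index pool name (adjust to yours):

```shell
# Assumed pool name -- check yours with `ceph osd pool ls`.
POOL=default.rgw.buckets.index

# Raise placement groups on the index pool (what we did: 256 -> 512).
ceph osd pool set "$POOL" pg_num 512
# If the PG autoscaler isn't managing this pool, bump pgp_num as well
# so data actually rebalances onto the new PGs.
ceph osd pool set "$POOL" pgp_num 512

# Watch per-pool client IO to keep an eye on the index read IOPS.
ceph osd pool stats "$POOL"
```

These run against a live cluster, so treat them as an ops fragment rather than something to copy blindly.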
Gauvain
On Thu, Dec 21, 2023 at 1:40 PM Gauvain Pocentek <
gauvainpocentek(a)gmail.com>
wrote:
Hello Ceph users,
We've been having an issue with RGW for a couple of days, and we would
appreciate some help, ideas, or guidance to figure out the issue.
We run a multi-site setup which has been working pretty well so far.
We don't actually have data replication enabled yet, only metadata
replication. On the master region we've started to see requests piling
up in the rgw process, leading to very slow operations and failures
all over the place (clients time out before getting responses from
rgw). The workaround for now is to restart the rgw containers regularly.
We made a mistake and forcefully deleted a bucket on a secondary
zone; this might be the trigger, but we are not sure.
Other symptoms include:
* Increased memory usage of the RGW processes (we bumped the container
limits from 4G to 48G to cater for that)
* Lots of read IOPS on the index pool (4 or 5 times more compared to
what we were seeing before)
* The prometheus ceph_rgw_qlen and ceph_rgw_qactive metrics (number of
active requests) seem to show that the number of concurrent requests
increases with time, although we don't see more requests coming in on
the load-balancer side.
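If you want to eyeball the same gauges without going through Prometheus, something like this works (hypothetical mgr hostname; 9283 is the default port of the ceph-mgr prometheus module):

```shell
# Scrape the ceph-mgr prometheus endpoint and pull the RGW queue gauges.
curl -s http://ceph-mgr.example:9283/metrics | grep -E '^ceph_rgw_(qlen|qactive)'
```

A steadily climbing value per RGW daemon, with flat ingress on the load balancer, is what pointed us at requests not completing.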
Our current thinking is that the RGW process doesn't close requests
properly, or that some requests just hang. After a restart of the
process things look OK, but the situation turns bad fairly quickly
(after about an hour we start to see many timeouts).
The rados cluster seems completely healthy; it is also used for rbd
volumes, and we haven't seen any degradation there.
Has anyone experienced that kind of issue? Anything we should be
looking at?
Thanks for your help!
Gauvain
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io