bucket notification retries - ceph-users

31 May 2023

Dear Community,
I would like to collect your feedback on this issue. This is a follow-up
to a discussion that started in the RGW refactoring meeting on 31-May-23
(thanks @Krunal Chheda <kchheda3(a)bloomberg.net> for bringing up this
topic!).

Currently persistent notifications are retried indefinitely.
The only limiting mechanism that exists is that all notifications to a
specific topic are stored in one RADOS object (of size 128MB).
Assuming notifications are ~1KB at most, this would give us at least 128K
notifications that can wait in the queue.
When the queue fills up (e.g. the Kafka broker is down for 20 minutes while
we are sending ~100 notifications per second), we start sending "slow down"
replies to the client, and in this case the S3 operation is not performed.
This means that, for example, an outage of the Kafka system would
eventually cause an outage of our service. Note that the same may also
result from a misconfiguration of the Kafka broker, or from decommissioning
a broker.
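
To make the sizing above concrete, here is a back-of-the-envelope
calculation in Python. The constant and function names are illustrative
only and are not taken from the RGW code. It assumes the 128MB object, a
worst-case notification size of 1KB, and the ~100 notifications per second
from the example.

```python
# Assumed figures from the discussion above; names are illustrative only.
QUEUE_SIZE_BYTES = 128 * 1024 * 1024   # one RADOS object per topic (128MB)
NOTIFICATION_SIZE_BYTES = 1024         # ~1KB per notification, worst case
RATE_PER_SECOND = 100                  # notifications generated per second


def queue_capacity(queue_bytes: int, notification_bytes: int) -> int:
    """Number of worst-case-sized notifications that fit in the queue."""
    return queue_bytes // notification_bytes


def seconds_until_full(queue_bytes: int, notification_bytes: int,
                       rate_per_second: float) -> float:
    """How long a broker outage can last before the queue fills up."""
    return queue_capacity(queue_bytes, notification_bytes) / rate_per_second


capacity = queue_capacity(QUEUE_SIZE_BYTES, NOTIFICATION_SIZE_BYTES)
minutes = seconds_until_full(QUEUE_SIZE_BYTES, NOTIFICATION_SIZE_BYTES,
                             RATE_PER_SECOND) / 60
```

With these numbers the queue holds 131072 (128K) notifications and fills up
after roughly 22 minutes of broker downtime, which is in line with the
20 minute example.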

To avoid that, we propose several options:
* use a fifo instead of a queue. This would allow us to hold more than 128K
messages, and so survive longer broker outages, even at a higher message
rate. There should probably still be a limit on the size of the fifo
* define a maximum number of retries allowed for a notification
* define a maximum time that a notification may stay in the queue before it
is removed
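
As a rough sketch of how the last two options could work together, the
Python below drops a queued notification once either limit is reached. All
names here (RetryPolicy, QueuedNotification, should_drop, max_retries,
max_age_seconds) are hypothetical and only illustrate the idea. They are not
an existing RGW API, and leaving a limit unset keeps today's behavior of
retrying indefinitely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetryPolicy:
    """Hypothetical per-topic limits; None means 'no limit' (current behavior)."""
    max_retries: Optional[int] = None
    max_age_seconds: Optional[float] = None


@dataclass
class QueuedNotification:
    """Hypothetical bookkeeping kept for each entry waiting in the queue."""
    enqueued_at: float   # time the entry was queued, in seconds
    retries: int = 0     # retries already made after the first failed attempt


def should_drop(entry: QueuedNotification, policy: RetryPolicy,
                now: float) -> bool:
    """Return True if the entry should be removed instead of retried again."""
    # Limit on the number of retries.
    if policy.max_retries is not None and entry.retries >= policy.max_retries:
        return True
    # Limit on the time spent in the queue.
    if (policy.max_age_seconds is not None
            and now - entry.enqueued_at >= policy.max_age_seconds):
        return True
    return False
```

For example, with max_retries=3 and max_age_seconds=60, an entry that has
already been retried 3 times is dropped, and so is any entry that has been
waiting for 60 seconds or more, whichever happens first.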

We should probably start by defining these limits as topic attributes,
reflecting our delivery guarantees for that specific destination.
I will try to capture the results of the discussion in this tracker:
https://tracker.ceph.com/issues/61532

Thanks,

Yuval