Hello,
We recently received a problem report from an internal customer of our S3
service. Our setup consists of roughly 10 servers with 140 OSDs. Our 3 RGWs are
collocated with the monitors on dedicated servers, in an HA setup with HAProxy
in front. We are running Ceph 16.2.14, deployed with Cephadm on Podman.
Our S3 service handles a constant load of 500 req/s on average per RGW instance.
The problem is described in this issue:
https://tracker.ceph.com/issues/63935.
Basically, this customer runs a Grafana Mimir instance that writes to our S3,
and during compaction it produces a request pattern like this (HAProxy logs):
```
29/Dec/2023:17:13:28.961 rgw-frontend~ rgw-backend/server-mon-01-rgw0 0/0/0/127/127 200 228 - - ---- 132/132/70/67/0 0/0 "PUT /1234/object HTTP/1.1"
29/Dec/2023:17:13:29.101 rgw-frontend~ rgw-backend/server-mon-01-rgw0 0/0/0/1/1 200 381 - - ---- 132/132/76/71/0 0/0 "GET /1234/object HTTP/1.1"
29/Dec/2023:17:13:29.121 rgw-frontend~ rgw-backend/server-mon-01-rgw0 0/0/0/1/1 200 381 - - ---- 132/132/71/59/0 0/0 "GET /1234/object HTTP/1.1"
29/Dec/2023:17:13:29.137 rgw-frontend~ rgw-backend/server-mon-03-rgw0 0/0/0/4/4 204 153 - - ---- 132/132/71/6/0 0/0 "DELETE /1234/object HTTP/1.1"
29/Dec/2023:19:03:21.671 rgw-frontend~ rgw-backend/server-mon-03-rgw0 0/0/0/1/1 404 472 - - ---- 55/55/26/0/0 0/0 "GET /1234/object HTTP/1.1"
```
The PUT, the two GETs and the DELETE all happen within about 200 ms. Afterwards,
the customer still sees the deleted object when running ListObjects on the
bucket, but any attempt to access it makes RGW return a 404 (see the last log
line, almost two hours later).
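For anyone who wants to try this against their own cluster, the sequence can be
approximated with a short client-side script. This is only a sketch, assuming
Python with boto3; `replay_pattern`, `is_not_found`, the endpoint URL and the
bucket/key names are made up for illustration, and it is not established that
this sequence alone triggers the problem.

```python
def is_not_found(exc):
    """True if exc looks like a botocore ClientError carrying an HTTP 404."""
    resp = getattr(exc, "response", None) or {}
    code = str(resp.get("Error", {}).get("Code", ""))
    status = resp.get("ResponseMetadata", {}).get("HTTPStatusCode")
    return status == 404 or code in ("404", "NoSuchKey", "NotFound")


def replay_pattern(s3, bucket, key, body=b"x"):
    """Issue PUT, GET, GET, DELETE back to back, then inspect the result.

    `s3` is expected to behave like a boto3 S3 client.
    Returns (listed, readable): whether the key still shows up in a listing
    and whether a GET on it still succeeds.
    """
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    s3.delete_object(Bucket=bucket, Key=key)

    page = s3.list_objects_v2(Bucket=bucket, Prefix=key)
    listed = any(o["Key"] == key for o in page.get("Contents", []))
    try:
        s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        readable = True
    except Exception as exc:
        if not is_not_found(exc):
            raise  # anything other than a 404 is unexpected here
        readable = False
    return listed, readable


# Hypothetical usage (endpoint and names are placeholders):
#   import boto3
#   s3 = boto3.client("s3", endpoint_url="https://s3.example.internal")
#   print(replay_pattern(s3, "test-bucket", "compaction/object"))
```

A result of `(True, False)` would match the symptom described above: the key is
still listed but can no longer be read.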
Looking at Ceph directly, the object still has a bucket index entry, but the
associated RADOS object no longer exists. The bucket has neither versioning nor
object locking enabled.
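From the client side, a rough way to see how many keys in a bucket are in this
state is to list the bucket and HEAD every key, collecting those that are
listed but answer 404. The following is a sketch only, assuming Python with
boto3; `find_ghost_keys` is a made-up helper name, and it checks what the S3 API
reports, not the bucket index or the RADOS objects themselves.

```python
def find_ghost_keys(s3, bucket, prefix=""):
    """Return keys that appear in ListObjectsV2 but answer 404 on HEAD.

    `s3` is expected to behave like a boto3 S3 client.
    """
    ghosts = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            try:
                s3.head_object(Bucket=bucket, Key=key)
            except Exception as exc:
                # botocore's ClientError exposes the parsed error in .response
                resp = getattr(exc, "response", None) or {}
                code = str(resp.get("Error", {}).get("Code", ""))
                status = resp.get("ResponseMetadata", {}).get("HTTPStatusCode")
                if status == 404 or code in ("404", "NoSuchKey", "NotFound"):
                    ghosts.append(key)
                else:
                    raise  # do not hide unrelated failures
    return ghosts


# Hypothetical usage (endpoint and bucket name are placeholders):
#   import boto3
#   s3 = boto3.client("s3", endpoint_url="https://s3.example.internal")
#   for key in find_ghost_keys(s3, "test-bucket"):
#       print(key)
```

On a large bucket this issues one HEAD per object, so it is probably best run
against a narrow prefix.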
Has anyone encountered something similar? Thank you!
Regards,
--
Mathias Chapelain
Storage Engineer
Proton AG