Hi Paul,

I’ve turned the following idea of yours into a tracker:

https://tracker.ceph.com/issues/41051

4. Common prefixes could filtered in the rgw class on the OSD instead
of in radosgw

Consider a bucket with 100 folders with 1000 objects in each and only one shard

/p1/1, /p1/2, ..., /p1/1000, /p2/1, /p2/2, ..., /p2/1000, ... /p100/1000


Now a user wants to list / with aggregating common prefixes on the
delimiter / and
wants up to 1000 results.
So there'll be 100 results returned to the client: the common prefixes
p1 to p100.

How much data will be transfered between the OSDs and radosgw for this request?
How many omap entries does the OSD scan?

radosgw will ask the (single) index object to list the first 1000 objects. It'll
return 1000 objects in a quite unhelpful way: /p1/1, /p1/2, ...., /p1/1000

radosgw will discard 999 of these and detect one common prefix and continue the
iteration at /p1/\xFF to skip the remaining entries in /p1/ if there are any.
The OSD will then return everything in /p2/ in that next request and so on.

So it'll internally list every single object in that bucket. That will
be a problem
for large buckets and having lots of shards doesn't help either.


This shouldn't be too hard to fix: add an option "aggregate prefixes" to the RGW
class method and duplicate the fast-forward logic from radosgw in
cls_rgw. It doesn't
even need to change the response type or anything, it just needs to
limit entries in
common prefixes to one result.
Is this a good idea or am I missing something?

IO would be reduced by a factor of 100 for that particular
pathological case. I've
unfortunately seen a real-world setup that I think hits a case like that.

Eric

--
J. Eric Ivancich
he/him/his
Red Hat Storage
Ann Arbor, Michigan, USA