Re: [ceph-users] Adventures with large RGW buckets

2 Aug 2019

On 8/2/19 3:04 AM, Harald Staub wrote:
...
  Right now our main focus is on the Veeam use case
(VMWare backup), used 
 with an S3 storage tier. Currently we host a bucket with 125M objects 
 and one with 100M objects.

 As Paul stated, searching common prefixes can be painful. We had some 
 cases that did not work (taking too much time, radosgw taking too much 
 memory) until the upgrade from 14.2.1 to 14.2.2, which includes an 
 important fix for that :-)

 We expect up to 400M objects per bucket. Following the 100k 
 recommendation, we started with 4096 shards per bucket.

 Other cases to search common prefixes took several minutes. It helped us 
 to reshard from 4096 to 1024, response time became nearly 3 times faster.

 It feels that the main reason to have shards is to get distribution of 
 index operations' load over several PGs and therefore over several OSDs. 
 So maybe a number of shards much higher than the number of PGs or OSDs 
 does not help a lot? But it introduces some overhead. Maybe it would be 
 better to have a recommendation based on the number of OSDs involved? 
More shards are also important for recovery. Overall recovery time for a
given bucket is reduced since each shard can be recovered in parallel.
With fewer shards, you get larger rados objects, each of which will take 
longer to recover, potentially causing a longer outage if there aren't
enough copies to be active.

...
  The mentioned resharding (4096 -> 1024) itself
worked ("completed 
 successfully"), but the removal of one of the old indexes did not. The 
 cluster saw an OSD going down, which seems to have aborted the cleanup. 
 This OSD stayed up, but there were timeouts, probably during RocksDB 
 compaction (from looking at the OSD log). The affected OSD has the 
 highest number of PGs of the index pool. Again, this would suggest that 
 a lot of shards does not help when many shards are processed together in 
 one RocksDB.

 Manually removing the objects of the old index one by one was no 
 problem. Maybe dynamic resharding could do it similarly to avoid the 
 RocksDB overload? Or RocksDB could be made to stay responsive? 
There's a lot of work going into making rocksdb more effective for these 
cases - much of it discussed in the performance weekly from July 25:

https://pad.ceph.com/p/performance_weekly

One of the major pieces there is sharding rocksdb into multiple column
families. This reduces the size of an individual LSM-tree within
rocksdb, meaning levels are smaller and compactions are faster. In
testing so far this significantly reduces tail latency (60-70% reduced
99% latency for pure-omap writes, similar to an rgw bucket index
workload).

Josh

...
  On 31.07.19 20:02, Paul Emmerich wrote:
  Hi,

 we are seeing a trend towards rather large RGW S3 buckets lately.
 we've worked on
 several clusters with 100 - 500 million objects in a single bucket, 
 and we've
 been asked about the possibilities of buckets with several billion 
 objects more
 than once.

  From our experience: buckets with tens of million objects work just 
 fine with
 no big problems usually. Buckets with hundreds of million objects 
 require some
 attention. Buckets with billions of objects? "How about indexless 
 buckets?" -
 "No, we need to list them".

 A few stories and some questions:

 1. The recommended number of objects per shard is 100k. Why? How was this
 default configuration derived?

 It doesn't really match my experiences. We know a few clusters running 
 with
 larger shards because resharding isn't possible for various reasons at 
 the
 moment. They sometimes work better than buckets with lots of shards.

 So we've been considering to at least double that 100k target shard size
 for large buckets, that would make the following point far less annoying.

 2. Many shards + ordered object listing = lots of IO

 Unfortunately telling people to not use ordered listings when they 
 don't really
 need them doesn't really work as their software usually just doesn't 
 support
 that :(

 A listing request for X objects will retrieve up to X objects from 
 each shard
 for ordering them. That will lead to quite a lot of traffic between 
 the OSDs
 and the radosgw instances, even for relatively innocent simple queries 
 as X
 defaults to 1000 usually.

 Simple example: just getting the first page of a bucket listing with 4096
 shards fetches around 1 GB of data from the OSD to return ~300kb or so 
 to the
 S3 client.

 I've got two clusters here that are only used for some relatively 
 low-bandwidth
 backup use case here. However, there are a few buckets with hundreds 
 of millions
 of objects that are sometimes being listed by the backup system.

 The result is that this cluster has an average read IO of 1-2 GB/s, 
 all going
 to the index pool. Not a big deal since that's coming from SSDs and 
 goes over
 80 Gbit/s LACP bonds. But it does pose the question about scalability
 as the user-
 visible load created by the S3 clients is quite low.

 3. Deleting large buckets

 Someone accidentaly put 450 million small objects into a bucket and 
 only noticed
 when the cluster ran full. The bucket isn't needed, so just delete it 
 and case
 closed?

 Deleting is unfortunately far slower than adding objects, also
 radosgw-admin leaks
 memory during deletion: https://tracker.ceph.com/issues/40700

 Increasing --max-concurrent-ios helps with deletion speed (option does 
 effect
 deletion concurrency, documentation says it's only for other specific 
 commands).

 Since the deletion is going faster than new data is being added to 
 that cluster
 the "solution" was to run the deletion command in a memory-limited 
 cgroup and
 restart it automatically after it gets killed due to leaking.

 How could the bucket deletion of the future look like? Would it be 
 possible
 to put all objects in buckets into RADOS namespaces and implement some 
 kind
 of efficient namespace deletion on the OSD level similar to how pool 
 deletions
 are handled at a lower level?

 4. Common prefixes could filtered in the rgw class on the OSD instead
 of in radosgw

 Consider a bucket with 100 folders with 1000 objects in each and only 
 one shard

 /p1/1, /p1/2, ..., /p1/1000, /p2/1, /p2/2, ..., /p2/1000, ... /p100/1000

 Now a user wants to list / with aggregating common prefixes on the
 delimiter / and
 wants up to 1000 results.
 So there'll be 100 results returned to the client: the common prefixes
 p1 to p100.

 How much data will be transfered between the OSDs and radosgw for this 
 request?
 How many omap entries does the OSD scan?

 radosgw will ask the (single) index object to list the first 1000 
 objects. It'll
 return 1000 objects in a quite unhelpful way: /p1/1, /p1/2, ...., 
 /p1/1000

 radosgw will discard 999 of these and detect one common prefix and 
 continue the
 iteration at /p1/\xFF to skip the remaining entries in /p1/ if there 
 are any.
 The OSD will then return everything in /p2/ in that next request and 
 so on.

 So it'll internally list every single object in that bucket. That will
 be a problem
 for large buckets and having lots of shards doesn't help either.

 This shouldn't be too hard to fix: add an option "aggregate prefixes" 
 to the RGW
 class method and duplicate the fast-forward logic from radosgw in
 cls_rgw. It doesn't
 even need to change the response type or anything, it just needs to
 limit entries in
 common prefixes to one result.
 Is this a good idea or am I missing something?

 IO would be reduced by a factor of 100 for that particular
 pathological case. I've
 unfortunately seen a real-world setup that I think hits a case like that.

 Paul
  _______________________________________________
 ceph-users mailing list
 ceph-users(a)lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

2024

2023

2022

2021

2020

2019

Re: [ceph-users] Adventures with large RGW buckets