Storing 20 billions of immutable objects in Ceph, 75% <16KB - ceph-users

17 Feb 2021

Hello,

TL;DR: Is it more advisable to work on Ceph internals to make it friendly to this
particular workload, or to write something similar to EOS[0] (i.e. RocksDB + XRootD + RBD)?

This is a follow-up to two previous mails[1] sent while researching this topic. In a
nutshell, the Software Heritage project[1] currently has ~750TB and 10 billion objects,
75% of which are smaller than 16KB and 50% smaller than 4KB. But these small objects
only account for ~5% of the 750TB: the 25% of objects larger than 16KB total
~700TB. The objects can be compressed by ~50%, so the 750TB only needs ~350TB of actual
storage (if you're interested in the details see [2]).

Let's say those 10 billion objects are stored in a single 4+2 erasure coded pool with
bluestore compression set for objects larger than 32KB and the smallest
allocation size for bluestore set to 4KB[3]. The 750TB won't use the expected 350TB
but about 30% more, i.e. ~450TB (see [4] for the maths). This space amplification is
because storing a 1 byte object uses the same space as storing a 16KB object (see [5] to
repeat the experiment at home). In a 4+2 erasure coded pool, each of the 6 chunks will use
no less than 4KB because that's the smallest allocation size for bluestore. That's
4 * 4KB = 16KB for the data chunks alone (6 * 4KB = 24KB counting the two coding chunks)
even when all that is needed is 1 byte.
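
To make the arithmetic above concrete, here is a minimal sketch of the allocation model
described in this paragraph. It is a simplification and not how bluestore accounts for
space internally: it assumes the object is split evenly into k data chunks, that each
chunk is rounded up to the smallest allocation size, and it ignores compression and
metadata. The function name ec_allocated_bytes is made up for this illustration.

```python
import math

KB = 1024


def ec_allocated_bytes(object_size, k=4, m=2, min_alloc=4 * KB):
    """Return (data_bytes, raw_bytes) allocated for one object.

    Simplified model (an assumption, not bluestore's real accounting):
    the object is split evenly into k data chunks, each chunk is rounded
    up to a multiple of min_alloc, and m coding chunks of the same size
    are added. Compression and metadata overhead are ignored.
    """
    chunk = math.ceil(max(object_size, 1) / k)
    chunk_alloc = math.ceil(chunk / min_alloc) * min_alloc
    return k * chunk_alloc, (k + m) * chunk_alloc


if __name__ == "__main__":
    for size in (1, 4 * KB, 16 * KB, 16 * KB + 1):
        data, raw = ec_allocated_bytes(size)
        print(f"{size:>6} bytes -> data {data // KB}KB, raw {raw // KB}KB")
```

Under these assumptions a 1 byte object and a 16KB object are allocated the same space,
and one extra byte beyond 16KB doubles it.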

It was suggested[6] to have two different pools: one 4+2 erasure coded pool with
compression for all objects larger than 32KB, which are expected to compress to 16KB,
and another with 3 replicas for the smaller objects, to reduce space amplification to a
minimum without compromising on durability. A client looking for an object could make two
simultaneous requests to the two pools. It would get a 404 from one of them and the object
from the other.
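
The lookup described above could be sketched as follows. This is only an illustration:
FakePool is a hypothetical in-memory stand-in for a Ceph pool, ObjectNotFound stands in
for the 404, and the 32KB routing threshold on write is my assumption about how objects
would be split between the two pools.

```python
from concurrent.futures import ThreadPoolExecutor

SMALL_LIMIT = 32 * 1024  # assumed routing threshold, not a Ceph setting


class ObjectNotFound(Exception):
    """Stand-in for the 404 answer of a pool that does not hold the object."""


class FakePool:
    """Hypothetical in-memory pool; a real client would talk to Ceph instead."""

    def __init__(self, name):
        self.name = name
        self._objects = {}

    def write(self, key, data):
        self._objects[key] = data

    def read(self, key):
        try:
            return self._objects[key]
        except KeyError:
            raise ObjectNotFound(key) from None


def put(pools, key, data):
    """Route by size: small objects go to the replicated pool, the rest to EC."""
    target = "replicated" if len(data) <= SMALL_LIMIT else "erasure"
    pools[target].write(key, data)


def get(pools, key):
    """Ask both pools at the same time and return the single hit."""

    def attempt(pool):
        try:
            return pool.read(key)
        except ObjectNotFound:
            return None

    with ThreadPoolExecutor(max_workers=len(pools)) as executor:
        results = list(executor.map(attempt, pools.values()))
    for data in results:
        if data is not None:
            return data
    raise ObjectNotFound(key)


pools = {"replicated": FakePool("replicated"), "erasure": FakePool("erasure")}
put(pools, "small", b"x")
put(pools, "large", b"y" * (64 * 1024))
```

The reader does not need to know the size of an object in advance, at the cost of one
wasted request per read.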

Another workaround is best described in the "Finding a needle in Haystack:
Facebook’s photo storage"[9] paper and essentially boils down to using a database to
store a map between the object name and its location. That does not scale out (writing the
database index is the bottleneck) but it's simple enough and is successfully
implemented in EOS[0] with >200PB worth of data and in seaweedfs[10], another promising
object store based on the same idea.
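
As a rough illustration of that idea, and nothing more: the sketch below packs objects
back to back in one large volume and keeps a map from the object name to its location.
It is not the on-disk format from the paper, EOS or seaweedfs. The PackedStore class is
hypothetical; the bytearray stands in for a large file or RBD image and the dict stands
in for the database.

```python
class PackedStore:
    """Append-only volume plus a name -> (offset, length) index."""

    def __init__(self):
        self.volume = bytearray()  # stands in for a large file or RBD image
        self.index = {}            # stands in for the database holding the map

    def put(self, name, data):
        """Append data to the volume and record where it landed."""
        if name in self.index:
            raise KeyError(f"{name!r} already stored; objects are immutable")
        offset = len(self.volume)
        self.volume += data
        self.index[name] = (offset, len(data))

    def get(self, name):
        """Look the location up in the index, then read that byte range."""
        offset, length = self.index[name]
        return bytes(self.volume[offset:offset + length])


store = PackedStore()
store.put("a", b"x")                # a 1 byte object uses 1 byte of the volume
store.put("b", b"hello world")
```

Because the objects are packed, a small object costs its own size plus one index entry
instead of a full allocation unit per chunk. Every write goes through the index, which
is where the bottleneck mentioned above comes from.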

Instead of working around the problem, maybe Ceph could be modified to make better use of
the immutability of these objects[7], a hint that is apparently only used to figure out
how to best compress them and for checksum calculation[8]. I honestly have no clue how
difficult it would be. All I know is that it's not easy, otherwise it would have been
done already: there seems to be a general need for efficiently (space-wise and
performance-wise) storing large quantities of objects smaller than 4KB.

Is it more advisable to:

  * work on Ceph internals to make it friendly to this particular workload or,
  * write another implementation of "Finding a needle in Haystack: Facebook’s photo
storage"[9] based on RBD[11]?

I'm currently leaning toward working on Ceph internals but there are pros and cons to
both approaches[12]. And since all this is still very new to me, there is also the
possibility that I'm missing something. Maybe it's *super* difficult to improve
Ceph in this way. I should try to figure that out sooner rather than later.

I realize it's a lot to take in and unless you're facing the exact same problem
there is very little chance you have read this far :-) But if you did... I'm *really*
interested to hear what you think. In any case I'll report back to this thread once a
decision has been made.

Cheers

[0] https://eos-web.web.cern.ch/eos-web/
[1]
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AEMW6O7WVJF…
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/RHQ5ZCHJISX…
[2] https://forge.softwareheritage.org/T3054
[3] https://github.com/ceph/ceph/blob/3f5e778ad6f055296022e8edabf701b6958fb602/…
[4] https://forge.softwareheritage.org/T3052#58864
[5] https://forge.softwareheritage.org/T3052#58917
[6] https://forge.softwareheritage.org/T3052#58876
[7] https://docs.ceph.com/en/latest/rados/api/librados/#c.@3.LIBRADOS_ALLOC_HIN…
[8] https://forge.softwareheritage.org/T3055
[9] https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf
[10] https://github.com/chrislusf/seaweedfs/wiki/Components
[11] https://forge.softwareheritage.org/T3049
[12] https://forge.softwareheritage.org/T3054#58977

-- 
Loïc Dachary, Artisan Logiciel Libre