Why not put all the data in a ZFS pool with a 3-4 level deep directory
structure, each directory named with two hex characters (one byte) in the
range 00-FF?
Four levels deep, you get 256^4 = ~4.3B folders with 2-3 objects per folder,
or three levels deep you get 256^3 = ~16.8M folders with ~600 objects
each.
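
To make the layout concrete, here is a rough Python sketch of how an object
could be mapped to such a path. It assumes the object name is the SHA256 of
its content (as mentioned further down the thread); the names shard_path and
objects_per_dir are made up for illustration.

```python
import hashlib
from pathlib import PurePosixPath


def shard_path(content: bytes, levels: int = 3) -> PurePosixPath:
    """Return a relative path like e3/b0/c4/<full sha256> for the content.

    Each directory level is named after one byte of the digest, written
    as two hex characters (00-ff).
    """
    digest = hashlib.sha256(content).hexdigest()
    parts = [digest[2 * i:2 * i + 2] for i in range(levels)]
    return PurePosixPath(*parts, digest)


def objects_per_dir(total_objects: int, levels: int) -> float:
    """Average number of objects per leaf directory, assuming uniform hashes."""
    return total_objects / 256 ** levels


if __name__ == "__main__":
    print(shard_path(b"hello world", levels=3))
    print(objects_per_dir(10_000_000_000, levels=3))
    print(objects_per_dir(10_000_000_000, levels=4))
```

Since SHA256 output is uniformly distributed, the directories should fill
evenly, which is where the per-folder averages above come from.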
On Wed, Feb 17, 2021 at 8:14 PM Loïc Dachary <loic(a)dachary.org> wrote:
>
> Hi Nathan,
>
> Good thinking :-) The names of the objects are indeed the SHA256 of their
> content, which provides deduplication.
>
> Cheers
>
> On 17/02/2021 18:04, Nathan Fish wrote:
> > I'm not much of a programmer, but as soon as I hear "immutable
> > objects" I think "content-addressed". I don't know if you have many
> > duplicate objects in this set, but content-addressing gives you
> > object-level dedup for free. Do you have to preserve some meaningful
> > object names from the original dataset, or do you just need some
> > kind of ID?
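
For readers who have not met the idea before, here is a minimal in-memory
sketch of content addressing in Python. The ContentStore class is purely
illustrative and not part of any existing tool; it only shows why writing
the same bytes twice costs nothing extra.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key is the SHA256 of the value."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Identical content always hashes to the same key, so a duplicate
        # write is a no-op: this is the object-level dedup.
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)


if __name__ == "__main__":
    store = ContentStore()
    first = store.put(b"same bytes")
    second = store.put(b"same bytes")
    print(first == second, len(store))
```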
> >
> > On Wed, Feb 17, 2021 at 11:37 AM Loïc Dachary <loic(a)dachary.org> wrote:
> >> Bonjour,
> >>
> >> TL;DR: Is it more advisable to work on Ceph internals to make it friendly
> >> to this particular workload or write something similar to EOS[0] (i.e.
> >> RocksDB + XRootD + RBD)?
> >>
> >> This is a followup of two previous mails[1] sent while researching this
> >> topic. In a nutshell, the Software Heritage project[1] currently has ~750TB
> >> and 10 billion objects, 75% of which have a size smaller than 16KB and 50%
> >> have a size smaller than 4KB. But they only account for ~5% of the 750TB:
> >> 25% of the objects have a size > 16KB and total ~700TB. The objects can be
> >> compressed by ~50% and 750TB only needs 350TB of actual storage (if you're
> >> interested in the details, see [2]).
> >>
> >> Let's say those 10 billion objects are stored in a single 4+2 erasure coded
> >> pool with bluestore compression set for objects that have a size > 32KB and
> >> the smallest allocation size for bluestore set to 4KB[3]. The 750TB won't
> >> use the expected 350TB but about 30% more, i.e. ~450TB (see [4] for the
> >> maths). This space amplification is because storing a 1 byte object uses the
> >> same space as storing a 16KB object (see [5] to repeat the experiment at
> >> home). In a 4+2 erasure coded pool, each of the 6 chunks will use no less
> >> than 4KB because that's the smallest allocation size for bluestore. That's
> >> 4 * 4KB = 16KB for the data chunks alone (6 * 4KB = 24KB with the two coding
> >> chunks) even when all that is needed is 1 byte.
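
The arithmetic in the paragraph above can be checked with a small Python
model. This is a deliberate simplification and not how bluestore actually
accounts for space: it assumes an object is split evenly over the k data
chunks and each chunk is rounded up to the minimum allocation size. The
function name and parameters are made up for the example.

```python
import math


def ec_allocated_bytes(size: int, k: int = 4, m: int = 2,
                       min_alloc: int = 4096) -> tuple[int, int]:
    """Return (data_bytes, raw_bytes) allocated for one object.

    Simplified model: each of the k data chunks holds ceil(size / k)
    bytes, rounded up to a multiple of min_alloc; the m coding chunks
    have the same allocated size as a data chunk.
    """
    chunk = max(1, math.ceil(size / k))
    chunk_alloc = math.ceil(chunk / min_alloc) * min_alloc
    return k * chunk_alloc, (k + m) * chunk_alloc


if __name__ == "__main__":
    for size in (1, 4096, 16384, 16385):
        data, raw = ec_allocated_bytes(size)
        print(f"{size:>6} bytes -> {data} data bytes, {raw} raw bytes")
```

In this model a 1 byte object and a 16KB object allocate exactly the same
amount, which is the amplification described above.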
> >>
> >> It was suggested[6] to have two different pools: one 4+2 erasure coded pool
> >> with compression for all objects with a size > 32KB that are expected to
> >> compress to 16KB. And another with 3 replicas for the smaller objects to
> >> reduce space amplification to a minimum without compromising on durability.
> >> A client looking for the object could make two simultaneous requests to the
> >> two pools. They would get 404 from one of them and the object from the
> >> other.
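
The client side of the two pool idea could look roughly like the Python
sketch below. It does not use any Ceph API: each pool is represented by a
hypothetical callable that returns the object bytes, or None for the "404"
case.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional, Sequence

# Hypothetical fetcher type: takes an object name, returns its bytes or
# None when the pool does not have it.
Fetcher = Callable[[str], Optional[bytes]]


def lookup(name: str, pools: Sequence[Fetcher]) -> Optional[bytes]:
    """Query every pool at the same time and return the first hit."""
    with ThreadPoolExecutor(max_workers=len(pools)) as executor:
        futures = [executor.submit(fetch, name) for fetch in pools]
        for future in as_completed(futures):
            data = future.result()
            if data is not None:
                return data
    return None  # not found in any pool


if __name__ == "__main__":
    # Two dicts stand in for the replicated and the erasure coded pool.
    small_pool = {"aa11": b"tiny object"}
    large_pool = {"bb22": b"x" * 40000}
    print(lookup("aa11", [small_pool.get, large_pool.get]))
    print(lookup("missing", [small_pool.get, large_pool.get]))
```

The cost of this approach is that every read issues two requests, one of
which is always wasted.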
> >>
> >> Another workaround is best described in the "Finding a needle in Haystack:
> >> Facebook’s photo storage"[9] paper and essentially boils down to using a
> >> database to store a map between the object name and its location. That does
> >> not scale out (writing the database index is the bottleneck) but it's
> >> simple enough and is successfully implemented in EOS[0] with >200PB worth
> >> of data and in seaweedfs[10], another promising object store software based
> >> on the same idea.
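
As a toy illustration of the "map between the object name and its location"
idea, here is a Python sketch that packs small objects back to back into one
append-only volume and keeps an index of (offset, size). It is not taken from
the Haystack paper, EOS or seaweedfs; the PackedVolume class is invented for
the example, the volume is an in-memory buffer and the index a plain dict,
where a real system would use a large file or RBD image and a database.

```python
import io
from typing import BinaryIO, Optional


class PackedVolume:
    """Append-only volume holding many small immutable objects."""

    def __init__(self, volume: Optional[BinaryIO] = None) -> None:
        self._volume = volume if volume is not None else io.BytesIO()
        self._index: dict[str, tuple[int, int]] = {}  # name -> (offset, size)

    def put(self, name: str, data: bytes) -> None:
        if name in self._index:
            return  # objects are immutable, nothing to overwrite
        offset = self._volume.seek(0, io.SEEK_END)  # always append
        self._volume.write(data)
        self._index[name] = (offset, len(data))

    def get(self, name: str) -> bytes:
        offset, size = self._index[name]
        self._volume.seek(offset)
        return self._volume.read(size)


if __name__ == "__main__":
    vol = PackedVolume()
    vol.put("obj-a", b"hello")
    vol.put("obj-b", b"xy")
    print(vol.get("obj-a"), vol.get("obj-b"))
```

Because objects are stored contiguously, a 1 byte object costs 1 byte in the
volume plus its index entry, instead of a full allocation unit.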
> >>
> >> Instead of working around the problem, maybe Ceph could be modified to make
> >> better use of the immutability of these objects[7], a hint that is
> >> apparently only used to figure out how to best compress them and for
> >> checksum calculation[8]. I honestly have no clue how difficult it would be.
> >> All I know is that it's not easy, otherwise it would have been done
> >> already: there seems to be a general need for efficiently (space wise and
> >> performance wise) storing large quantities of objects smaller than 4KB.
> >>
> >> Is it more advisable to:
> >>
> >> * work on Ceph internals to make it friendly to this particular workload,
> >>   or
> >> * write another implementation of "Finding a needle in Haystack:
> >>   Facebook’s photo storage"[9] based on RBD[11]?
> >>
> >> I'm currently leaning toward working on Ceph internals but there are pros
> >> and cons to both approaches[12]. And since all this is still very new to
> >> me, there also is the possibility that I'm missing something. Maybe it's
> >> *super* difficult to improve Ceph in this way. I should try to figure that
> >> out sooner rather than later.
> >>
> >> I realize it's a lot to take in and unless you're facing the exact same
> >> problem there is very little chance you read that far :-) But if you did...
> >> I'm *really* interested to hear what you think. In any case I'll report
> >> back to this thread once a decision has been made.
> >>
> >> Cheers
> >>
> >> [0] https://eos-web.web.cern.ch/eos-web/
> >> [1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AEMW6O7WVJF…
> >>     https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/RHQ5ZCHJISX…
> >> [2] https://forge.softwareheritage.org/T3054
> >> [3] https://github.com/ceph/ceph/blob/3f5e778ad6f055296022e8edabf701b6958fb602/…
> >> [4] https://forge.softwareheritage.org/T3052#58864
> >> [5] https://forge.softwareheritage.org/T3052#58917
> >> [6] https://forge.softwareheritage.org/T3052#58876
> >> [7] https://docs.ceph.com/en/latest/rados/api/librados/#c.@3.LIBRADOS_ALLOC_HIN…
> >> [8] https://forge.softwareheritage.org/T3055
> >> [9] https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf
> >> [10] https://github.com/chrislusf/seaweedfs/wiki/Components
> >> [11] https://forge.softwareheritage.org/T3049
> >> [12] https://forge.softwareheritage.org/T3054#58977
> >>
> >> --
> >> Loïc Dachary, Artisan Logiciel Libre
> >>
> >>
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users(a)ceph.io
> >> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>
>