On Thu, Feb 18, 2021 at 12:36 AM Robin H. Johnson <robbat2(a)gentoo.org> wrote:
On Wed, Feb 17, 2021 at 05:36:53PM +0100, Loïc Dachary wrote:
> > TL;DR: Is it more advisable to work on Ceph internals to make it
> > friendly to this particular workload or write something similar to
> > EOS (i.e. Rocksdb + Xrootd + RBD)?
> CERN's EOSPPC instance, which is one of the biggest from what I can
> find, was up around 3.5B files in 2019; and you're proposing running 10B
> files, so I don't know how EOS will handle that. Maybe Dan can chime in
> on the scalability there.
The EOS namespace is now QuarkDB https://github.com/gbitzes/QuarkDB
But even with a clever namespace I don't think it is practical to
manage a system with 10B tiny files.
Enumerating them for a consistency check or migrating between hosts or
recovering from failures is going to be painful.
> Please do keep on this important work! I've tried to do something
> similar at a much smaller scale for Gentoo Linux's historical collection
> of source code media (distfiles), but am significantly further behind
> your effort.
> > Let's say those 10 billion objects are stored in a single 4+2 erasure
> > coded pool with bluestore compression set for objects larger than 32KB
> > and the smallest allocation size for bluestore set to 4KB.
> > The 750TB won't use the expected 350TB but about 30% more, i.e.
> > ~450TB (see  for the maths). This space amplification is because
> > storing a 1 byte object uses the same space as storing a 16KB object
> > (see  to repeat the experience at home). In a 4+2 erasure coded
> > pool, each of the 6 chunks will use no less than 4KB because that's
> > the smallest allocation size for bluestore. That's 4 * 4KB = 16KB
> > even when all that is needed is 1 byte.
> I think you have an error here: with a 4KB allocation size in a 4+2 pool,
> any object sized (0,16K] will take _6_ allocation units: 24KB of storage.
> Any object sized (16K,32K] will take _12_ allocation units: 48KB of storage.
> I'd attack this from another side entirely:
> - how aggressively do you want to pack objects overall? e.g. if you have
> a few thousand objects in the 4-5K range, do you want zero bytes
> wasted between objects?
> - how aggressively do you want to dedup objects that share common data,
> esp. if it's not aligned on some common byte boundaries?
> - what are the data portability requirements to move/extract data from
> this system at a later point?
> - how complex of an index are you willing to maintain to
> reconstruct/access data?
> - What requirements are there about the ordering and accessibility of
> the packs? How related do the packed objects need to be? e.g. are
> objects packed as they arrive in time order, building up successive
> packs until they reach a target size, or are there many packs and you
> append each object to the "correct" pack?
> I'm normally distinctly in the camp that object storage systems should
> natively expose all objects, but that also doesn't account for your
> immutability/append-only nature.
> I see your discussion at https://forge.softwareheritage.org/T3054#58977
> as well, about the "full scale out" vs "scale up metadata & scale
> data" parts.
> To brainstorm parts of an idea, I'm wondering about Git's
> still-in-development partial clone work, with the caveat that you intend
> to NEVER check out the entire repository at the same time.
> Ideally, using some manner of fuse filesystem (similar to Git Virtual
> Filesystem) w/ an index-only clone, naive clients could access the
> object they wanted, which would be fetched on demand from the git server
> which has mostly git packs and a few loose objects that are waiting to
> be packed.
> The write path on ingest clients would involve sending back the new
> data, and git background processes on some regular interval packing the
> loose objects into new packfiles.
> Running this on top of CephFS for now means that you get the ability to
> move it to future storage systems more easily than any custom RBD/EOS
> development you might do: bring up enough space, sync the files over,
> Git handles the deduplication, compression, access methods, and
> generates large pack files, which Ceph can store more optimally than the
> plethora of tiny objects.
> Overall, this isn't great, but there aren't a lot of alternatives as
> your great research has noted.
> Being able to take a backup of the Git-on-CephFS is also made a lot
> easier since it's a filesystem: "just" write out the 350TB to 20x
> Thinking back to older systems, like SGI's hierarchical storage modules
> for XFS, the packing overhead starts to become significant for your
> objects: some of the underlying mechanisms in the XFS HSM DMAPI, when
> they packed immutable objects to tape, still used tar & tar-like
> headers (at least 512 bytes per object), so your 10B objects would take
> at least 4TB of extra space (before compression).
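The 512-byte-per-object figure is easy to sanity-check:

```python
objects = 10_000_000_000        # 10B objects
header = 512                    # minimum tar header size per object
overhead = objects * header     # 5.12e12 bytes of pure header overhead
overhead / 2**40                # ~4.66 TiB, i.e. "at least 4TB"
```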
> > It was suggested to have two different pools: one with a 4+2 erasure
> > coded pool and compression for all objects with a size > 32KB that are
> > expected to compress to 16KB. And another with 3 replicas for the
> > smaller objects to reduce space amplification to a minimum without
> > compromising on durability. A client looking for the object could make
> > two simultaneous requests to the two pools. They would get a 404 from
> > one of them and the object from the other.
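The two-pool lookup could be sketched like this (a toy stand-in using dicts, not the real RADOS API; pool names and contents are made up):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the two pools: a 4+2 EC pool with
# compression for larger objects, and a 3-replica pool for small ones.
ec_pool = {"big-object": b"x" * 40000}
replica_pool = {"small-object": b"tiny"}

def get_object(name):
    """Query both pools in parallel; at most one is expected to hit."""
    def lookup(pool):
        return pool.get(name)  # None plays the role of a 404
    with ThreadPoolExecutor(max_workers=2) as executor:
        results = list(executor.map(lookup, (ec_pool, replica_pool)))
    hits = [r for r in results if r is not None]
    return hits[0] if hits else None
```

The client never needs to know which size class an object fell into; the cost is one wasted request per read.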
> > Another workaround is best described in the "Finding a needle in
> > Haystack: Facebook's photo storage" paper and essentially boils down
> > to using a database to store a map between the object name and its
> > location. That does not scale out (writing the database index is the
> > bottleneck) but it's simple enough and is successfully implemented in
> > EOS with >200PB worth of data and in seaweedfs, another promising
> > object store based on the same idea.
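The Haystack idea reduces to very little code (a minimal in-memory sketch; a real system would use a database for the index and large volume files on RBD):

```python
import io

# One big append-only volume file; the index maps name -> (offset, length).
volume = io.BytesIO()
index = {}

def put(name, data):
    """Append the object to the volume; record its location in the index."""
    offset = volume.seek(0, io.SEEK_END)
    volume.write(data)
    index[name] = (offset, len(data))  # this index write is the bottleneck

def get(name):
    """One index lookup, then one positioned read from the volume."""
    offset, length = index[name]
    volume.seek(offset)
    return volume.read(length)
```

Tiny objects cost only an index entry each instead of a full allocation unit, which is exactly the space amplification being avoided.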
> > Instead of working around the problem, maybe Ceph could be modified to
> > make better use of the immutability of these objects, a hint that is
> > apparently only used to figure out how best to compress them and for
> > checksum calculation. I honestly have no clue how difficult that would
> > be. All I know is that it's not easy, otherwise it would have been done
> > already: there seems to be a general need for efficiently (space wise
> > and performance wise) storing large quantities of objects smaller than
> > 4KB.
> > Is it more advisable to:
> > * work on Ceph internals to make it friendly to this particular workload or,
> > * write another implementation of "Finding a needle in Haystack:
> >   Facebook's photo storage" based on RBD?
> > I'm currently leaning toward working on Ceph internals but there are
> > pros and cons to both approaches. And since all this is still very new
> > to me, there is also the possibility that I'm missing something. Maybe
> > it's *super* difficult to improve Ceph in this way. I should try to
> > figure that out sooner rather than later.
> > I realize it's a lot to take in and unless you're facing the exact same
> > problem there is very little chance you read this far :-) But if you
> > did... I'm *really* interested to hear what you think. In any case I'll
> > report back to this thread once a decision has been made.
> > Cheers
> >  https://eos-web.web.cern.ch/eos-web/
> > 
> >  https://forge.softwareheritage.org/T3054
> > 
> >  https://forge.softwareheritage.org/T3052#58864
> >  https://forge.softwareheritage.org/T3052#58917
> >  https://forge.softwareheritage.org/T3052#58876
> > 
> >  https://forge.softwareheritage.org/T3055
> >  https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf
> >  https://github.com/chrislusf/seaweedfs/wiki/Components
> >  https://forge.softwareheritage.org/T3049
> >  https://forge.softwareheritage.org/T3054#58977
> > --
> > Loïc Dachary, Artisan Logiciel Libre
> > _______________________________________________
> > ceph-users mailing list -- ceph-users(a)ceph.io
> > To unsubscribe send an email to ceph-users-leave(a)ceph.io
> Robin Hugh Johnson
> Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
> E-Mail : robbat2(a)gentoo.org
> GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
> GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136