On 18/02/2021 00:35, Robin H. Johnson wrote:
> On Wed, Feb 17, 2021 at 05:36:53PM +0100, Loïc Dachary wrote:
>> TL;DR: Is it more advisable to work on Ceph internals to make it
>> friendly to this particular workload or write something similar to
>> EOS (i.e. RocksDB + XRootD + RBD)?
> CERN's EOSPPC instance, which is one of the biggest from what I can
> find, was up around 3.5B files in 2019; and you're proposing running
> 10B files, so I don't know how EOS will handle that. Maybe Dan can
> chime in on the scalability there.
This is an essential piece of information I was missing. It also makes
sense that there are much larger objects in the context of CERN.
> Please do keep on with this important work! I've tried to do something
> similar at a much smaller scale for Gentoo Linux's historical
> collection of source code media (distfiles), but am significantly
> further behind.
Thanks for the encouragement! These are very preliminary stages, but I'm
enthusiastic about what will follow because I'll have the opportunity to
work on it until a solution is implemented and deployed.
>> Let's say those 10 billion objects are stored in a single 4+2 erasure
>> coded pool, with bluestore compression set for objects that have a
>> size larger than 32KB and the smallest allocation size for bluestore
>> set to 4KB. The 750TB won't use the expected 350TB but about 30%
>> more, i.e. ~450TB (see  for the maths). This space amplification is
>> because storing a 1 byte object uses the same space as storing a 16KB
>> object (see  to repeat the experiment at home). In a 4+2 erasure
>> coded pool, each of the 6 chunks will use no less than 4KB because
>> that's the smallest allocation size for bluestore. That's
>> 4 * 4KB = 16KB even when all that is needed is 1 byte.
> I think you have an error here: with a 4KB allocation size in a 4+2
> pool, any object sized (0,16K] will take _6_ chunks: 24KB of storage.
> Any object sized (16K,32K] will take _12_ chunks: 48KB of storage.
I should have mentioned that my calculations were ignoring the
replication overhead (parity chunks or copies). Good catch :-)
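For my own benefit, here is a back-of-the-envelope sketch of the
corrected math. It assumes a 4+2 pool, a 4KB bluestore allocation unit,
and that every chunk (data and parity) is rounded up to whole allocation
units; it glosses over the details of how bluestore really lays data
out:

import math

K, M = 4, 2              # 4+2 erasure coded pool
MIN_ALLOC = 4 * 1024     # smallest bluestore allocation size (4KB)

def stored_bytes(object_size):
    """Bytes allocated for one object, counting data and parity chunks."""
    chunk = math.ceil(object_size / K)                    # size of each data chunk
    allocated = math.ceil(chunk / MIN_ALLOC) * MIN_ALLOC  # rounded up to allocation units
    return (K + M) * allocated                            # parity chunks have the same size

for size in (1, 4 * 1024, 16 * 1024, 17 * 1024, 32 * 1024):
    print(f"{size:>6} B object -> {stored_bytes(size) // 1024} KB stored")
# anything in (0,16K] allocates 6 * 4KB = 24KB,
# anything in (16K,32K] allocates 12 * 4KB = 48KB

which matches your 6 and 12 chunks.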
> I'd attack this from another side entirely:
> - how aggressively do you want to pack objects overall? e.g. if you have
>   a few thousand objects in the 4-5K range, do you want zero bytes
>   wasted between objects?
50% of the objects have a size <4KB, that is ~5 billion currently and
growing. *But* they account for only 1% of the total size. So maybe not
very aggressively, but not passively either.
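To put rough numbers on it, using the figures above and the 24KB floor
from the sketch earlier in this mail (back-of-the-envelope only):

total_bytes   = 750e12     # 750TB of payload overall
small_objects = 5e9        # ~5 billion objects smaller than 4KB
small_share   = 0.01       # they hold about 1% of the bytes
floor_per_obj = 24 * 1024  # minimum allocation per object in the 4+2 pool (see above)

payload = total_bytes * small_share
print(f"payload held by small objects: {payload / 1e12:.1f} TB")          # ~7.5 TB
print(f"average small object size: {payload / small_objects:.0f} B")      # ~1500 B
print(f"if stored as one RADOS object each: "
      f"{small_objects * floor_per_obj / 1e12:.0f} TB")                   # ~123 TB

That is why packing them matters even though they are a tiny fraction of
the bytes.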
> - how aggressively do you want to dedup objects that share common data,
>   esp. if it's not aligned on some common byte margins?
The objects are addressed by the SHA256 of their content and that takes
care of deduplication.
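To make that concrete, this is roughly what I mean by content addressing
(a toy sketch, not the actual implementation):

import hashlib

store = {}                                   # sha256 hex -> content

def put(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    store.setdefault(key, data)              # writing the same content twice is a no-op
    return key

assert put(b"#!/bin/sh\n") == put(b"#!/bin/sh\n")
assert len(store) == 1                       # duplicates are stored exactly once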
> - what are the data portability requirements to move/extract data from
>   this system at a later point?
Data portability is ensured by using Free Software only and open
standards where possible, and by distributing the software in a way that
can be conveniently installed by a third party. Does that answer your
question? The durability of the software/format pair used to store the
data is something I'm not worried about, but maybe I should be.
> - how complex of an index are you willing to maintain?
I don't envision the index being more complex than
SHA256 => content (roughly).
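Something along these lines, where the index only maps the hash to a
location inside a pack (the Pack class and its layout are invented for
the sake of the example, not the actual design):

import hashlib

class Pack:
    """Toy pack: objects appended back to back, SHA256 -> (offset, length) index."""

    def __init__(self, path):
        self.path = path
        self.pack = open(path, "ab")    # one growing pack file
        self.offset = self.pack.tell()
        self.index = {}                 # sha256 hex -> (offset, length)

    def add(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key not in self.index:       # identical content is stored only once
            self.pack.write(data)
            self.pack.flush()
            self.index[key] = (self.offset, len(data))
            self.offset += len(data)    # no padding: zero bytes wasted between objects
        return key

    def get(self, key: str) -> bytes:
        offset, length = self.index[key]
        with open(self.path, "rb") as pack:
            pack.seek(offset)
            return pack.read(length)

In practice the index would live in some key-value store and the packs
could sit on RBD images, but the mapping itself stays as simple as
SHA256 => where the bytes are.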
> - What requirements are there about the ordering and the packs? How
>   related do the pack objects need to be? e.g. are they packed as they
>   arrive in time order, to build up successive packs of a given size,
>   or are there many packs and you append to the "correct" pack for a
>   given object?
There are no ordering requirements.
> I'm normally distinctly in the camp that object storage systems should
> natively expose all objects, but that also doesn't account for your
> scale.
> I see your discussion at https://forge.softwareheritage.org/T3054#58977
> as well, about the "full scale out" vs "scale up metadata & scale out
> data" approaches.
> To brainstorm parts of an idea, I'm wondering about Git's
> still-in-development partial clone work,
[snip] I did not know about "partial clone" and will explore this in
https://forge.softwareheritage.org/T3065 . Although it is probably not a
good fit for a 2021 solution, it sounds like a great source of
inspiration.
> Thinking back to older systems, like SGI's hierarchical storage modules
> for XFS, the packing overhead starts to become significant for your
> objects: some of the underlying mechanisms in the XFS HSM DMAPI, if
> they ended up packing immutable objects to tape, still had tar &
> tar-like headers (at least 512 bytes per object), so your 10B objects
> would take at least 4TB of extra space (before compression).
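That order of magnitude checks out (counting in decimal TB):

objects     = 10e9   # 10 billion immutable objects
header_size = 512    # minimal tar or tar-like header per object
print(f"{objects * header_size / 1e12:.2f} TB of headers")   # ~5 TB before compression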
I'm tempted to overlook lessons from the past, in part because I'm
afraid I'll lose myself :-) and in part because I assume the world has
changed a lot since then. If, however, you think (have a hunch) that it
might be useful, I'll give it a try.
Thanks for the great feedback!
Loïc Dachary, Artisan Logiciel Libre