On 02/02/2021 20:34, Gregory Farnum wrote:
> Packing's obviously a good idea for storing these kinds of artifacts
> in Ceph, and hacking through the existing librbd might indeed be
> easier than building something up from raw RADOS, especially if you
> want to use stuff like rbd-mirror.
> My main concern would just be as Dan points out, that we don't test
> rbd with extremely large images and we know deleting that image will
> take a looooong time. I don't know of other issues off the top of my
> head, and in the worst case you could always fall back to manipulating
> it with raw librados if there is an issue.
Right. Dan's comment gave me pause: it does not seem to be a good idea to
assume an RBD image of an infinite size. A friend who read this thread
suggested a sensible approach (which also is in line with the Haystack
paper): instead of making a single gigantic image, make multiple 1TB
images. The index is bigger:
    SHA256 sum of the artifact => name/uuid of the 1TB image,offset,size
instead of:
    SHA256 sum of the artifact => offset,size
But each image still provides packing for over 100 million artifacts when the
average artifact size is around 3KB. It also allows:
* multiple writers (one for each image),
* rbd-mirroring individual 1TB images to a different Ceph cluster (challenging with a
single 100TB+ image),
* copying a 1TB image from a pool with a given erasure code profile to another pool with a
* growing from 1TB to 2TB in the future by merging two 1TB images,
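The layout above (fixed-size images plus an index mapping SHA256 to image, offset and size) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Image` and `Packer` names are hypothetical, in-memory byte buffers stand in for the 1TB RBD images, no librbd call is made, and skipping artifacts that are already indexed is my own assumption rather than something stated in the thread.

```python
import hashlib
from dataclasses import dataclass, field

IMAGE_CAPACITY = 1 << 40  # 1 TiB per image, following the 1TB suggestion


@dataclass
class Image:
    """Hypothetical stand-in for one RBD image: an append-only buffer."""
    name: str
    data: bytearray = field(default_factory=bytearray)


class Packer:
    """Appends immutable artifacts to fixed-capacity images and keeps an
    index: SHA256 hex digest -> (image name, offset, size)."""

    def __init__(self, capacity=IMAGE_CAPACITY):
        self.capacity = capacity
        self.images = {}     # image name -> Image
        self.index = {}      # sha256 hex -> (image name, offset, size)
        self.current = None  # image currently being appended to

    def _new_image(self):
        # Hypothetical naming scheme; the thread only says "name/uuid".
        name = f"pack-{len(self.images):06d}"
        self.current = Image(name)
        self.images[name] = self.current

    def add(self, artifact: bytes) -> str:
        key = hashlib.sha256(artifact).hexdigest()
        if key in self.index:
            # Assumption: artifacts are content-addressed, store them once.
            return key
        if len(artifact) > self.capacity:
            raise ValueError("artifact larger than one image")
        # Start a new image when the current one cannot hold the artifact.
        if (self.current is None
                or len(self.current.data) + len(artifact) > self.capacity):
            self._new_image()
        offset = len(self.current.data)
        self.current.data += artifact
        self.index[key] = (self.current.name, offset, len(artifact))
        return key

    def get(self, key: str) -> bytes:
        name, offset, size = self.index[key]
        return bytes(self.images[name].data[offset:offset + size])


# Tiny capacity so the roll-over to a second image is visible.
packer = Packer(capacity=10)
k1 = packer.add(b"hello")
k2 = packer.add(b"world!")  # does not fit after "hello", goes to a new image
k3 = packer.add(b"abc")     # fits after "world!" in the second image
```

With one writer per image, each writer would own its own `current` image and only the index would be shared, which is what makes multiple writers workable in this layout.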
> But you might also check in on the status of Danny
> email project. Email and these artifacts seemingly have a lot in
They do. This is inspiring:
Thanks for the pointer.
On Mon, Feb 1, 2021 at 12:52 PM Loïc Dachary <loic(a)dachary.org> wrote:
> Hi Dan,
> On 01/02/2021 21:13, Dan van der Ster wrote:
>> Hi Loïc,
>> We've never managed 100TB+ in a single RBD volume. I can't think of
>> anything, but perhaps there are some unknown limitations when they get so
>> It should be easy enough to use rbd bench to create and fill a massive test
>> image to validate everything works well at that size.
> Good idea! I'll look for a cluster with 100TB of free space and post my
>> Also, I assume you'll be doing the IO from just one client? Multiple
>> readers/writers to a single volume could get complicated.
>> Otherwise, yes RBD sounds very convenient for what you need.
> It is inspired by https://static.usenix.org/event/osdi10/tech/full_papers/Beaver.pdf
> which suggests an ad-hoc implementation to pack immutable objects together. But I think
> RBD already provides the underlying logic, even though it is not specialized for this use
> case. RGW also packs small objects together and would be a good candidate. But it provides
> more flexibility to modify/delete objects and I assume it will be slower to write N
> objects with RGW than to write them sequentially on an RBD image. But I did not try and
> maybe I should.
> To be continued.
>> Cheers, Dan
>> On Sat, Jan 30, 2021, 4:01 PM Loïc Dachary <loic(a)dachary.org> wrote:
>>> In the context of Software Heritage (a noble mission to preserve all source
>>> code), artifacts have an average size of ~3KB and there are billions of
>>> them. They never change and are never deleted. To save space it would make
>>> sense to write them, one after the other, in an ever growing RBD volume
>>> (more than 100TB). An index, located somewhere else, would record the
>>> offset and size of the artifacts in the volume.
>>> I wonder if someone already implemented this idea with success? And if
>>> not... does anyone see a reason why it would be a bad idea?
>>>  https://docs.softwareheritage.org/
>>> Loïc Dachary, Artisan Logiciel Libre
>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
Loïc Dachary, Artisan Logiciel Libre