Using RBD to pack billions of small files


Loïc Dachary

30 Jan 2021

4:01 p.m.

Hello,

In the context of Software Heritage (a noble mission to preserve all source code)[0], artifacts have an average size of ~3KB and there are billions of them. They never change and are never deleted. To save space it would make sense to write them, one after the other, in an ever-growing RBD volume (more than 100TB). An index, located somewhere else, would record the offset and size of each artifact in the volume.

I wonder if someone has already implemented this idea with success? And if not... does anyone see a reason why it would be a bad idea?

Cheers

[0] https://docs.softwareheritage.org/

-- Loïc Dachary, Artisan Logiciel Libre
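The packing idea in this message can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `PackedVolume` is a hypothetical helper, an in-memory `io.BytesIO` stands in for the RBD volume, and the index is a plain dictionary rather than a separate index service. The SHA256 key follows the indexing scheme mentioned later in the thread.

```python
import hashlib
import io


class PackedVolume:
    """Append-only pack of immutable blobs plus an external index (hypothetical helper)."""

    def __init__(self, volume):
        self.volume = volume                    # any seekable binary file object
        self.index = {}                         # SHA256 hex digest -> (offset, size)
        self.end = volume.seek(0, io.SEEK_END)  # position of the next write

    def add(self, blob: bytes) -> str:
        """Append a blob and record where it landed; return its SHA256 key."""
        key = hashlib.sha256(blob).hexdigest()
        if key in self.index:                   # same content is stored only once
            return key
        self.volume.seek(self.end)
        self.volume.write(blob)
        self.index[key] = (self.end, len(blob))
        self.end += len(blob)
        return key

    def get(self, key: str) -> bytes:
        """Read a blob back using only the (offset, size) pair from the index."""
        offset, size = self.index[key]
        self.volume.seek(offset)
        return self.volume.read(size)


pack = PackedVolume(io.BytesIO())               # BytesIO stands in for the RBD volume
key = pack.add(b"int main(void) { return 0; }\n")
assert pack.get(key) == b"int main(void) { return 0; }\n"
```

Because artifacts never change and are never deleted, the sketch needs no free-space management: the only mutable state is the end-of-data offset and the index.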


Alex Gorbachev

1 Feb 2021

3:27 a.m.

Dear Loïc,

I do not have direct experience with this many files, but it resonates for me with deduplication, such as borg (https://www.borgbackup.org/) or a similar implementation in the latest Proxmox Backup Server (https://pbs.proxmox.com/wiki/index.php/Main_Page). I think you would need a filesystem for either, so I am not sure how well this would integrate directly with RBD, but maybe cephfs is an option? I typically run zfs on top of rbd, use only zfs compression, and then put borg on top of zfs. There is overhead, but this is a very flexible setup, operationally.

All the best in your endeavor!

-- Alex Gorbachev ISS/Storcium

Loïc Dachary

8:43 a.m.

Hi Alex,

Using borg would indeed make sense to replicate the rbd content in case rbd-mirror is not an option, nice idea :-)

Interestingly there is no need for a proper file system: the files are immutable and never deleted. They are indexed by the SHA256 of their content, and a map where the key is the SHA256 and the value is the (offset, size) pair in the rbd image would be enough.

Cheers

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io


Alex Gorbachev

8:18 p.m.

Hi Loïc,

Does not borg need a file system to write its files to?

We do replicate the chunks incrementally with rsync, and that is a very nice and, importantly, idempotent way to sync up data to a second site.

-- Alex Gorbachev ISS/Storcium

Loïc Dachary

8:49 p.m.

On 01/02/2021 20:18, Alex Gorbachev wrote:

> Hi Loïc, Does not borg need a file system to write its files to?

That's also my understanding.

Martin Verges

8:36 a.m.

Hello,

Source code should be compressible, so maybe just create something like a tar.gz per repo? That way you would get much bigger objects, which could improve speed and make the data easier to store on any storage system.

-- Martin Verges, croit GmbH

Loïc Dachary

8:52 a.m.

Hi Martin,

I should have been more specific about what "artifacts" are in the context of Software Heritage, sorry about that. You can read more in the architecture document if you're interested[0]. From my point of view, the problem is to store the artifacts, handle them as opaque blobs of data, and take advantage of their properties (immutable, never deleted) to keep it simple and save space.

That being said, it is often a good idea to rethink the data structure itself to find a better and more efficient strategy, and keeping a tarball of the repo could be a good choice. But that would be an entirely different project.

Cheers

[0] https://docs.softwareheritage.org/devel/architecture.html

Dan van der Ster

9:13 p.m.

Hi Loïc,

We've never managed 100TB+ in a single RBD volume. I can't think of anything, but perhaps there are some unknown limitations when they get so big. It should be easy enough to use rbd bench to create and fill a massive test image to validate everything works well at that size.

Also, I assume you'll be doing the IO from just one client? Multiple readers/writers to a single volume could get complicated.

Otherwise, yes, RBD sounds very convenient for what you need.

Cheers, Dan

Loïc Dachary

9:51 p.m.

Hi Dan,

On 01/02/2021 21:13, Dan van der Ster wrote:

> It should be easy enough to use rbd bench to create and fill a massive test image to validate everything works well at that size.

Good idea! I'll look for a cluster with 100TB of free space and post my findings.

> Also, I assume you'll be doing the IO from just one client? Multiple readers/writers to a single volume could get complicated.

Yes.

> Otherwise, yes RBD sounds very convenient for what you need.

It is inspired by https://static.usenix.org/event/osdi10/tech/full_papers/Beaver.pdf, which suggests an ad-hoc implementation to pack immutable objects together. But I think RBD already provides the underlying logic, even though it is not specialized for this use case. RGW also packs small objects together and would be a good candidate. But it provides more flexibility to modify/delete objects and I assume it will be slower to write N objects with RGW than to write them sequentially on an RBD image. But I did not try and maybe I should.

To be continued.

Gregory Farnum

2 Feb 2021

8:34 p.m.

Packing's obviously a good idea for storing these kinds of artifacts in Ceph, and hacking through the existing librbd might indeed be easier than building something up from raw RADOS, especially if you want to use stuff like rbd-mirror.

My main concern would just be, as Dan points out, that we don't test rbd with extremely large images and we know deleting that image will take a looooong time. I don't know of other issues off the top of my head, and in the worst case you could always fall back to manipulating it with raw librados if there is an issue.

But you might also check in on the status of Danny Al-Gaaf's rados email project. Email and these artifacts seemingly have a lot in common.

-Greg

Anthony D'Atri

9:10 p.m.

I’d be nervous about a plan to utilize a single volume, growing indefinitely. I would think that, from a blast radius perspective, you’d want to strike a balance between a single monolithic blockchain-style volume and a zillion tiny files. Perhaps a strategy to shard into, say, 10 TB volumes. That size is large enough to hold lots of immutable code yet not so unwieldy that it becomes infeasible to manage.


Loïc Dachary

9:32 p.m.

Hi Greg,

On 02/02/2021 20:34, Gregory Farnum wrote:

> My main concern would just be as Dan points out, that we don't test rbd with extremely large images and we know deleting that image will take a looooong time

Right. Dan's comment gave me pause: it does not seem to be a good idea to assume an RBD image of infinite size. A friend who read this thread suggested a sensible approach (which is also in line with the Haystack paper): instead of making a single gigantic image, make multiple 1TB images. The index is bigger:

SHA256 sum of the artifact => name/uuid of the 1TB image,offset,size

instead of

SHA256 sum of the artifact => offset,size

But each image still provides packing for over 100 million artifacts when the average artifact size is around 3KB. It also allows:

* multiple writers (one for each image),
* rbd-mirroring individual 1TB images to a different Ceph cluster (challenging with a single 100TB+ image),
* copying a 1TB image from a pool with a given erasure code profile to another pool with a different profile,
* growing from 1TB to 2TB in the future by merging two 1TB images,
* etc.
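As a rough illustration of the sharded index and of the capacity estimate, here is a minimal Python sketch. The names `Location`, `ShardedIndex` and the `artifacts-NNNN` image naming are hypothetical, 1TB is taken as 10^12 bytes and 3KB as 3000 bytes, and a single writer fills images one after the other (the multi-writer case would run one such allocator per image).

```python
from typing import NamedTuple


class Location(NamedTuple):
    """Index value: which image holds the artifact, and where."""
    image: str
    offset: int
    size: int


IMAGE_SIZE = 10**12            # 1TB per image (decimal; an assumption)
AVERAGE_ARTIFACT_SIZE = 3000   # ~3KB, the average quoted in the thread

# Capacity estimate: how many average-sized artifacts fit in one image.
artifacts_per_image = IMAGE_SIZE // AVERAGE_ARTIFACT_SIZE
print(artifacts_per_image)     # 333333333


class ShardedIndex:
    """Maps a SHA256 key to (image, offset, size), rolling over to a new image when full."""

    def __init__(self, image_size: int = IMAGE_SIZE):
        self.image_size = image_size
        self.entries = {}      # SHA256 hex digest -> Location
        self.image_number = 0  # image currently being filled
        self.end = 0           # next free offset in that image

    def allocate(self, key: str, size: int) -> Location:
        if key in self.entries:                  # already stored
            return self.entries[key]
        if self.end + size > self.image_size:    # artifact does not fit: start a new image
            self.image_number += 1
            self.end = 0
        location = Location(f"artifacts-{self.image_number:04d}", self.end, size)
        self.entries[key] = location
        self.end += size
        return location
```

The sketch only decides where each artifact goes; writing the bytes to the chosen image is left out, and artifacts larger than one image are not handled.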

> But you might also check in on the status of Danny Al-Gaaf's rados email project. Email and these artifacts seemingly have a lot in common.

They do. This is inspiring:

https://github.com/ceph-dovecot/dovecot-ceph-plugin
https://github.com/ceph-dovecot/dovecot-ceph-plugin/tree/master/src/librmb

Thanks for the pointer.

Cheers

Burkhard Linke

3 Feb 2021

9:07 a.m.

Hi,

On 2/2/21 9:32 PM, Loïc Dachary wrote:

*snipsnap*

Just my 2 cents:

You could use the first byte of the SHA sum to identify the image, e.g. using a fixed number of 256 images. Or some flexible approach similar to the way filestore used to store rados objects.

Regards,
Burkhard
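The fixed 256-image variant can be sketched as follows. This is a minimal illustration: the function name `image_for` and the `artifacts-xx` naming scheme are hypothetical, and SHA256 is assumed as the hash because that is what the thread uses to index artifacts.

```python
import hashlib


def image_for(blob: bytes, prefix: str = "artifacts") -> str:
    """Pick one of 256 images from the first byte of the artifact's SHA256."""
    digest = hashlib.sha256(blob).digest()
    return f"{prefix}-{digest[0]:02x}"   # digest[0] is an int in 0..255


print(image_for(b""))      # artifacts-e3
print(image_for(b"abc"))   # artifacts-ba
```

Since the image name is derived from the key itself, the index no longer needs to store it, and a well-distributed hash spreads artifacts roughly evenly over the 256 images.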

Loïc Dachary

9:41 a.m.


A friend suggested the same to save space. Good idea.

Burkhard Linke

9:54 a.m.

Hi,

On 2/3/21 9:41 AM, Loïc Dachary wrote:

> A friend suggested the same to save space. Good idea.

If you want to further reduce the index size, you can just store the offset, and the first 4? 8? bytes at that offset define the size of the following artifact. That's similar to the way Pascal used to store strings in the good ol' times. You might also want to think about using a complete header which also includes the artifact's name etc. This will allow you to rebuild the index if it becomes corrupted. The storage overhead should be insignificant.

Your index will become a simple mapping of SHA sum -> offset, and you might also be able to use optimized implementations.

Regards,
Burkhard
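A minimal sketch of the length-prefixed layout and of rebuilding the index by scanning, under stated assumptions: the prefix is an 8-byte big-endian integer, the helper names (`append`, `read_at`, `rebuild_index`) are hypothetical, the key is recomputed from the content instead of being stored in a fuller header, and the end of the file-like object marks the end of the data (a fixed-size image would need the end-of-data offset stored somewhere).

```python
import hashlib
import io
import struct

HEADER = struct.Struct(">Q")   # 8-byte big-endian size prefix (an assumption)


def append(volume, blob: bytes) -> int:
    """Write size prefix + blob at the end of the volume; return the record offset."""
    offset = volume.seek(0, io.SEEK_END)
    volume.write(HEADER.pack(len(blob)) + blob)
    return offset


def read_at(volume, offset: int) -> bytes:
    """Read one artifact knowing only its offset: the prefix gives the size."""
    volume.seek(offset)
    (size,) = HEADER.unpack(volume.read(HEADER.size))
    return volume.read(size)


def rebuild_index(volume) -> dict:
    """Scan the volume record by record and rebuild the SHA256 -> offset mapping."""
    index = {}
    end = volume.seek(0, io.SEEK_END)
    offset = 0
    while offset < end:
        blob = read_at(volume, offset)
        index[hashlib.sha256(blob).hexdigest()] = offset
        offset += HEADER.size + len(blob)   # jump to the next record
    return index


volume = io.BytesIO()                       # stands in for the RBD image
first = append(volume, b"hello")
second = append(volume, b"world!!")
assert read_at(volume, second) == b"world!!"
assert rebuild_index(volume)[hashlib.sha256(b"hello").hexdigest()] == first
```

With a ~3KB average artifact, an 8-byte prefix adds well under one percent of storage overhead.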

Matt Wilder

10:02 p.m.

If it were me, I would do something along the lines of:

- Bundle larger blocks of code into pixz (https://github.com/vasi/pixz), essentially indexed tar files allowing random access, and store them in RadosGW.
- Build a small frontend that fetches them (with caching) and provides the file contents via whatever your UI is.

Loïc Dachary

11:23 p.m.

Hi Matt,

I did not know about pixz, thanks for the pointer. The idea it implements is also new to me and it looks like it can usefully be applied to this use case. I'm not going to say "awesome" because I can't grasp how useful it really is right now. But I'll definitely think about it :-)

Cheers

On 03/02/2021 22:02, Matt Wilder wrote:

> If it were me, I would do something along the lines of:
>
> - Bundle larger blocks of code into pixz <https://github.com/vasi/pixz> (essentially indexed tar files, allowing random access) and store them in RadosGW.
> - Build a small frontend that fetches (with caching) them and provides the file contents via whatever your UI is.

