Re: Using RBD to pack billions of small files

List overview All Threads
Download

newer

older

log_meta log_data was turned off...

NFS version 4.0

Loïc Dachary

4 Feb 2021 4 Feb '21

11:11 a.m.

Hi Frederico, On 04/02/2021 05:51, Federico Lucifredi wrote:

...

Hi Loïc, I am intrigued, but am missing something: why not using RGW, and store the source code files as objects? RGW has native compression and can take care of that behind the scenes.

Excellent question!

...

Is the desire to use RBD only due to minimum allocation sizes?

I *assume* that since RGW does have specific strategies to take advantage of the fact that objects are immutable and will never be removed: * It will be slower to add artifacts in RGW than in an RBD image + index * The metadata in RGW will be larger than an RBD image + index However I have not verified this and if you have an opinion I'd love to hear it :-) Cheers

...

Best -F -- "'Problem' is a bleak word for challenge" - Richard Fish _________________________________________ Federico Lucifredi Product Management Director, Ceph Storage Platform Red Hat A273 4F57 58C0 7FE8 838D 4F87 AEEB EC18 4A73 88AC redhat.com <http://redhat.com> TRIED. TESTED. TRUSTED. On Sat, Jan 30, 2021 at 10:01 AM Loïc Dachary <loic(a)dachary.org <mailto:loic@dachary.org>> wrote: Bonjour, In the context Software Heritage (a noble mission to preserve all source code)[0], artifacts have an average size of ~3KB and there are billions of them. They never change and are never deleted. To save space it would make sense to write them, one after the other, in an every growing RBD volume (more than 100TB). An index, located somewhere else, would record the offset and size of the artifacts in the volume. I wonder if someone already implemented this idea with success? And if not... does anyone see a reason why it would be a bad idea? Cheers [0] https://docs.softwareheritage.org/ <https://docs.softwareheritage.org/> -- Loïc Dachary, Artisan Logiciel Libre _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io <mailto:ceph-users@ceph.io> To unsubscribe send an email to ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io>

-- Loïc Dachary, Artisan Logiciel Libre

Attachments:

OpenPGP_signature.sig (application/pgp-signature — 840 bytes)

Show replies by date

Lionel Bouton

4 Feb 4 Feb

2:38 p.m.

New subject: Using RBD to pack billions of small files

Hi, Le 04/02/2021 à 08:41, Loïc Dachary a écrit :

...

Hi Frederico, On 04/02/2021 05:51, Federico Lucifredi wrote:

Hi Loïc, I am intrigued, but am missing something: why not using RGW, and store the source code files as objects? RGW has native compression and can take care of that behind the scenes.

Excellent question!

Is the desire to use RBD only due to minimum allocation sizes?

I *assume* that since RGW does have

If I understand correctly I assume that you are missing a "not" here.

...

specific strategies to take advantage of the fact that objects are immutable and will never be removed: * It will be slower to add artifacts in RGW than in an RBD image + index * The metadata in RGW will be larger than an RBD image + index However I have not verified this and if you have an opinion I'd love to hear it :-)

Reading the exchanges I believe you are focused on the reading speed and space efficiency. Did you consider the writing speed with such a scheme ? Depending on how you store the index, you could block on each write and would have to consider Ceph latency (ie: if your writer fails recovering can be tricky without having waited for writes to update your index). With your 100TB target and 3kb artifact size a 1ms latency and blocking writes translate to a whole year spent writing. If you manage to get to a 0.1ms latency (not sure if this is achievable with Ceph yet) you end with a month. Depending on how you plan to populate the store this could be a problem. You'll have to consider if the artifact writing rate limit can become a bottleneck during normal use too. You can probably design a scheme supporting storing multiple values in a single write but it seems to add complexity which might come with unwanted performance problems and space use itself. I'm not familiar with space efficiency on modern Ceph versions (still using filestore on Hammer...), do you have a ballpark estimation of the costs of storing artifacts as simple objects ? Unless you already worked out the whole design that would be my first concern : it could end up being an inefficiency worth the trade-off for simplicity. I'm unfamiliar with the gateway and how well and easily it can scale so my first impulse was to bypass RGW to use the librados interface directly. You can definitely begin with a RGW solution as it is a bit easier to implement and switch to librados later if RGW ever becomes a bottleneck. If you need speed either writing or reading, both RGW and librados would work : you can have as many clients managing objects in parallel without any lock on writes on your end to manage. This is a very simple storage design and simplicity can't be overrated :-) The only potential downside (in addition to space inefficiency) that I can see would be walking the list of objects. This is doable but with billions of them this could be very slow. Not sure if it could become a need given your use case though. For reference, I just found the results of a test with a moderately comparable test set : https://www.redhat.com/en/blog/scaling-ceph-billion-objects-and-beyond. I didn't finish reading it yet but the volume seems comparable to your use case although with 64kB objects. Note : I've seen questions about 100TB RBDs in the thread. We use such beasts in two clusters : they work fine but are a pain when deleting or downsizing them. During one downsize on the slowest cluster we had to pause the operation manually (SIGSTOP to the rbd process) during periods of high loads and let it continue after. This took about a week (but the cluster was admittedly underpowered for its use at the time). Best regards, -- Lionel Bouton

Loïc Dachary

4:06 p.m.

New subject: Using RBD to pack billions of small files

On 04/02/2021 12:08, Lionel Bouton wrote:

...

Hi, Le 04/02/2021 à 08:41, Loïc Dachary a écrit :

Hi Frederico, On 04/02/2021 05:51, Federico Lucifredi wrote:

Hi Loïc, I am intrigued, but am missing something: why not using RGW, and store the source code files as objects? RGW has native compression and can take care of that behind the scenes.

Excellent question!

Is the desire to use RBD only due to minimum allocation sizes?

I *assume* that since RGW does have

If I understand correctly I assume that you are missing a "not" here.

Yes :-)

...

Reading the exchanges I believe you are focused on the reading speed and space efficiency. Did you consider the writing speed with such a scheme ?

Yes and the goal is to achieve 100MB/s write speed.

...

Depending on how you store the index, you could block on each write and would have to consider Ceph latency (ie: if your writer fails recovering can be tricky without having waited for writes to update your index). With your 100TB target and 3kb artifact size a 1ms latency and blocking writes translate to a whole year spent writing. If you manage to get to a 0.1ms latency (not sure if this is achievable with Ceph yet) you end with a month. Depending on how you plan to populate the store this could be a problem. You'll have to consider if the artifact writing rate limit can become a bottleneck during normal use too.

...

You can probably design a scheme supporting storing multiple values in a single write but it seems to add complexity which might come with unwanted performance problems and space use itself. I'm not familiar with space efficiency on modern Ceph versions (still using filestore on Hammer...), do you have a ballpark estimation of the costs of storing artifacts as simple objects ? Unless you already worked out the whole design that would be my first concern : it could end up being an inefficiency worth the trade-off for simplicity.

I did not measure the overhead and I'm assuming it is significant enough to justify RGW implemented packing.

...

I'm unfamiliar with the gateway and how well and easily it can scale so my first impulse was to bypass RGW to use the librados interface directly.

Using librados directly would work but the caller would have to implement packing in the same way RBD or RGW does. It is a lot of work to do that properly.

...

You can definitely begin with a RGW solution as it is a bit easier to implement and switch to librados later if RGW ever becomes a bottleneck. If you need speed either writing or reading, both RGW and librados would work : you can have as many clients managing objects in parallel without any lock on writes on your end to manage. This is a very simple storage design and simplicity can't be overrated :-) The only potential downside (in addition to space inefficiency) that I can see would be walking the list of objects. This is doable but with billions of them this could be very slow. Not sure if it could become a need given your use case though.

I'll research more and try to figure out a way to compare write/read speed in both cases.

...

For reference, I just found the results of a test with a moderately comparable test set : https://www.redhat.com/en/blog/scaling-ceph-billion-objects-and-beyond. I didn't finish reading it yet but the volume seems comparable to your use case although with 64kB objects.

That's a significant difference but the benchmarks results are still very useful.

...

Note : I've seen questions about 100TB RBDs in the thread. We use such beasts in two clusters : they work fine but are a pain when deleting or downsizing them. During one downsize on the slowest cluster we had to pause the operation manually (SIGSTOP to the rbd process) during periods of high loads and let it continue after. This took about a week (but the cluster was admittedly underpowered for its use at the time).

Intersting ! In this use case having a single RBD image does not seem to be a good idea. Ceph is designed to scale out. But RBD images are not designed to grow indefinitely. Having multiple 1TB images sounds like a sane tradeoff.

...

Best regards,

Thanks for taking the time to think about this use case :-) Cheers

...

-- Lionel Bouton _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

-- Loïc Dachary, Artisan Logiciel Libre

Matthew Vernon

6:46 p.m.

New subject: Using RBD to pack billions of small files

Hi, On 04/02/2021 07:41, Loïc Dachary wrote:

...

On 04/02/2021 05:51, Federico Lucifredi wrote:

Hi Loïc, I am intrigued, but am missing something: why not using RGW, and store the source code files as objects? RGW has native compression and can take care of that behind the scenes.

Excellent question!

Is the desire to use RBD only due to minimum allocation sizes?

RGW addition is pretty quick up to fairly large buckets; and if you're not expecting to want to list the bucket contents often, then RGW might well be a good option for your object store with small files. Or at least, using some of the RGW code (I think there's a librgw) to re-use a bunch of its code for your use case; this feels more natural to me than using RBD for this. Regards, Matthew [pleased software heritage are still looking at Ceph :) ] -- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

1178

days inactive

1178

days old

ceph-users@ceph.io

Manage subscription

3 comments

3 participants

tags (0)

participants (3)

Lionel Bouton
Loïc Dachary
Matthew Vernon