On 04/02/2021 12:08, Lionel Bouton wrote:
On 04/02/2021 08:41, Loïc Dachary wrote:
On 04/02/2021 05:51, Federico Lucifredi wrote:
I am intrigued, but I am missing something: why not use RGW and store
the source code files as objects? RGW has native compression and can
take care of that behind the scenes.
Is the desire to use RBD only due to minimum allocation size?
I *assume* that since RGW does have
(if I understand correctly, I assume you are missing a "not" here)
specific strategies to take advantage of the fact that objects are
immutable and will never be removed:
* It will be slower to add artifacts in RGW than in an RBD image + index
* The metadata in RGW will be larger than with an RBD image + index
However I have not verified this and if you have an opinion I'd love to hear it :-)
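To make the comparison concrete, here is a minimal sketch of the kind
of "RBD image + index" packing I have in mind; the SQLite index, file
paths and helper names below are illustrative only, not the actual
design:

    import hashlib
    import sqlite3

    # Illustrative only: any file can stand in for the mapped RBD image,
    # and a local SQLite file stands in for the index.
    IMAGE_PATH = "/dev/rbd0"       # mapped RBD image
    INDEX_PATH = "artifacts.idx"

    index = sqlite3.connect(INDEX_PATH)
    index.execute("CREATE TABLE IF NOT EXISTS artifacts"
                  " (sha1 TEXT PRIMARY KEY, offset INTEGER, size INTEGER)")
    index.execute("CREATE TABLE IF NOT EXISTS cursor (next_offset INTEGER)")
    if index.execute("SELECT count(*) FROM cursor").fetchone()[0] == 0:
        index.execute("INSERT INTO cursor VALUES (0)")

    def store(image, data):
        """Append one immutable artifact to the packed image, then index it."""
        key = hashlib.sha1(data).hexdigest()
        offset = index.execute("SELECT next_offset FROM cursor").fetchone()[0]
        image.seek(offset)
        image.write(data)
        image.flush()
        # update the index only after the data has been written
        index.execute("UPDATE cursor SET next_offset = ?",
                      (offset + len(data),))
        index.execute("INSERT OR IGNORE INTO artifacts VALUES (?, ?, ?)",
                      (key, offset, len(data)))
        index.commit()
        return key

    def fetch(image, key):
        offset, size = index.execute(
            "SELECT offset, size FROM artifacts WHERE sha1 = ?",
            (key,)).fetchone()
        image.seek(offset)
        return image.read(size)

    with open(IMAGE_PATH, "r+b") as image:
        key = store(image, b"example artifact content")
        assert fetch(image, key) == b"example artifact content"

Artifacts are only ever appended, so nothing is ever overwritten and
reads need no coordination.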
Reading the exchanges, I believe you are focused on reading speed and
space efficiency. Did you consider the writing speed with such a scheme?
The goal is to achieve a 100MB/s write speed.
Depending on how you store the index, you could block on each write and
would have to consider Ceph latency (i.e. if your writer fails, recovery
can be tricky unless you wait for each write to complete before updating
your index).
With your 100TB target and 3kB artifact size, a 1ms latency with blocking
writes translates to a whole year spent writing. If you manage to get to
a 0.1ms latency (not sure if this is achievable with Ceph yet) you end up
with a month. Depending on how you plan to populate the store this could
be a problem. You'll have to consider whether this limit on the artifact
write rate can become a bottleneck during normal use too.
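For reference, the back-of-the-envelope arithmetic behind those figures
looks roughly like this (assuming strictly serial, blocking writes of
one artifact at a time):

    # Back-of-the-envelope check of the figures above, assuming strictly
    # serial, blocking writes of one 3kB artifact at a time.
    target_bytes   = 100e12                     # 100TB
    artifact_bytes = 3e3                        # 3kB
    artifacts = target_bytes / artifact_bytes   # ~3.3e10 artifacts

    for latency in (1e-3, 1e-4):                # 1ms and 0.1ms per write
        days = artifacts * latency / 86400
        print(f"{latency * 1000:.1f}ms per write -> {days:,.0f} days")
    # 1.0ms per write -> 386 days  (about a year)
    # 0.1ms per write -> 39 days   (about a month)

    # The 100MB/s goal mentioned above means ~33,000 artifacts per second,
    # so writes have to be batched or heavily parallelised either way.
    print(f"{100e6 / artifact_bytes:,.0f} artifacts/s needed for 100MB/s")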
You can probably design a scheme supporting storing multiple values in
a single write, but it seems to add complexity which might bring
unwanted performance problems and space overhead of its own.
I'm not familiar with space efficiency on modern Ceph versions (still
using filestore on Hammer...). Do you have a ballpark estimate of the
cost of storing artifacts as simple objects? Unless you have already
worked out the whole design, that would be my first concern: it could
end up being an inefficiency worth accepting in exchange for simplicity.
I did not measure the overhead and I'm assuming it is significant
enough to justify RGW-implemented packing.
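To give a rough sense of the overhead in question, here is a
back-of-the-envelope sketch; the 64kB figure is bluestore's historical
min_alloc_size default for HDDs (recent releases lower it to 4kB), and
per-object metadata would come on top of this:

    # Rough space amplification if every 3kB artifact becomes its own
    # RADOS object and is padded up to bluestore's min_alloc_size.
    artifact_bytes = 3e3
    artifacts = 100e12 / artifact_bytes     # ~3.3e10 artifacts for 100TB

    for min_alloc in (64 * 1024, 4 * 1024):
        raw = artifacts * max(artifact_bytes, min_alloc)
        print(f"min_alloc_size {min_alloc // 1024}kB -> "
              f"{raw / 1e15:.2f}PB before replication "
              f"({raw / 100e12:.1f}x amplification)")
    # min_alloc_size 64kB -> 2.18PB before replication (21.8x amplification)
    # min_alloc_size 4kB -> 0.14PB before replication (1.4x amplification)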
I'm unfamiliar with the gateway and how well and easily it can scale,
so my first impulse was to bypass RGW and use the librados interface.
Using librados directly would work, but the caller would have to
implement packing in the same way RBD or RGW does. It is a lot of work
to do that.
You can definitely begin with an RGW solution as it is easier to
implement, and switch to librados later if RGW ever becomes a
bottleneck. If you need speed for either writing or reading, both RGW
and librados would work: you can have as many clients managing objects
in parallel as you want, without any write locks to manage on your end.
This is a very simple storage design and simplicity can't be overrated :-)
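For illustration, storing artifacts as plain RADOS objects with the
Python rados bindings is only a few lines; the pool name and ceph.conf
path below are placeholders. Because object names are content-derived,
any number of writers can run in parallel without coordination:

    import hashlib
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("artifacts")   # placeholder pool name

    def store(data):
        # Content-addressed name: concurrent writers never conflict and
        # retries are idempotent, so no client-side locking is needed.
        name = hashlib.sha1(data).hexdigest()
        ioctx.write_full(name, data)
        return name

    def fetch(name):
        return ioctx.read(name, 64 * 1024)    # artifacts are well below 64kB

    key = store(b"example artifact content")
    assert fetch(key) == b"example artifact content"

    ioctx.close()
    cluster.shutdown()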
The only potential downside (in addition to space inefficiency) that I
can see would be walking the list of objects. This is doable, but with
billions of them it could be very slow. Not sure whether you would ever
need that given your use case, though.
I'll research more and try to figure out a way to compare write/read
speeds in both cases.
For reference, I just found the results of a test with a comparable
test set. I didn't finish reading it yet but the volume seems
comparable to your use case, although with 64kB objects.
That's a significant difference, but the benchmark results are still
interesting.
Note: I've seen questions about 100TB RBDs in the thread. We use such
beasts in two clusters: they work fine but are a pain when deleting or
downsizing them. During one downsize on the slowest cluster we had to
pause the operation manually (SIGSTOP to the rbd process) during periods
of high load and let it continue afterwards. This took about a week (but
the cluster was admittedly underpowered for its use at the time).
Interesting! In this use case having a single RBD image does not seem
to be a good idea. Ceph is designed to scale out, but RBD images are
not designed to grow indefinitely. Having multiple 1TB images sounds
like a sane tradeoff.
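A minimal sketch of what the routing could look like, assuming each
artifact is assigned to one of N smaller images by a prefix of its
content hash (the shard count, pool and image names below are made up):

    import hashlib

    # Hypothetical routing of artifacts across many smaller RBD images
    # instead of one huge image: the first byte of the hash picks the shard.
    SHARDS = 128    # e.g. 128 ~1TB images instead of a single 100TB+ image

    def shard_for(data):
        key = hashlib.sha1(data).hexdigest()
        # /dev/rbd/<pool>/<image> is the symlink udev creates when mapping
        return f"/dev/rbd/artifacts/pack-{int(key[:2], 16) % SHARDS:03d}"

    print(shard_for(b"example artifact content"))

Each shard would keep its own index, so images can be created, filled
and resized independently instead of growing one image forever.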
Thanks for taking the time to think about this use case :-)
Loïc Dachary, Artisan Logiciel Libre