Hello Loic!
We have developed a strategy for Pacific: reducing the default
min_alloc_size for HDDs to 4KB.
Igor Fedotov did a lot of investigation and benchmarking, and came up
with improvements to BlueStore [1][2] that let this change have
little performance impact (it even increases performance in many cases).
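
For anyone who wants to try this before Pacific, a rough sketch of
the override in ceph.conf (note: bluestore_min_alloc_size is baked in
when an OSD is first created, so existing OSDs must be redeployed for
a new value to take effect):

    [osd]
    # Applied only at OSD creation time; existing OSDs keep the
    # value they were built with.
    bluestore_min_alloc_size_hdd = 4096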
Josh
Bonjour,
Karan's blog post from last year about benchmarking the insertion of
billions of objects into Ceph via S3 / RGW[0] reads:
  we decided to lower bluestore_min_alloc_size_hdd to 18KB and re-test.
  As represented in chart-5, the object creation rate was found to be
  notably reduced after lowering the bluestore_min_alloc_size_hdd
  parameter from 64KB (default) to 18KB. As such, for objects larger
  than bluestore_min_alloc_size_hdd, the default value seems to be
  optimal; smaller objects require further investigation if you intend
  to reduce the bluestore_min_alloc_size_hdd parameter.
There is also a mail thread from 2018 on this topic, with the same
conclusion, although using RADOS directly rather than RGW[3]. I read the
RGW data layout page in the documentation[1] and concluded that by default
every object inserted via S3 / RGW will indeed use at least 64KB on an
HDD. A pull request from last year[2] seems to confirm this, and also
suggests that modifying bluestore_min_alloc_size_hdd has adverse side effects.
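
To make the overhead concrete, here is a back-of-the-envelope sketch
(plain Python, with made-up example sizes) of the space a small object
consumes at a given min_alloc_size. It only models per-object rounding
and ignores replication, erasure coding, and RGW's head/tail layout:

    import math

    def allocated_size(object_size: int, min_alloc_size: int) -> int:
        """Space consumed on disk: the object is rounded up to a
        whole number of allocation units."""
        units = math.ceil(object_size / min_alloc_size)
        return units * min_alloc_size

    # Hypothetical object sizes, compared at the 64KB default and at 4KB.
    for size in (1024, 4096, 20 * 1024, 100 * 1024):
        for min_alloc in (64 * 1024, 4096):
            used = allocated_size(size, min_alloc)
            print(f"object {size:>7} B, min_alloc {min_alloc:>6} B "
                  f"-> {used:>7} B on disk ({used / size:.1f}x)")

For instance, a 1KB object occupies 64KB at the default (64x space
amplification) but only 4KB with a 4KB min_alloc_size.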
That being said, I'm curious to know whether people have developed
strategies to cope with this overhead. Someone mentioned packing objects
together client-side to make them larger, but maybe there are simpler
ways to achieve the same result?
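
For what it's worth, a minimal sketch of what such client-side packing
could look like (everything here, including the container format and
the 4MB target size, is hypothetical, just to illustrate the idea):

    import json
    from typing import Dict, List, Tuple

    TARGET_SIZE = 4 * 1024 * 1024  # well above min_alloc_size

    def pack(blobs: Dict[str, bytes]):
        """Concatenate small blobs into ~TARGET_SIZE containers.

        Returns the containers and an index mapping each key to
        (container number, offset, length), so a blob can later be
        fetched with a single ranged GET on its container."""
        containers: List[bytes] = []
        index: Dict[str, Tuple[int, int, int]] = {}
        current: List[bytes] = []
        current_size = 0
        for key, data in blobs.items():
            if current and current_size + len(data) > TARGET_SIZE:
                containers.append(b"".join(current))
                current, current_size = [], 0
            index[key] = (len(containers), current_size, len(data))
            current.append(data)
            current_size += len(data)
        if current:
            containers.append(b"".join(current))
        return containers, index

    # Usage: upload each container as one S3 object and store the
    # index (e.g. as JSON) alongside; reads become ranged GETs.
    blobs = {f"blob-{i}": bytes(1024) for i in range(10)}
    containers, index = pack(blobs)
    print(len(containers), json.dumps(index["blob-3"]))

The obvious trade-off is that deletes and overwrites then require
rewriting or compacting containers, which is why I wonder whether
simpler approaches exist.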
Cheers
[0] https://www.redhat.com/en/blog/scaling-ceph-billion-objects-and-beyond
[1] https://docs.ceph.com/en/latest/radosgw/layout/
[2] https://github.com/ceph/ceph/pull/32809
[3] https://www.spinics.net/lists/ceph-users/msg45755.html