Small RGW objects and RADOS 64KB minimum size


Loïc Dachary

14 Feb 2021

10:51 p.m.

Hello,

Karan's blog post from last year about benchmarking the insertion of billions of objects into Ceph via S3 / RGW [0] reads:

> we decided to lower bluestore_min_alloc_size_hdd to 18KB and re-test. As represented in chart-5, the object creation rate found to be notably reduced after lowering the bluestore_min_alloc_size_hdd parameter from 64KB (default) to 18KB. As such, for objects larger than the bluestore_min_alloc_size_hdd, the default values seems to be optimal, smaller objects further require more investigation if you intended to reduce bluestore_min_alloc_size_hdd parameter.

There is also a mail thread from 2018 on this topic that reaches the same conclusion, although it uses RADOS directly rather than RGW [3]. I read the RGW data layout page in the documentation [1] and concluded that, by default, every object inserted via S3 / RGW will indeed use at least 64KB. A pull request from last year [2] seems to confirm this and also suggests that modifying bluestore_min_alloc_size_hdd has adverse side effects.

That being said, I'm curious to know whether people have developed strategies to cope with this overhead. Someone mentioned packing objects together on the client side to make them larger. But maybe there are simpler ways to do the same?

Cheers

[0] https://www.redhat.com/en/blog/scaling-ceph-billion-objects-and-beyond
[1] https://docs.ceph.com/en/latest/radosgw/layout/
[2] https://github.com/ceph/ceph/pull/32809
[3] https://www.spinics.net/lists/ceph-users/msg45755.html

-- Loïc Dachary, Artisan Logiciel Libre
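The client-side packing idea mentioned above could look roughly like the following. This is only a minimal sketch, not something taken from the thread or from Ceph: the function names `pack_objects`, `read_object` and `range_header` are hypothetical, and it assumes the offset index is stored somewhere the client can find it again (for example in a separate object or database).

```python
def pack_objects(objects):
    """Concatenate small payloads into one blob.

    Returns (blob, index), where index maps each name to its
    (offset, length) inside the blob.
    """
    blob = bytearray()
    index = {}
    for name, payload in objects.items():
        index[name] = (len(blob), len(payload))
        blob += payload
    return bytes(blob), index


def read_object(blob, index, name):
    """Extract one member from a blob that is already in memory."""
    offset, length = index[name]
    return blob[offset:offset + length]


def range_header(index, name):
    """Build an HTTP Range header value selecting one member of the blob."""
    offset, length = index[name]
    if length == 0:
        raise ValueError("empty member has no byte range")
    # HTTP byte ranges are inclusive on both ends.
    return f"bytes={offset}-{offset + length - 1}"
```

The packed blob would then be uploaded as a single S3 object, and an individual member fetched with a ranged GET using the value from `range_header`. The trade-off is that deleting or rewriting one member means rewriting the whole blob, or leaving dead space in it.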


Josh Durgin

16 Feb 2021

5:13 a.m.

Hello Loïc!

We have developed a strategy in Pacific: reducing the min_alloc_size for HDD to 4KB by default.

Igor Fedotov did a lot of investigation and benchmarking, and came up with some improvements to BlueStore [1][2] so that this change has little performance impact (it even increases performance in many cases).

Josh

[0] https://github.com/ceph/ceph/pull/34588
[1] https://github.com/ceph/ceph/pull/33434
[2] https://github.com/ceph/ceph/pull/33365
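To get a feel for what moving from 64KB to 4KB allocation units means for small objects, here is a rough back-of-the-envelope sketch. It assumes a deliberately simplified model in which each object's data is rounded up to a whole number of allocation units; it ignores replication, erasure coding, compression and metadata, and the helper name `allocated_bytes` is made up for illustration.

```python
KIB = 1024


def allocated_bytes(object_size, min_alloc_size):
    """Round object_size up to a whole number of allocation units."""
    if object_size == 0:
        return 0
    units = -(-object_size // min_alloc_size)  # ceiling division
    return units * min_alloc_size


if __name__ == "__main__":
    for size in (1 * KIB, 4 * KIB, 10 * KIB, 100 * KIB):
        old = allocated_bytes(size, 64 * KIB)
        new = allocated_bytes(size, 4 * KIB)
        print(f"{size // KIB} KiB object: "
              f"{old // KIB} KiB with 64 KiB units, "
              f"{new // KIB} KiB with 4 KiB units")
```

Under this model a 10 KiB object occupies 64 KiB with the old default and 12 KiB with the new one, while a 100 KiB object goes from 128 KiB to 100 KiB, so the saving is concentrated in objects much smaller than 64KB.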

Loïc Dachary

2:10 p.m.

Hi Josh :-)

Thanks for the update: this is great news and I look forward to using this once Pacific is released.

Cheers

-- Loïc Dachary, Artisan Logiciel Libre

Steven Pine

11:28 p.m.

Will there be a well-documented strategy or method for changing block sizes on existing clusters? Is there anything that could be done to optimize the cutover or assist clusters with it?

-- Steven Pine

Josh Durgin

11:44 p.m.

Changing min_alloc_size in BlueStore requires redeploying the OSD. There's no other way to regain the space that's already allocated.

In terms of making this easier, we're looking to automate rolling format changes across a cluster with cephadm in the future.

Josh

Steven Pine

11:45 p.m.

Yes please, help for clusters moving over to a 4KB block size would be greatly appreciated.

David Oganezov

1 Jun

1:37 a.m.

Hey Josh! Sorry for necroing this thread, but my team is currently running a Pacific cluster that was updated from Nautilus, and we are rebuilding hosts one by one to reclaim the space in the OSDs. We might have missed it, but was the automated rolling format with cephadm eventually implemented? Thanks!


ceph-users@ceph.io
