Ceph on a single host makes little to no sense. You're better off running
something like ZFS.
On Tue, 6 Jul 2021 at 23:52, Wladimir Mutel <mwg(a)mwg.dp.ua> wrote:
I started my experimental 1-host/8-HDDs setup in 2018 with Luminous,
and I read https://ceph.io/community/new-luminous-erasure-coding-rbd-cephfs/ ,
which had interested me in using Bluestore and rewritable EC pools for
RBD data.
I have about 22 TiB of raw storage, and ceph df shows this:
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 22 TiB 2.7 TiB 19 TiB 19 TiB 87.78
TOTAL 22 TiB 2.7 TiB 19 TiB 19 TiB 87.78
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
jerasure21 1 256 9.0 TiB 2.32M 13 TiB 97.06 276 GiB
libvirt 2 128 1.5 TiB 413.60k 4.5 TiB 91.77 140 GiB
rbd 3 32 798 KiB 5 2.7 MiB 0 138 GiB
iso 4 32 2.3 MiB 10 8.0 MiB 0 138 GiB
device_health_metrics 5 1 31 MiB 9 94 MiB 0.02 138 GiB
If I add USED for libvirt and jerasure21, I get 17.5 TiB, and 2.7 TiB is
shown as RAW STORAGE/AVAIL. The sum of POOLS/MAX AVAIL is about 840 GiB;
where is the other 2.7 - 0.840 =~ 1.86 TiB? Or, in other words, where is
my (RAW STORAGE/RAW USED) - (SUM(POOLS/USED)) = 19 - 17.5 = 1.5 TiB?
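A quick sanity check of the USED column above (a sketch in Python; it
assumes jerasure21 is a k=2, m=1 EC pool, as its name and the Luminous
article suggest, and that libvirt is a 3x replicated pool):

```python
# Rough check of the raw-space multipliers behind the "ceph df" USED column.
# Assumptions: jerasure21 is a k=2, m=1 EC pool; libvirt is 3x replicated.

def ec_multiplier(k: int, m: int) -> float:
    """Raw bytes written per byte stored in a k+m erasure-coded pool."""
    return (k + m) / k

jerasure21_stored = 9.0   # TiB, from the STORED column
libvirt_stored = 1.5      # TiB, from the STORED column

jerasure21_used = jerasure21_stored * ec_multiplier(2, 1)  # 13.5 TiB
libvirt_used = libvirt_stored * 3                          # 4.5 TiB

print(f"jerasure21: ~{jerasure21_used:.1f} TiB, libvirt: ~{libvirt_used:.1f} TiB")
```

libvirt matches df exactly (1.5 TiB x 3 = 4.5 TiB) and jerasure21 comes
out near the 13 TiB shown, so the remaining gap between RAW USED and the
per-pool sums may come from per-object allocation rounding at the OSD
level rather than from the EC multiplier itself.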
As it does not seem I will get any more hosts for this setup, I am
seriously thinking of bringing down this Ceph cluster and setting up
instead a Btrfs volume storing qcow2 images served over iSCSI, which
looks simpler to me for a single-host situation.
Josh Baergen wrote:
Hey Wladimir,
I actually don't know where this is referenced in the docs, if anywhere.
Googling around shows many people discovering this overhead the hard way on
ceph-users.
I also don't know the rbd journaling mechanism in enough depth to
comment on whether it could be causing this issue for you. Are you
seeing a high allocated:stored ratio on your cluster?
Josh
On Sun, Jul 4, 2021 at 6:52 AM Wladimir Mutel <mwg(a)mwg.dp.ua> wrote:
Dear Mr Baergen,
thanks a lot for your very concise explanation. However, I would like to
learn more about why the default Bluestore allocation size causes such a
big storage overhead, and where in the Ceph docs it is explained how and
what to watch for to avoid hitting this phenomenon again and again. I
have a feeling this is what I get on my experimental Ceph setup with the
simplest JErasure 2+1 data pool. Could it be caused by journaled RBD
writes to the EC data pool?
Josh Baergen wrote:
> Hey Arkadiy,
>
> If the OSDs are on HDDs and were created with the default
> bluestore_min_alloc_size_hdd, which is still 64KiB in Octopus, then in
> effect data will be allocated from the pool in 640KiB chunks (64KiB *
> (k+m)). 5.36M objects taking up 501GiB is an average object size of
> 98KiB, which results in a ratio of 6.53:1 allocated:stored, which is
> pretty close to the 7:1 observed.
>
> If my assumption about your configuration is correct, then the only way
> to fix this is to adjust bluestore_min_alloc_size_hdd and recreate all
> your OSDs, which will take a while...
>
> Josh
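Josh's arithmetic can be sketched in a few lines of Python (a sketch
only; the 64 KiB figure is the default bluestore_min_alloc_size_hdd he
cites, and the padding model, one min_alloc_size-rounded allocation per
EC chunk, is an assumption about how Bluestore lays the chunks out):

```python
# Sketch of the allocation-overhead arithmetic from Josh's reply.
# Each RADOS object is split into k data chunks, each chunk is padded up
# to min_alloc_size on disk, and m parity chunks of the same size follow.
import math

def allocated_per_object(object_size, k, m, min_alloc_size):
    """Bytes actually allocated on disk for one erasure-coded object."""
    chunk = object_size / k  # logical size of each data chunk
    padded = math.ceil(chunk / min_alloc_size) * min_alloc_size
    return padded * (k + m)

KIB, GIB = 1024, 1024**3

stored = 501 * GIB          # STORED for default.rgw.buckets.data
objects = 5.36e6            # OBJECTS column
avg_obj = stored / objects  # average object size, ~98 KiB

alloc = allocated_per_object(avg_obj, k=6, m=4, min_alloc_size=64 * KIB)
ratio = alloc / avg_obj
print(f"avg object ~{avg_obj / KIB:.0f} KiB, allocated:stored ~{ratio:.2f}:1")
```

This prints roughly 98 KiB and 6.53:1, matching the numbers in the
reply: every small object still consumes a full 64 KiB x (6+4) = 640 KiB
of raw space.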
>
> On Tue, Jun 29, 2021 at 3:07 PM Arkadiy Kulev <eth(a)ethaniel.com> wrote:
>
>> The pool *default.rgw.buckets.data* has *501 GiB* stored, but USED
>> shows *3.5 TiB* (7 times higher!):
>>
>> root@ceph-01:~# ceph df
>> --- RAW STORAGE ---
>> CLASS SIZE AVAIL USED RAW USED %RAW USED
>> hdd 196 TiB 193 TiB 3.5 TiB 3.6 TiB 1.85
>> TOTAL 196 TiB 193 TiB 3.5 TiB 3.6 TiB 1.85
>>
>> --- POOLS ---
>> POOL                       ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
>> device_health_metrics       1    1   19 KiB       12   56 KiB      0     61 TiB
>> .rgw.root                   2   32  2.6 KiB        6  1.1 MiB      0     61 TiB
>> default.rgw.log             3   32  168 KiB      210   13 MiB      0     61 TiB
>> default.rgw.control         4   32      0 B        8      0 B      0     61 TiB
>> default.rgw.meta            5    8  4.8 KiB       11  1.9 MiB      0     61 TiB
>> default.rgw.buckets.index   6    8  1.6 GiB      211  4.7 GiB      0     61 TiB
>> default.rgw.buckets.data   10  128  501 GiB    5.36M  3.5 TiB   1.90    110 TiB
>>
>> The *default.rgw.buckets.data* pool is using erasure coding:
>>
>> root@ceph-01:~# ceph osd erasure-code-profile get EC_RGW_HOST
>> crush-device-class=hdd
>> crush-failure-domain=host
>> crush-root=default
>> jerasure-per-chunk-alignment=false
>> k=6
>> m=4
>> plugin=jerasure
>> technique=reed_sol_van
>> w=8
>>
>> If anyone could help explain why it's using up 7 times more space, it
>> would help a lot. Versioning is disabled. ceph version 15.2.13
>> (octopus stable).
>>
>> Sincerely,
>> Ark.
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io
>> To unsubscribe send an email to ceph-users-leave(a)ceph.io