BlueFS spillover detected, why, what?

List overview All Threads
Download

newer

older

OSDs get full with bluestore logs

How do I configure a Roadrunner...

Simon Oosthoek

20 Aug 2020 20 Aug '20

11:38 a.m.

Hi Recently our ceph cluster (nautilus) is experiencing bluefs spillovers, just 2 osd's and I disabled the warning for these osds. (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false) I'm wondering what causes this and how this can be prevented. As I understand it the rocksdb for the OSD needs to store more than fits on the NVME logical volume (123G for 12T OSD). A way to fix it could be to increase the logical volume on the nvme (if there was space on the nvme, which there isn't at the moment). This is the current size of the cluster and how much is free: [root@cephmon1 ~]# ceph df RAW STORAGE: CLASS SIZE AVAIL USED RAW USED %RAW USED hdd 1.8 PiB 842 TiB 974 TiB 974 TiB 53.63 TOTAL 1.8 PiB 842 TiB 974 TiB 974 TiB 53.63 POOLS: POOL ID STORED OBJECTS USED %USED MAX AVAIL cephfs_data 1 572 MiB 121.26M 2.4 GiB 0 167 TiB cephfs_metadata 2 56 GiB 5.15M 57 GiB 0 167 TiB cephfs_data_3copy 8 201 GiB 51.68k 602 GiB 0.09 222 TiB cephfs_data_ec83 13 643 TiB 279.75M 953 TiB 58.86 485 TiB rbd 14 21 GiB 5.66k 64 GiB 0 222 TiB .rgw.root 15 1.2 KiB 4 1 MiB 0 167 TiB default.rgw.control 16 0 B 8 0 B 0 167 TiB default.rgw.meta 17 765 B 4 1 MiB 0 167 TiB default.rgw.log 18 0 B 207 0 B 0 167 TiB cephfs_data_ec57 20 433 MiB 230 1.2 GiB 0 278 TiB The amount used can still grow a bit before we need to add nodes, but apparently we are running into the limits of our rocskdb partitions. Did we choose a parameter (e.g. minimal object size) too small, so we have too much objects on these spillover OSDs? Or is it that too many small files are stored on the cephfs filesystems? When we expand the cluster, we can choose larger nvme devices to allow larger rocksdb partitions, but is that the right way to deal with this, or should we adjust some parameters on the cluster that will reduce the rocksdb size? Cheers /Simon

Show replies by date

Michael Bisig

20 Aug 20 Aug

12:46 p.m.

Hi Simon As far as I know, RocksDB only uses "leveled" space on the NVME partition. The values are set to be 300MB, 3GB, 30GB and 300GB. Every DB space above such a limit will automatically end up on slow devices. In your setup where you have 123GB per OSD that means you only use 30GB of fast device. The DB which spills over this limit will be offloaded to the HDD and accordingly, it slows down requests and compactions. You can proof what your OSD currently consumes with: ceph daemon osd.X perf dump Informative values are `db_total_bytes`, `db_used_bytes` and `slow_used_bytes`. This changes regularly because of the ongoing compactions but Prometheus mgr module exports these values such that you can track it. Small files generally leads to bigger RocksDB, especially when you use EC, but this depends on the actual amount and file sizes. I hope this helps. Regards, Michael On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek(a)science.ru.nl> wrote: Hi Recently our ceph cluster (nautilus) is experiencing bluefs spillovers, just 2 osd's and I disabled the warning for these osds. (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false) I'm wondering what causes this and how this can be prevented. As I understand it the rocksdb for the OSD needs to store more than fits on the NVME logical volume (123G for 12T OSD). A way to fix it could be to increase the logical volume on the nvme (if there was space on the nvme, which there isn't at the moment). This is the current size of the cluster and how much is free: [root@cephmon1 ~]# ceph df RAW STORAGE: CLASS SIZE AVAIL USED RAW USED %RAW USED hdd 1.8 PiB 842 TiB 974 TiB 974 TiB 53.63 TOTAL 1.8 PiB 842 TiB 974 TiB 974 TiB 53.63 POOLS: POOL ID STORED OBJECTS USED %USED MAX AVAIL cephfs_data 1 572 MiB 121.26M 2.4 GiB 0 167 TiB cephfs_metadata 2 56 GiB 5.15M 57 GiB 0 167 TiB cephfs_data_3copy 8 201 GiB 51.68k 602 GiB 0.09 222 TiB cephfs_data_ec83 13 643 TiB 279.75M 953 TiB 58.86 485 TiB rbd 14 21 GiB 5.66k 64 GiB 0 222 TiB .rgw.root 15 1.2 KiB 4 1 MiB 0 167 TiB default.rgw.control 16 0 B 8 0 B 0 167 TiB default.rgw.meta 17 765 B 4 1 MiB 0 167 TiB default.rgw.log 18 0 B 207 0 B 0 167 TiB cephfs_data_ec57 20 433 MiB 230 1.2 GiB 0 278 TiB The amount used can still grow a bit before we need to add nodes, but apparently we are running into the limits of our rocskdb partitions. Did we choose a parameter (e.g. minimal object size) too small, so we have too much objects on these spillover OSDs? Or is it that too many small files are stored on the cephfs filesystems? When we expand the cluster, we can choose larger nvme devices to allow larger rocksdb partitions, but is that the right way to deal with this, or should we adjust some parameters on the cluster that will reduce the rocksdb size? Cheers /Simon _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Simon Oosthoek

1:09 p.m.

...

Michael Bisig

1:36 p.m.

Hi Simon Unfortunately, the other NVME space is wasted or at least, this is the information we gathered during our research. This fact is due to the RocksDB level management which is explained here (https://github.com/facebook/rocksdb/wiki/Leveled-Compaction). I don't think it's a hard limit but it will be something above these values. Also consult this thread (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-February/033286.ht…). It's probably better to go a bit over these limits to be on the safe side. Exactly, reality is always different. We also struggle with small files which lead to further problems. Accordingly, the right initial setting is pretty important and depends on your individual usecase. Regards, Michael On 20.08.20, 10:40, "Simon Oosthoek" <s.oosthoek(a)science.ru.nl> wrote: Hi Michael, thanks for the explanation! So if I understand correctly, we waste 93 GB per OSD on unused NVME space, because only 30GB is actually used...? And to improve the space for rocksdb, we need to plan for 300GB per rocksdb partition in order to benefit from this advantage.... Reducing the number of small files is something we always ask of our users, but reality is what it is ;-) I'll have to look into how I can get an informative view on these metrics... It's pretty overwhelming the amount of information coming out of the ceph cluster, even when you look only superficially... Cheers, /Simon On 20/08/2020 10:16, Michael Bisig wrote:

...

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Simon Oosthoek

1:42 p.m.

Hi Michael, thanks for the pointers! This is our first production ceph cluster and we have to learn as we go... Small files is always a problem for all (networked) filesystems, usually it just trashes performance, but in this case it has another unfortunate side effect with the rocksdb :-( Cheers /Simon On 20/08/2020 11:06, Michael Bisig wrote:

...

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Igor Fedotov

1:57 p.m.

Hi Simon, starting Nautlus v14.2.10 Bluestore is able to use 'wasted' space at DB volume. see this PR: https://github.com/ceph/ceph/pull/29687 Nice overview on the overall BlueFS/RocksDB design can be find here: https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b… Which also includes some overview (as well as additional concerns) for changes brought by the above-mentioned PR. Thanks, Igor On 8/20/2020 11:39 AM, Simon Oosthoek wrote: > Hi Michael, > > thanks for the explanation! So if I understand correctly, we waste 93 > GB per OSD on unused NVME space, because only 30GB is actually used...? > > And to improve the space for rocksdb, we need to plan for 300GB per > rocksdb partition in order to benefit from this advantage.... > > Reducing the number of small files is something we always ask of our > users, but reality is what it is ;-) > > I'll have to look into how I can get an informative view on these > metrics... It's pretty overwhelming the amount of information coming > out of the ceph cluster, even when you look only superficially... > > Cheers, > > /Simon > > On 20/08/2020 10:16, Michael Bisig wrote: >> Hi Simon >> >> As far as I know, RocksDB only uses "leveled" space on the NVME >> partition. The values are set to be 300MB, 3GB, 30GB and 300GB. Every >> DB space above such a limit will automatically end up on slow devices. >> In your setup where you have 123GB per OSD that means you only use >> 30GB of fast device. The DB which spills over this limit will be >> offloaded to the HDD and accordingly, it slows down requests and >> compactions. >> >> You can proof what your OSD currently consumes with: >> ceph daemon osd.X perf dump >> >> Informative values are `db_total_bytes`, `db_used_bytes` and >> `slow_used_bytes`. This changes regularly because of the ongoing >> compactions but Prometheus mgr module exports these values such that >> you can track it. >> >> Small files generally leads to bigger RocksDB, especially when you >> use EC, but this depends on the actual amount and file sizes. >> >> I hope this helps. >> Regards, >> Michael >> >> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek(a)science.ru.nl> wrote: >> >> Hi >> >> Recently our ceph cluster (nautilus) is experiencing bluefs >> spillovers, >> just 2 osd's and I disabled the warning for these osds. >> (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false) >> >> I'm wondering what causes this and how this can be prevented. >> >> As I understand it the rocksdb for the OSD needs to store more >> than fits >> on the NVME logical volume (123G for 12T OSD). A way to fix it >> could be >> to increase the logical volume on the nvme (if there was space >> on the >> nvme, which there isn't at the moment). >> >> This is the current size of the cluster and how much is free: >> >> [root@cephmon1 ~]# ceph df >> RAW STORAGE: >> CLASS SIZE AVAIL USED RAW USED >> %RAW USED >> hdd 1.8 PiB 842 TiB 974 TiB 974 >> TiB 53.63 >> TOTAL 1.8 PiB 842 TiB 974 TiB 974 >> TiB 53.63 >> >> POOLS: >> POOL ID STORED OBJECTS USED >> %USED MAX AVAIL >> cephfs_data 1 572 MiB 121.26M 2.4 GiB >> 0 167 TiB >> cephfs_metadata 2 56 GiB 5.15M 57 GiB >> 0 167 TiB >> cephfs_data_3copy 8 201 GiB 51.68k 602 GiB >> 0.09 222 TiB >> cephfs_data_ec83 13 643 TiB 279.75M 953 TiB >> 58.86 485 TiB >> rbd 14 21 GiB 5.66k 64 GiB >> 0 222 TiB >> .rgw.root 15 1.2 KiB 4 1 MiB >> 0 167 TiB >> default.rgw.control 16 0 B 8 0 B >> 0 167 TiB >> default.rgw.meta 17 765 B 4 1 MiB >> 0 167 TiB >> default.rgw.log 18 0 B 207 0 B >> 0 167 TiB >> cephfs_data_ec57 20 433 MiB 230 1.2 GiB >> 0 278 TiB >> >> The amount used can still grow a bit before we need to add >> nodes, but >> apparently we are running into the limits of our rocskdb >> partitions. >> >> Did we choose a parameter (e.g. minimal object size) too small, >> so we >> have too much objects on these spillover OSDs? Or is it that too >> many >> small files are stored on the cephfs filesystems? >> >> When we expand the cluster, we can choose larger nvme devices to >> allow >> larger rocksdb partitions, but is that the right way to deal >> with this, >> or should we adjust some parameters on the cluster that will >> reduce the >> rocksdb size? >> >> Cheers >> >> /Simon >> _______________________________________________ >> ceph-users mailing list -- ceph-users(a)ceph.io >> To unsubscribe send an email to ceph-users-leave(a)ceph.io >> > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io

Seena Fallah

3:14 p.m.

...

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Igor Fedotov

5:24 p.m.

...

Hi Igor. Could you please tell why this config is in LEVEL_DEV (https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe66… As it is documented in Ceph we can't use LEVEL_DEV in production environments! Thanks On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov(a)suse.de <mailto:ifedotov@suse.de>> wrote: Hi Simon, starting Nautlus v14.2.10 Bluestore is able to use 'wasted' space at DB volume. see this PR: https://github.com/ceph/ceph/pull/29687 Nice overview on the overall BlueFS/RocksDB design can be find here: https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b… Which also includes some overview (as well as additional concerns) for changes brought by the above-mentioned PR. Thanks, Igor On 8/20/2020 11:39 AM, Simon Oosthoek wrote:

Hi Michael, thanks for the explanation! So if I understand correctly, we

waste 93

GB per OSD on unused NVME space, because only 30GB is actually

used...?

And to improve the space for rocksdb, we need to plan for 300GB per rocksdb partition in order to benefit from this advantage.... Reducing the number of small files is something we always ask of

our

users, but reality is what it is ;-) I'll have to look into how I can get an informative view on these metrics... It's pretty overwhelming the amount of information

coming

out of the ceph cluster, even when you look only superficially... Cheers, /Simon On 20/08/2020 10:16, Michael Bisig wrote: > Hi Simon > > As far as I know, RocksDB only uses "leveled" space on the NVME > partition. The values are set to be 300MB, 3GB, 30GB and 300GB.

Every

> DB space above such a limit will automatically end up on slow

devices.

> In your setup where you have 123GB per OSD that means you only use > 30GB of fast device. The DB which spills over this limit will be > offloaded to the HDD and accordingly, it slows down requests and > compactions. > > You can proof what your OSD currently consumes with: > ceph daemon osd.X perf dump > > Informative values are `db_total_bytes`, `db_used_bytes` and > `slow_used_bytes`. This changes regularly because of the ongoing > compactions but Prometheus mgr module exports these values such

that

> you can track it. > > Small files generally leads to bigger RocksDB, especially when you > use EC, but this depends on the actual amount and file sizes. > > I hope this helps. > Regards, > Michael > > On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek(a)science.ru.nl

<mailto:s.oosthoek@science.ru.nl>> wrote:

> > Hi > > Recently our ceph cluster (nautilus) is experiencing bluefs > spillovers, > just 2 osd's and I disabled the warning for these osds. > (ceph config set osd.125

bluestore_warn_on_bluefs_spillover false)

> > I'm wondering what causes this and how this can be prevented. > > As I understand it the rocksdb for the OSD needs to store

> than fits > on the NVME logical volume (123G for 12T OSD). A way to

fix it

> could be > to increase the logical volume on the nvme (if there was

space

> on the > nvme, which there isn't at the moment). > > This is the current size of the cluster and how much is free: > > [root@cephmon1 ~]# ceph df > RAW STORAGE: > CLASS SIZE AVAIL USED RAW USED > %RAW USED > hdd 1.8 PiB 842 TiB 974 TiB 974 > TiB 53.63 > TOTAL 1.8 PiB 842 TiB 974 TiB 974 > TiB 53.63 > > POOLS: > POOL ID STORED OBJECTS USED > %USED MAX AVAIL > cephfs_data 1 572 MiB 121.26M 2.4 GiB > 0 167 TiB > cephfs_metadata 2 56 GiB 5.15M 57 GiB > 0 167 TiB > cephfs_data_3copy 8 201 GiB 51.68k 602 GiB > 0.09 222 TiB > cephfs_data_ec83 13 643 TiB 279.75M 953 TiB > 58.86 485 TiB > rbd 14 21 GiB 5.66k 64 GiB > 0 222 TiB > .rgw.root 15 1.2 KiB 4 1 MiB > 0 167 TiB > default.rgw.control 16 0 B 8 0 B > 0 167 TiB > default.rgw.meta 17 765 B 4 1 MiB > 0 167 TiB > default.rgw.log 18 0 B 207 0 B > 0 167 TiB > cephfs_data_ec57 20 433 MiB 230

1.2 GiB

> 0 278 TiB > > The amount used can still grow a bit before we need to add > nodes, but > apparently we are running into the limits of our rocskdb > partitions. > > Did we choose a parameter (e.g. minimal object size) too

small,

> so we > have too much objects on these spillover OSDs? Or is it

that too

> many > small files are stored on the cephfs filesystems? > > When we expand the cluster, we can choose larger nvme

devices to

> allow > larger rocksdb partitions, but is that the right way to deal > with this, > or should we adjust some parameters on the cluster that will > reduce the > rocksdb size? > > Cheers > > /Simon > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io

<mailto:ceph-users@ceph.io>

> To unsubscribe send an email to ceph-users-leave(a)ceph.io

<mailto:ceph-users-leave@ceph.io>

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io

<mailto:ceph-users@ceph.io>

To unsubscribe send an email to ceph-users-leave(a)ceph.io

<mailto:ceph-users-leave@ceph.io> _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io <mailto:ceph-users@ceph.io> To unsubscribe send an email to ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io>

Seena Fallah

5:36 p.m.

...

Hi Seena, this parameter isn't intended to be adjusted in production environments - it's supposed that default behavior covers all regular customers' needs. The issue though is that default setting is invalid. It should be 'use_some_extra'. Gonna fix that shortly... Thanks, Igor On 8/20/2020 1:44 PM, Seena Fallah wrote: Hi Igor. Could you please tell why this config is in LEVEL_DEV ( https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe66… As it is documented in Ceph we can't use LEVEL_DEV in production environments! Thanks On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov(a)suse.de> wrote:

Hi Michael, thanks for the explanation! So if I understand correctly, we waste 93 GB per OSD on unused NVME space, because only 30GB is actually used...? And to improve the space for rocksdb, we need to plan for 300GB per rocksdb partition in order to benefit from this advantage.... Reducing the number of small files is something we always ask of our users, but reality is what it is ;-) I'll have to look into how I can get an informative view on these metrics... It's pretty overwhelming the amount of information coming out of the ceph cluster, even when you look only superficially... Cheers, /Simon On 20/08/2020 10:16, Michael Bisig wrote: > Hi Simon > > As far as I know, RocksDB only uses "leveled" space on the NVME > partition. The values are set to be 300MB, 3GB, 30GB and 300GB. Every > DB space above such a limit will automatically end up on slow devices. > In your setup where you have 123GB per OSD that means you only use > 30GB of fast device. The DB which spills over this limit will be > offloaded to the HDD and accordingly, it slows down requests and > compactions. > > You can proof what your OSD currently consumes with: > ceph daemon osd.X perf dump > > Informative values are `db_total_bytes`, `db_used_bytes` and > `slow_used_bytes`. This changes regularly because of the ongoing > compactions but Prometheus mgr module exports these values such that > you can track it. > > Small files generally leads to bigger RocksDB, especially when you > use EC, but this depends on the actual amount and file sizes. > > I hope this helps. > Regards, > Michael > > On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek(a)science.ru.nl>

wrote:

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Igor Fedotov

6 p.m.

...

Greate, thanks. Is it safe to change it manually in ceph.conf until next nautilus release or should I wait for the next nautilus release for this change? I mean does qa run on this value for this config that we could trust and change it or should we wait until the next nautilus release that qa ran on this value? On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov <ifedotov(a)suse.de <mailto:ifedotov@suse.de>> wrote: Hi Seena, this parameter isn't intended to be adjusted in production environments - it's supposed that default behavior covers all regular customers' needs. The issue though is that default setting is invalid. It should be 'use_some_extra'. Gonna fix that shortly... Thanks, Igor On 8/20/2020 1:44 PM, Seena Fallah wrote: > Hi Igor. > > Could you please tell why this config is in LEVEL_DEV > (https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe66… > As it is documented in Ceph we can't use LEVEL_DEV in production > environments! > > Thanks > > On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov(a)suse.de > <mailto:ifedotov@suse.de>> wrote: > > Hi Simon, > > > starting Nautlus v14.2.10 Bluestore is able to use 'wasted' > space at DB > volume. > > see this PR: https://github.com/ceph/ceph/pull/29687 > > Nice overview on the overall BlueFS/RocksDB design can be > find here: > > https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b… > > Which also includes some overview (as well as additional > concerns) for > changes brought by the above-mentioned PR. > > > Thanks, > > Igor > > > On 8/20/2020 11:39 AM, Simon Oosthoek wrote: > > Hi Michael, > > > > thanks for the explanation! So if I understand correctly, > we waste 93 > > GB per OSD on unused NVME space, because only 30GB is > actually used...? > > > > And to improve the space for rocksdb, we need to plan for > 300GB per > > rocksdb partition in order to benefit from this advantage.... > > > > Reducing the number of small files is something we always > ask of our > > users, but reality is what it is ;-) > > > > I'll have to look into how I can get an informative view on > these > > metrics... It's pretty overwhelming the amount of > information coming > > out of the ceph cluster, even when you look only > superficially... > > > > Cheers, > > > > /Simon > > > > On 20/08/2020 10:16, Michael Bisig wrote: > >> Hi Simon > >> > >> As far as I know, RocksDB only uses "leveled" space on the > NVME > >> partition. The values are set to be 300MB, 3GB, 30GB and > 300GB. Every > >> DB space above such a limit will automatically end up on > slow devices. > >> In your setup where you have 123GB per OSD that means you > only use > >> 30GB of fast device. The DB which spills over this limit > will be > >> offloaded to the HDD and accordingly, it slows down > requests and > >> compactions. > >> > >> You can proof what your OSD currently consumes with: > >> ceph daemon osd.X perf dump > >> > >> Informative values are `db_total_bytes`, `db_used_bytes` and > >> `slow_used_bytes`. This changes regularly because of the > ongoing > >> compactions but Prometheus mgr module exports these values > such that > >> you can track it. > >> > >> Small files generally leads to bigger RocksDB, especially > when you > >> use EC, but this depends on the actual amount and file sizes. > >> > >> I hope this helps. > >> Regards, > >> Michael > >> > >> On 20.08.20, 09:10, "Simon Oosthoek" > <s.oosthoek(a)science.ru.nl <mailto:s.oosthoek@science.ru.nl>> > wrote: > >> > >> Hi > >> > >> Recently our ceph cluster (nautilus) is experiencing > bluefs > >> spillovers, > >> just 2 osd's and I disabled the warning for these osds. > >> (ceph config set osd.125 > bluestore_warn_on_bluefs_spillover false) > >> > >> I'm wondering what causes this and how this can be > prevented. > >> > >> As I understand it the rocksdb for the OSD needs to > store more > >> than fits > >> on the NVME logical volume (123G for 12T OSD). A way > to fix it > >> could be > >> to increase the logical volume on the nvme (if there > was space > >> on the > >> nvme, which there isn't at the moment). > >> > >> This is the current size of the cluster and how much > is free: > >> > >> [root@cephmon1 ~]# ceph df > >> RAW STORAGE: > >> CLASS SIZE AVAIL USED RAW USED > >> %RAW USED > >> hdd 1.8 PiB 842 TiB 974 TiB 974 > >> TiB 53.63 > >> TOTAL 1.8 PiB 842 TiB 974 TiB 974 > >> TiB 53.63 > >> > >> POOLS: > >> POOL ID STORED OBJECTS USED > >> %USED MAX AVAIL > >> cephfs_data 1 572 MiB > 121.26M 2.4 GiB > >> 0 167 TiB > >> cephfs_metadata 2 56 GiB > 5.15M 57 GiB > >> 0 167 TiB > >> cephfs_data_3copy 8 201 GiB > 51.68k 602 GiB > >> 0.09 222 TiB > >> cephfs_data_ec83 13 643 TiB > 279.75M 953 TiB > >> 58.86 485 TiB > >> rbd 14 21 GiB > 5.66k 64 GiB > >> 0 222 TiB > >> .rgw.root 15 1.2 KiB 4 1 MiB > >> 0 167 TiB > >> default.rgw.control 16 0 B 8 0 B > >> 0 167 TiB > >> default.rgw.meta 17 765 B 4 1 MiB > >> 0 167 TiB > >> default.rgw.log 18 0 B 207 0 B > >> 0 167 TiB > >> cephfs_data_ec57 20 433 MiB > 230 1.2 GiB > >> 0 278 TiB > >> > >> The amount used can still grow a bit before we need > to add > >> nodes, but > >> apparently we are running into the limits of our rocskdb > >> partitions. > >> > >> Did we choose a parameter (e.g. minimal object size) > too small, > >> so we > >> have too much objects on these spillover OSDs? Or is > it that too > >> many > >> small files are stored on the cephfs filesystems? > >> > >> When we expand the cluster, we can choose larger nvme > devices to > >> allow > >> larger rocksdb partitions, but is that the right way > to deal > >> with this, > >> or should we adjust some parameters on the cluster > that will > >> reduce the > >> rocksdb size? > >> > >> Cheers > >> > >> /Simon > >> _______________________________________________ > >> ceph-users mailing list -- ceph-users(a)ceph.io > <mailto:ceph-users@ceph.io> > >> To unsubscribe send an email to > ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io> > >> > > _______________________________________________ > > ceph-users mailing list -- ceph-users(a)ceph.io > <mailto:ceph-users@ceph.io> > > To unsubscribe send an email to ceph-users-leave(a)ceph.io > <mailto:ceph-users-leave@ceph.io> > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > <mailto:ceph-users@ceph.io> > To unsubscribe send an email to ceph-users-leave(a)ceph.io > <mailto:ceph-users-leave@ceph.io> >

Seena Fallah

6:45 p.m.

So you won't backport it to nautilus until it gets default to master for a while? On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov <ifedotov(a)suse.de> wrote:

...

From technical/developer's point of view I don't see any issues with tuning this option. But since now I wouldn't recommend to enable it in production as it partially bypassed our regular development cycle. Being enabled in master for a while by default allows more develpers to use/try the feature before release. This can be considered as an additional implicit QA process. But as we just discovered this hasn't happened. Hence you can definitely try it but this exposes your cluster(s) to some risk as for any new (and incompletely tested) feature.... Thanks, Igor On 8/20/2020 4:06 PM, Seena Fallah wrote: Greate, thanks. Is it safe to change it manually in ceph.conf until next nautilus release or should I wait for the next nautilus release for this change? I mean does qa run on this value for this config that we could trust and change it or should we wait until the next nautilus release that qa ran on this value? On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov <ifedotov(a)suse.de> wrote: > Hi Seena, > > this parameter isn't intended to be adjusted in production environments - > it's supposed that default behavior covers all regular customers' needs. > > The issue though is that default setting is invalid. It should be > 'use_some_extra'. Gonna fix that shortly... > > > Thanks, > > Igor > > > > > On 8/20/2020 1:44 PM, Seena Fallah wrote: > > Hi Igor. > > Could you please tell why this config is in LEVEL_DEV ( > https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe66… > As it is documented in Ceph we can't use LEVEL_DEV in production > environments! > > Thanks > > On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov(a)suse.de> wrote: > >> Hi Simon, >> >> >> starting Nautlus v14.2.10 Bluestore is able to use 'wasted' space at DB >> volume. >> >> see this PR: https://github.com/ceph/ceph/pull/29687 >> >> Nice overview on the overall BlueFS/RocksDB design can be find here: >> >> >> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b… >> >> Which also includes some overview (as well as additional concerns) for >> changes brought by the above-mentioned PR. >> >> >> Thanks, >> >> Igor >> >> >> On 8/20/2020 11:39 AM, Simon Oosthoek wrote: >> > Hi Michael, >> > >> > thanks for the explanation! So if I understand correctly, we waste 93 >> > GB per OSD on unused NVME space, because only 30GB is actually used...? >> > >> > And to improve the space for rocksdb, we need to plan for 300GB per >> > rocksdb partition in order to benefit from this advantage.... >> > >> > Reducing the number of small files is something we always ask of our >> > users, but reality is what it is ;-) >> > >> > I'll have to look into how I can get an informative view on these >> > metrics... It's pretty overwhelming the amount of information coming >> > out of the ceph cluster, even when you look only superficially... >> > >> > Cheers, >> > >> > /Simon >> > >> > On 20/08/2020 10:16, Michael Bisig wrote: >> >> Hi Simon >> >> >> >> As far as I know, RocksDB only uses "leveled" space on the NVME >> >> partition. The values are set to be 300MB, 3GB, 30GB and 300GB. Every >> >> DB space above such a limit will automatically end up on slow devices. >> >> In your setup where you have 123GB per OSD that means you only use >> >> 30GB of fast device. The DB which spills over this limit will be >> >> offloaded to the HDD and accordingly, it slows down requests and >> >> compactions. >> >> >> >> You can proof what your OSD currently consumes with: >> >> ceph daemon osd.X perf dump >> >> >> >> Informative values are `db_total_bytes`, `db_used_bytes` and >> >> `slow_used_bytes`. This changes regularly because of the ongoing >> >> compactions but Prometheus mgr module exports these values such that >> >> you can track it. >> >> >> >> Small files generally leads to bigger RocksDB, especially when you >> >> use EC, but this depends on the actual amount and file sizes. >> >> >> >> I hope this helps. >> >> Regards, >> >> Michael >> >> >> >> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek(a)science.ru.nl> >> wrote: >> >> >> >> Hi >> >> >> >> Recently our ceph cluster (nautilus) is experiencing bluefs >> >> spillovers, >> >> just 2 osd's and I disabled the warning for these osds. >> >> (ceph config set osd.125 bluestore_warn_on_bluefs_spillover >> false) >> >> >> >> I'm wondering what causes this and how this can be prevented. >> >> >> >> As I understand it the rocksdb for the OSD needs to store more >> >> than fits >> >> on the NVME logical volume (123G for 12T OSD). A way to fix it >> >> could be >> >> to increase the logical volume on the nvme (if there was space >> >> on the >> >> nvme, which there isn't at the moment). >> >> >> >> This is the current size of the cluster and how much is free: >> >> >> >> [root@cephmon1 ~]# ceph df >> >> RAW STORAGE: >> >> CLASS SIZE AVAIL USED RAW USED >> >> %RAW USED >> >> hdd 1.8 PiB 842 TiB 974 TiB 974 >> >> TiB 53.63 >> >> TOTAL 1.8 PiB 842 TiB 974 TiB 974 >> >> TiB 53.63 >> >> >> >> POOLS: >> >> POOL ID STORED OBJECTS USED >> >> %USED MAX AVAIL >> >> cephfs_data 1 572 MiB 121.26M 2.4 GiB >> >> 0 167 TiB >> >> cephfs_metadata 2 56 GiB 5.15M 57 GiB >> >> 0 167 TiB >> >> cephfs_data_3copy 8 201 GiB 51.68k 602 GiB >> >> 0.09 222 TiB >> >> cephfs_data_ec83 13 643 TiB 279.75M 953 TiB >> >> 58.86 485 TiB >> >> rbd 14 21 GiB 5.66k 64 GiB >> >> 0 222 TiB >> >> .rgw.root 15 1.2 KiB 4 1 MiB >> >> 0 167 TiB >> >> default.rgw.control 16 0 B 8 0 B >> >> 0 167 TiB >> >> default.rgw.meta 17 765 B 4 1 MiB >> >> 0 167 TiB >> >> default.rgw.log 18 0 B 207 0 B >> >> 0 167 TiB >> >> cephfs_data_ec57 20 433 MiB 230 1.2 GiB >> >> 0 278 TiB >> >> >> >> The amount used can still grow a bit before we need to add >> >> nodes, but >> >> apparently we are running into the limits of our rocskdb >> >> partitions. >> >> >> >> Did we choose a parameter (e.g. minimal object size) too small, >> >> so we >> >> have too much objects on these spillover OSDs? Or is it that too >> >> many >> >> small files are stored on the cephfs filesystems? >> >> >> >> When we expand the cluster, we can choose larger nvme devices to >> >> allow >> >> larger rocksdb partitions, but is that the right way to deal >> >> with this, >> >> or should we adjust some parameters on the cluster that will >> >> reduce the >> >> rocksdb size? >> >> >> >> Cheers >> >> >> >> /Simon >> >> _______________________________________________ >> >> ceph-users mailing list -- ceph-users(a)ceph.io >> >> To unsubscribe send an email to ceph-users-leave(a)ceph.io >> >> >> > _______________________________________________ >> > ceph-users mailing list -- ceph-users(a)ceph.io >> > To unsubscribe send an email to ceph-users-leave(a)ceph.io >> _______________________________________________ >> ceph-users mailing list -- ceph-users(a)ceph.io >> To unsubscribe send an email to ceph-users-leave(a)ceph.io >> >

Igor Fedotov

6:52 p.m.

Correct. On 8/20/2020 5:15 PM, Seena Fallah wrote:

...

So you won't backport it to nautilus until it gets default to master for a while? On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov <ifedotov(a)suse.de <mailto:ifedotov@suse.de>> wrote: From technical/developer's point of view I don't see any issues with tuning this option. But since now I wouldn't recommend to enable it in production as it partially bypassed our regular development cycle. Being enabled in master for a while by default allows more develpers to use/try the feature before release. This can be considered as an additional implicit QA process. But as we just discovered this hasn't happened. Hence you can definitely try it but this exposes your cluster(s) to some risk as for any new (and incompletely tested) feature.... Thanks, Igor On 8/20/2020 4:06 PM, Seena Fallah wrote: > Greate, thanks. > > Is it safe to change it manually in ceph.conf until next nautilus > release or should I wait for the next nautilus release for this > change? I mean does qa run on this value for this config that we > could trust and change it or should we wait until the next > nautilus release that qa ran on this value? > > On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov <ifedotov(a)suse.de > <mailto:ifedotov@suse.de>> wrote: > > Hi Seena, > > this parameter isn't intended to be adjusted in production > environments - it's supposed that default behavior covers all > regular customers' needs. > > The issue though is that default setting is invalid. It > should be 'use_some_extra'. Gonna fix that shortly... > > > Thanks, > > Igor > > > > > On 8/20/2020 1:44 PM, Seena Fallah wrote: >> Hi Igor. >> >> Could you please tell why this config is in LEVEL_DEV >> (https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe66… >> As it is documented in Ceph we can't use LEVEL_DEV in >> production environments! >> >> Thanks >> >> On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov >> <ifedotov(a)suse.de <mailto:ifedotov@suse.de>> wrote: >> >> Hi Simon, >> >> >> starting Nautlus v14.2.10 Bluestore is able to use >> 'wasted' space at DB >> volume. >> >> see this PR: https://github.com/ceph/ceph/pull/29687 >> >> Nice overview on the overall BlueFS/RocksDB design can >> be find here: >> >> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b… >> >> Which also includes some overview (as well as additional >> concerns) for >> changes brought by the above-mentioned PR. >> >> >> Thanks, >> >> Igor >> >> >> On 8/20/2020 11:39 AM, Simon Oosthoek wrote: >> > Hi Michael, >> > >> > thanks for the explanation! So if I understand >> correctly, we waste 93 >> > GB per OSD on unused NVME space, because only 30GB is >> actually used...? >> > >> > And to improve the space for rocksdb, we need to plan >> for 300GB per >> > rocksdb partition in order to benefit from this >> advantage.... >> > >> > Reducing the number of small files is something we >> always ask of our >> > users, but reality is what it is ;-) >> > >> > I'll have to look into how I can get an informative >> view on these >> > metrics... It's pretty overwhelming the amount of >> information coming >> > out of the ceph cluster, even when you look only >> superficially... >> > >> > Cheers, >> > >> > /Simon >> > >> > On 20/08/2020 10:16, Michael Bisig wrote: >> >> Hi Simon >> >> >> >> As far as I know, RocksDB only uses "leveled" space >> on the NVME >> >> partition. The values are set to be 300MB, 3GB, 30GB >> and 300GB. Every >> >> DB space above such a limit will automatically end up >> on slow devices. >> >> In your setup where you have 123GB per OSD that means >> you only use >> >> 30GB of fast device. The DB which spills over this >> limit will be >> >> offloaded to the HDD and accordingly, it slows down >> requests and >> >> compactions. >> >> >> >> You can proof what your OSD currently consumes with: >> >> ceph daemon osd.X perf dump >> >> >> >> Informative values are `db_total_bytes`, >> `db_used_bytes` and >> >> `slow_used_bytes`. This changes regularly because of >> the ongoing >> >> compactions but Prometheus mgr module exports these >> values such that >> >> you can track it. >> >> >> >> Small files generally leads to bigger RocksDB, >> especially when you >> >> use EC, but this depends on the actual amount and >> file sizes. >> >> >> >> I hope this helps. >> >> Regards, >> >> Michael >> >> >> >> On 20.08.20, 09:10, "Simon Oosthoek" >> <s.oosthoek(a)science.ru.nl >> <mailto:s.oosthoek@science.ru.nl>> wrote: >> >> >> >> Hi >> >> >> >> Recently our ceph cluster (nautilus) is >> experiencing bluefs >> >> spillovers, >> >> just 2 osd's and I disabled the warning for >> these osds. >> >> (ceph config set osd.125 >> bluestore_warn_on_bluefs_spillover false) >> >> >> >> I'm wondering what causes this and how this can >> be prevented. >> >> >> >> As I understand it the rocksdb for the OSD needs >> to store more >> >> than fits >> >> on the NVME logical volume (123G for 12T OSD). A >> way to fix it >> >> could be >> >> to increase the logical volume on the nvme (if >> there was space >> >> on the >> >> nvme, which there isn't at the moment). >> >> >> >> This is the current size of the cluster and how >> much is free: >> >> >> >> [root@cephmon1 ~]# ceph df >> >> RAW STORAGE: >> >> CLASS SIZE AVAIL USED RAW >> USED >> >> %RAW USED >> >> hdd 1.8 PiB 842 TiB 974 >> TiB 974 >> >> TiB 53.63 >> >> TOTAL 1.8 PiB 842 TiB 974 >> TiB 974 >> >> TiB 53.63 >> >> >> >> POOLS: >> >> POOL ID STORED OBJECTS USED >> >> %USED MAX AVAIL >> >> cephfs_data 1 572 MiB 121.26M 2.4 GiB >> >> 0 167 TiB >> >> cephfs_metadata 2 56 GiB 5.15M 57 GiB >> >> 0 167 TiB >> >> cephfs_data_3copy 8 201 GiB 51.68k >> 602 GiB >> >> 0.09 222 TiB >> >> cephfs_data_ec83 13 643 TiB 279.75M >> 953 TiB >> >> 58.86 485 TiB >> >> rbd 14 21 GiB 5.66k 64 GiB >> >> 0 222 TiB >> >> .rgw.root 15 1.2 KiB 4 1 MiB >> >> 0 167 TiB >> >> default.rgw.control 16 0 B >> 8 0 B >> >> 0 167 TiB >> >> default.rgw.meta 17 765 B 4 1 MiB >> >> 0 167 TiB >> >> default.rgw.log 18 0 B 207 0 B >> >> 0 167 TiB >> >> cephfs_data_ec57 20 433 MiB 230 >> 1.2 GiB >> >> 0 278 TiB >> >> >> >> The amount used can still grow a bit before we >> need to add >> >> nodes, but >> >> apparently we are running into the limits of our >> rocskdb >> >> partitions. >> >> >> >> Did we choose a parameter (e.g. minimal object >> size) too small, >> >> so we >> >> have too much objects on these spillover OSDs? >> Or is it that too >> >> many >> >> small files are stored on the cephfs filesystems? >> >> >> >> When we expand the cluster, we can choose larger >> nvme devices to >> >> allow >> >> larger rocksdb partitions, but is that the right >> way to deal >> >> with this, >> >> or should we adjust some parameters on the >> cluster that will >> >> reduce the >> >> rocksdb size? >> >> >> >> Cheers >> >> >> >> /Simon >> >> _______________________________________________ >> >> ceph-users mailing list -- ceph-users(a)ceph.io >> <mailto:ceph-users@ceph.io> >> >> To unsubscribe send an email to >> ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io> >> >> >> > _______________________________________________ >> > ceph-users mailing list -- ceph-users(a)ceph.io >> <mailto:ceph-users@ceph.io> >> > To unsubscribe send an email to >> ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io> >> _______________________________________________ >> ceph-users mailing list -- ceph-users(a)ceph.io >> <mailto:ceph-users@ceph.io> >> To unsubscribe send an email to ceph-users-leave(a)ceph.io >> <mailto:ceph-users-leave@ceph.io> >>

Seena Fallah

7:06 p.m.

...

Correct. On 8/20/2020 5:15 PM, Seena Fallah wrote: So you won't backport it to nautilus until it gets default to master for a while? On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov <ifedotov(a)suse.de> wrote: > From technical/developer's point of view I don't see any issues with > tuning this option. But since now I wouldn't recommend to enable it in > production as it partially bypassed our regular development cycle. Being > enabled in master for a while by default allows more develpers to use/try > the feature before release. This can be considered as an additional > implicit QA process. But as we just discovered this hasn't happened. > > Hence you can definitely try it but this exposes your cluster(s) to some > risk as for any new (and incompletely tested) feature.... > > > Thanks, > > Igor > > > On 8/20/2020 4:06 PM, Seena Fallah wrote: > > Greate, thanks. > > Is it safe to change it manually in ceph.conf until next nautilus release > or should I wait for the next nautilus release for this change? I mean does > qa run on this value for this config that we could trust and change it or > should we wait until the next nautilus release that qa ran on this value? > > On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov <ifedotov(a)suse.de> wrote: > >> Hi Seena, >> >> this parameter isn't intended to be adjusted in production environments >> - it's supposed that default behavior covers all regular customers' needs. >> >> The issue though is that default setting is invalid. It should be >> 'use_some_extra'. Gonna fix that shortly... >> >> >> Thanks, >> >> Igor >> >> >> >> >> On 8/20/2020 1:44 PM, Seena Fallah wrote: >> >> Hi Igor. >> >> Could you please tell why this config is in LEVEL_DEV ( >> https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe66… >> As it is documented in Ceph we can't use LEVEL_DEV in production >> environments! >> >> Thanks >> >> On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov(a)suse.de> wrote: >> >>> Hi Simon, >>> >>> >>> starting Nautlus v14.2.10 Bluestore is able to use 'wasted' space at DB >>> volume. >>> >>> see this PR: https://github.com/ceph/ceph/pull/29687 >>> >>> Nice overview on the overall BlueFS/RocksDB design can be find here: >>> >>> >>> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b… >>> >>> Which also includes some overview (as well as additional concerns) for >>> changes brought by the above-mentioned PR. >>> >>> >>> Thanks, >>> >>> Igor >>> >>> >>> On 8/20/2020 11:39 AM, Simon Oosthoek wrote: >>> > Hi Michael, >>> > >>> > thanks for the explanation! So if I understand correctly, we waste 93 >>> > GB per OSD on unused NVME space, because only 30GB is actually >>> used...? >>> > >>> > And to improve the space for rocksdb, we need to plan for 300GB per >>> > rocksdb partition in order to benefit from this advantage.... >>> > >>> > Reducing the number of small files is something we always ask of our >>> > users, but reality is what it is ;-) >>> > >>> > I'll have to look into how I can get an informative view on these >>> > metrics... It's pretty overwhelming the amount of information coming >>> > out of the ceph cluster, even when you look only superficially... >>> > >>> > Cheers, >>> > >>> > /Simon >>> > >>> > On 20/08/2020 10:16, Michael Bisig wrote: >>> >> Hi Simon >>> >> >>> >> As far as I know, RocksDB only uses "leveled" space on the NVME >>> >> partition. The values are set to be 300MB, 3GB, 30GB and 300GB. >>> Every >>> >> DB space above such a limit will automatically end up on slow >>> devices. >>> >> In your setup where you have 123GB per OSD that means you only use >>> >> 30GB of fast device. The DB which spills over this limit will be >>> >> offloaded to the HDD and accordingly, it slows down requests and >>> >> compactions. >>> >> >>> >> You can proof what your OSD currently consumes with: >>> >> ceph daemon osd.X perf dump >>> >> >>> >> Informative values are `db_total_bytes`, `db_used_bytes` and >>> >> `slow_used_bytes`. This changes regularly because of the ongoing >>> >> compactions but Prometheus mgr module exports these values such that >>> >> you can track it. >>> >> >>> >> Small files generally leads to bigger RocksDB, especially when you >>> >> use EC, but this depends on the actual amount and file sizes. >>> >> >>> >> I hope this helps. >>> >> Regards, >>> >> Michael >>> >> >>> >> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek(a)science.ru.nl> >>> wrote: >>> >> >>> >> Hi >>> >> >>> >> Recently our ceph cluster (nautilus) is experiencing bluefs >>> >> spillovers, >>> >> just 2 osd's and I disabled the warning for these osds. >>> >> (ceph config set osd.125 bluestore_warn_on_bluefs_spillover >>> false) >>> >> >>> >> I'm wondering what causes this and how this can be prevented. >>> >> >>> >> As I understand it the rocksdb for the OSD needs to store more >>> >> than fits >>> >> on the NVME logical volume (123G for 12T OSD). A way to fix it >>> >> could be >>> >> to increase the logical volume on the nvme (if there was space >>> >> on the >>> >> nvme, which there isn't at the moment). >>> >> >>> >> This is the current size of the cluster and how much is free: >>> >> >>> >> [root@cephmon1 ~]# ceph df >>> >> RAW STORAGE: >>> >> CLASS SIZE AVAIL USED RAW USED >>> >> %RAW USED >>> >> hdd 1.8 PiB 842 TiB 974 TiB 974 >>> >> TiB 53.63 >>> >> TOTAL 1.8 PiB 842 TiB 974 TiB 974 >>> >> TiB 53.63 >>> >> >>> >> POOLS: >>> >> POOL ID STORED OBJECTS USED >>> >> %USED MAX AVAIL >>> >> cephfs_data 1 572 MiB 121.26M 2.4 GiB >>> >> 0 167 TiB >>> >> cephfs_metadata 2 56 GiB 5.15M 57 GiB >>> >> 0 167 TiB >>> >> cephfs_data_3copy 8 201 GiB 51.68k 602 GiB >>> >> 0.09 222 TiB >>> >> cephfs_data_ec83 13 643 TiB 279.75M 953 TiB >>> >> 58.86 485 TiB >>> >> rbd 14 21 GiB 5.66k 64 GiB >>> >> 0 222 TiB >>> >> .rgw.root 15 1.2 KiB 4 1 MiB >>> >> 0 167 TiB >>> >> default.rgw.control 16 0 B 8 0 B >>> >> 0 167 TiB >>> >> default.rgw.meta 17 765 B 4 1 MiB >>> >> 0 167 TiB >>> >> default.rgw.log 18 0 B 207 0 B >>> >> 0 167 TiB >>> >> cephfs_data_ec57 20 433 MiB 230 1.2 GiB >>> >> 0 278 TiB >>> >> >>> >> The amount used can still grow a bit before we need to add >>> >> nodes, but >>> >> apparently we are running into the limits of our rocskdb >>> >> partitions. >>> >> >>> >> Did we choose a parameter (e.g. minimal object size) too small, >>> >> so we >>> >> have too much objects on these spillover OSDs? Or is it that >>> too >>> >> many >>> >> small files are stored on the cephfs filesystems? >>> >> >>> >> When we expand the cluster, we can choose larger nvme devices >>> to >>> >> allow >>> >> larger rocksdb partitions, but is that the right way to deal >>> >> with this, >>> >> or should we adjust some parameters on the cluster that will >>> >> reduce the >>> >> rocksdb size? >>> >> >>> >> Cheers >>> >> >>> >> /Simon >>> >> _______________________________________________ >>> >> ceph-users mailing list -- ceph-users(a)ceph.io >>> >> To unsubscribe send an email to ceph-users-leave(a)ceph.io >>> >> >>> > _______________________________________________ >>> > ceph-users mailing list -- ceph-users(a)ceph.io >>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io >>> _______________________________________________ >>> ceph-users mailing list -- ceph-users(a)ceph.io >>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >>> >>

Igor Fedotov

21 Aug 21 Aug

1:45 a.m.

...

So what do you suggest for a short term solution? (I think you won't backport it to nautilus at least about 6 month) Changing db size is too expensive because I should buy new NVME devices with double size and also redeploy all my OSDs. Manual compaction will still have an impact on performance and doing it for a month doesn't look very good! On Thu, Aug 20, 2020 at 6:52 PM Igor Fedotov <ifedotov(a)suse.de <mailto:ifedotov@suse.de>> wrote: Correct. On 8/20/2020 5:15 PM, Seena Fallah wrote: > So you won't backport it to nautilus until it gets default to > master for a while? > > On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov <ifedotov(a)suse.de > <mailto:ifedotov@suse.de>> wrote: > > From technical/developer's point of view I don't see any > issues with tuning this option. But since now I wouldn't > recommend to enable it in production as it partially bypassed > our regular development cycle. Being enabled in master for a > while by default allows more develpers to use/try the feature > before release. This can be considered as an additional > implicit QA process. But as we just discovered this hasn't > happened. > > Hence you can definitely try it but this exposes your > cluster(s) to some risk as for any new (and incompletely > tested) feature.... > > > Thanks, > > Igor > > > On 8/20/2020 4:06 PM, Seena Fallah wrote: >> Greate, thanks. >> >> Is it safe to change it manually in ceph.conf until next >> nautilus release or should I wait for the next nautilus >> release for this change? I mean does qa run on this value >> for this config that we could trust and change it or should >> we wait until the next nautilus release that qa ran on this >> value? >> >> On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov >> <ifedotov(a)suse.de <mailto:ifedotov@suse.de>> wrote: >> >> Hi Seena, >> >> this parameter isn't intended to be adjusted in >> production environments - it's supposed that default >> behavior covers all regular customers' needs. >> >> The issue though is that default setting is invalid. It >> should be 'use_some_extra'. Gonna fix that shortly... >> >> >> Thanks, >> >> Igor >> >> >> >> >> On 8/20/2020 1:44 PM, Seena Fallah wrote: >>> Hi Igor. >>> >>> Could you please tell why this config is in LEVEL_DEV >>> (https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe66… >>> As it is documented in Ceph we can't use LEVEL_DEV in >>> production environments! >>> >>> Thanks >>> >>> On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov >>> <ifedotov(a)suse.de <mailto:ifedotov@suse.de>> wrote: >>> >>> Hi Simon, >>> >>> >>> starting Nautlus v14.2.10 Bluestore is able to use >>> 'wasted' space at DB >>> volume. >>> >>> see this PR: https://github.com/ceph/ceph/pull/29687 >>> >>> Nice overview on the overall BlueFS/RocksDB design >>> can be find here: >>> >>> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b… >>> >>> Which also includes some overview (as well as >>> additional concerns) for >>> changes brought by the above-mentioned PR. >>> >>> >>> Thanks, >>> >>> Igor >>> >>> >>> On 8/20/2020 11:39 AM, Simon Oosthoek wrote: >>> > Hi Michael, >>> > >>> > thanks for the explanation! So if I understand >>> correctly, we waste 93 >>> > GB per OSD on unused NVME space, because only >>> 30GB is actually used...? >>> > >>> > And to improve the space for rocksdb, we need to >>> plan for 300GB per >>> > rocksdb partition in order to benefit from this >>> advantage.... >>> > >>> > Reducing the number of small files is something >>> we always ask of our >>> > users, but reality is what it is ;-) >>> > >>> > I'll have to look into how I can get an >>> informative view on these >>> > metrics... It's pretty overwhelming the amount of >>> information coming >>> > out of the ceph cluster, even when you look only >>> superficially... >>> > >>> > Cheers, >>> > >>> > /Simon >>> > >>> > On 20/08/2020 10:16, Michael Bisig wrote: >>> >> Hi Simon >>> >> >>> >> As far as I know, RocksDB only uses "leveled" >>> space on the NVME >>> >> partition. The values are set to be 300MB, 3GB, >>> 30GB and 300GB. Every >>> >> DB space above such a limit will automatically >>> end up on slow devices. >>> >> In your setup where you have 123GB per OSD that >>> means you only use >>> >> 30GB of fast device. The DB which spills over >>> this limit will be >>> >> offloaded to the HDD and accordingly, it slows >>> down requests and >>> >> compactions. >>> >> >>> >> You can proof what your OSD currently consumes with: >>> >> ceph daemon osd.X perf dump >>> >> >>> >> Informative values are `db_total_bytes`, >>> `db_used_bytes` and >>> >> `slow_used_bytes`. This changes regularly >>> because of the ongoing >>> >> compactions but Prometheus mgr module exports >>> these values such that >>> >> you can track it. >>> >> >>> >> Small files generally leads to bigger RocksDB, >>> especially when you >>> >> use EC, but this depends on the actual amount >>> and file sizes. >>> >> >>> >> I hope this helps. >>> >> Regards, >>> >> Michael >>> >> >>> >> On 20.08.20, 09:10, "Simon Oosthoek" >>> <s.oosthoek(a)science.ru.nl >>> <mailto:s.oosthoek@science.ru.nl>> wrote: >>> >> >>> >> Hi >>> >> >>> >> Recently our ceph cluster (nautilus) is >>> experiencing bluefs >>> >> spillovers, >>> >> just 2 osd's and I disabled the warning for >>> these osds. >>> >> (ceph config set osd.125 >>> bluestore_warn_on_bluefs_spillover false) >>> >> >>> >> I'm wondering what causes this and how this >>> can be prevented. >>> >> >>> >> As I understand it the rocksdb for the OSD >>> needs to store more >>> >> than fits >>> >> on the NVME logical volume (123G for 12T >>> OSD). A way to fix it >>> >> could be >>> >> to increase the logical volume on the nvme >>> (if there was space >>> >> on the >>> >> nvme, which there isn't at the moment). >>> >> >>> >> This is the current size of the cluster and >>> how much is free: >>> >> >>> >> [root@cephmon1 ~]# ceph df >>> >> RAW STORAGE: >>> >> CLASS SIZE AVAIL USED RAW >>> USED >>> >> %RAW USED >>> >> hdd 1.8 PiB 842 TiB 974 >>> TiB 974 >>> >> TiB 53.63 >>> >> TOTAL 1.8 PiB 842 TiB 974 >>> TiB 974 >>> >> TiB 53.63 >>> >> >>> >> POOLS: >>> >> POOL ID STORED OBJECTS USED >>> >> %USED MAX AVAIL >>> >> cephfs_data 1 572 MiB >>> 121.26M 2.4 GiB >>> >> 0 167 TiB >>> >> cephfs_metadata 2 56 GiB >>> 5.15M 57 GiB >>> >> 0 167 TiB >>> >> cephfs_data_3copy 8 201 GiB >>> 51.68k 602 GiB >>> >> 0.09 222 TiB >>> >> cephfs_data_ec83 13 643 TiB >>> 279.75M 953 TiB >>> >> 58.86 485 TiB >>> >> rbd 14 21 GiB >>> 5.66k 64 GiB >>> >> 0 222 TiB >>> >> .rgw.root 15 1.2 KiB 4 1 MiB >>> >> 0 167 TiB >>> >> default.rgw.control 16 0 B 8 0 B >>> >> 0 167 TiB >>> >> default.rgw.meta 17 765 B 4 1 MiB >>> >> 0 167 TiB >>> >> default.rgw.log 18 0 B >>> 207 0 B >>> >> 0 167 TiB >>> >> cephfs_data_ec57 20 433 MiB >>> 230 1.2 GiB >>> >> 0 278 TiB >>> >> >>> >> The amount used can still grow a bit before >>> we need to add >>> >> nodes, but >>> >> apparently we are running into the limits >>> of our rocskdb >>> >> partitions. >>> >> >>> >> Did we choose a parameter (e.g. minimal >>> object size) too small, >>> >> so we >>> >> have too much objects on these spillover >>> OSDs? Or is it that too >>> >> many >>> >> small files are stored on the cephfs >>> filesystems? >>> >> >>> >> When we expand the cluster, we can choose >>> larger nvme devices to >>> >> allow >>> >> larger rocksdb partitions, but is that the >>> right way to deal >>> >> with this, >>> >> or should we adjust some parameters on the >>> cluster that will >>> >> reduce the >>> >> rocksdb size? >>> >> >>> >> Cheers >>> >> >>> >> /Simon >>> >> _______________________________________________ >>> >> ceph-users mailing list -- >>> ceph-users(a)ceph.io <mailto:ceph-users@ceph.io> >>> >> To unsubscribe send an email to >>> ceph-users-leave(a)ceph.io >>> <mailto:ceph-users-leave@ceph.io> >>> >> >>> > _______________________________________________ >>> > ceph-users mailing list -- ceph-users(a)ceph.io >>> <mailto:ceph-users@ceph.io> >>> > To unsubscribe send an email to >>> ceph-users-leave(a)ceph.io >>> <mailto:ceph-users-leave@ceph.io> >>> _______________________________________________ >>> ceph-users mailing list -- ceph-users(a)ceph.io >>> <mailto:ceph-users@ceph.io> >>> To unsubscribe send an email to >>> ceph-users-leave(a)ceph.io >>> <mailto:ceph-users-leave@ceph.io> >>>

Seena Fallah

2:21 a.m.

...

Honestly I don't have any perfect solution for now. If this is urgent you probably better to proceed with enabling the new DB space management feature. But please do that eventually, modify 1-2 OSDs at the first stage and test them for some period (may be a week or two). Thanks, Igor On 8/20/2020 5:36 PM, Seena Fallah wrote: So what do you suggest for a short term solution? (I think you won't backport it to nautilus at least about 6 month) Changing db size is too expensive because I should buy new NVME devices with double size and also redeploy all my OSDs. Manual compaction will still have an impact on performance and doing it for a month doesn't look very good! On Thu, Aug 20, 2020 at 6:52 PM Igor Fedotov <ifedotov(a)suse.de> wrote: > Correct. > On 8/20/2020 5:15 PM, Seena Fallah wrote: > > So you won't backport it to nautilus until it gets default to master for > a while? > > On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov <ifedotov(a)suse.de> wrote: > >> From technical/developer's point of view I don't see any issues with >> tuning this option. But since now I wouldn't recommend to enable it in >> production as it partially bypassed our regular development cycle. Being >> enabled in master for a while by default allows more develpers to use/try >> the feature before release. This can be considered as an additional >> implicit QA process. But as we just discovered this hasn't happened. >> >> Hence you can definitely try it but this exposes your cluster(s) to some >> risk as for any new (and incompletely tested) feature.... >> >> >> Thanks, >> >> Igor >> >> >> On 8/20/2020 4:06 PM, Seena Fallah wrote: >> >> Greate, thanks. >> >> Is it safe to change it manually in ceph.conf until next nautilus >> release or should I wait for the next nautilus release for this change? I >> mean does qa run on this value for this config that we could trust and >> change it or should we wait until the next nautilus release that qa ran on >> this value? >> >> On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov <ifedotov(a)suse.de> wrote: >> >>> Hi Seena, >>> >>> this parameter isn't intended to be adjusted in production environments >>> - it's supposed that default behavior covers all regular customers' needs. >>> >>> The issue though is that default setting is invalid. It should be >>> 'use_some_extra'. Gonna fix that shortly... >>> >>> >>> Thanks, >>> >>> Igor >>> >>> >>> >>> >>> On 8/20/2020 1:44 PM, Seena Fallah wrote: >>> >>> Hi Igor. >>> >>> Could you please tell why this config is in LEVEL_DEV ( >>> https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe66… >>> As it is documented in Ceph we can't use LEVEL_DEV in production >>> environments! >>> >>> Thanks >>> >>> On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov(a)suse.de> wrote: >>> >>>> Hi Simon, >>>> >>>> >>>> starting Nautlus v14.2.10 Bluestore is able to use 'wasted' space at >>>> DB >>>> volume. >>>> >>>> see this PR: https://github.com/ceph/ceph/pull/29687 >>>> >>>> Nice overview on the overall BlueFS/RocksDB design can be find here: >>>> >>>> >>>> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b… >>>> >>>> Which also includes some overview (as well as additional concerns) for >>>> changes brought by the above-mentioned PR. >>>> >>>> >>>> Thanks, >>>> >>>> Igor >>>> >>>> >>>> On 8/20/2020 11:39 AM, Simon Oosthoek wrote: >>>> > Hi Michael, >>>> > >>>> > thanks for the explanation! So if I understand correctly, we waste >>>> 93 >>>> > GB per OSD on unused NVME space, because only 30GB is actually >>>> used...? >>>> > >>>> > And to improve the space for rocksdb, we need to plan for 300GB per >>>> > rocksdb partition in order to benefit from this advantage.... >>>> > >>>> > Reducing the number of small files is something we always ask of our >>>> > users, but reality is what it is ;-) >>>> > >>>> > I'll have to look into how I can get an informative view on these >>>> > metrics... It's pretty overwhelming the amount of information coming >>>> > out of the ceph cluster, even when you look only superficially... >>>> > >>>> > Cheers, >>>> > >>>> > /Simon >>>> > >>>> > On 20/08/2020 10:16, Michael Bisig wrote: >>>> >> Hi Simon >>>> >> >>>> >> As far as I know, RocksDB only uses "leveled" space on the NVME >>>> >> partition. The values are set to be 300MB, 3GB, 30GB and 300GB. >>>> Every >>>> >> DB space above such a limit will automatically end up on slow >>>> devices. >>>> >> In your setup where you have 123GB per OSD that means you only use >>>> >> 30GB of fast device. The DB which spills over this limit will be >>>> >> offloaded to the HDD and accordingly, it slows down requests and >>>> >> compactions. >>>> >> >>>> >> You can proof what your OSD currently consumes with: >>>> >> ceph daemon osd.X perf dump >>>> >> >>>> >> Informative values are `db_total_bytes`, `db_used_bytes` and >>>> >> `slow_used_bytes`. This changes regularly because of the ongoing >>>> >> compactions but Prometheus mgr module exports these values such >>>> that >>>> >> you can track it. >>>> >> >>>> >> Small files generally leads to bigger RocksDB, especially when you >>>> >> use EC, but this depends on the actual amount and file sizes. >>>> >> >>>> >> I hope this helps. >>>> >> Regards, >>>> >> Michael >>>> >> >>>> >> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek(a)science.ru.nl> >>>> wrote: >>>> >> >>>> >> Hi >>>> >> >>>> >> Recently our ceph cluster (nautilus) is experiencing bluefs >>>> >> spillovers, >>>> >> just 2 osd's and I disabled the warning for these osds. >>>> >> (ceph config set osd.125 bluestore_warn_on_bluefs_spillover >>>> false) >>>> >> >>>> >> I'm wondering what causes this and how this can be prevented. >>>> >> >>>> >> As I understand it the rocksdb for the OSD needs to store more >>>> >> than fits >>>> >> on the NVME logical volume (123G for 12T OSD). A way to fix it >>>> >> could be >>>> >> to increase the logical volume on the nvme (if there was space >>>> >> on the >>>> >> nvme, which there isn't at the moment). >>>> >> >>>> >> This is the current size of the cluster and how much is free: >>>> >> >>>> >> [root@cephmon1 ~]# ceph df >>>> >> RAW STORAGE: >>>> >> CLASS SIZE AVAIL USED RAW >>>> USED >>>> >> %RAW USED >>>> >> hdd 1.8 PiB 842 TiB 974 TiB 974 >>>> >> TiB 53.63 >>>> >> TOTAL 1.8 PiB 842 TiB 974 TiB 974 >>>> >> TiB 53.63 >>>> >> >>>> >> POOLS: >>>> >> POOL ID STORED OBJECTS USED >>>> >> %USED MAX AVAIL >>>> >> cephfs_data 1 572 MiB 121.26M 2.4 GiB >>>> >> 0 167 TiB >>>> >> cephfs_metadata 2 56 GiB 5.15M 57 GiB >>>> >> 0 167 TiB >>>> >> cephfs_data_3copy 8 201 GiB 51.68k 602 GiB >>>> >> 0.09 222 TiB >>>> >> cephfs_data_ec83 13 643 TiB 279.75M 953 TiB >>>> >> 58.86 485 TiB >>>> >> rbd 14 21 GiB 5.66k 64 GiB >>>> >> 0 222 TiB >>>> >> .rgw.root 15 1.2 KiB 4 1 MiB >>>> >> 0 167 TiB >>>> >> default.rgw.control 16 0 B 8 0 B >>>> >> 0 167 TiB >>>> >> default.rgw.meta 17 765 B 4 1 MiB >>>> >> 0 167 TiB >>>> >> default.rgw.log 18 0 B 207 0 B >>>> >> 0 167 TiB >>>> >> cephfs_data_ec57 20 433 MiB 230 1.2 GiB >>>> >> 0 278 TiB >>>> >> >>>> >> The amount used can still grow a bit before we need to add >>>> >> nodes, but >>>> >> apparently we are running into the limits of our rocskdb >>>> >> partitions. >>>> >> >>>> >> Did we choose a parameter (e.g. minimal object size) too >>>> small, >>>> >> so we >>>> >> have too much objects on these spillover OSDs? Or is it that >>>> too >>>> >> many >>>> >> small files are stored on the cephfs filesystems? >>>> >> >>>> >> When we expand the cluster, we can choose larger nvme devices >>>> to >>>> >> allow >>>> >> larger rocksdb partitions, but is that the right way to deal >>>> >> with this, >>>> >> or should we adjust some parameters on the cluster that will >>>> >> reduce the >>>> >> rocksdb size? >>>> >> >>>> >> Cheers >>>> >> >>>> >> /Simon >>>> >> _______________________________________________ >>>> >> ceph-users mailing list -- ceph-users(a)ceph.io >>>> >> To unsubscribe send an email to ceph-users-leave(a)ceph.io >>>> >> >>>> > _______________________________________________ >>>> > ceph-users mailing list -- ceph-users(a)ceph.io >>>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io >>>> _______________________________________________ >>>> ceph-users mailing list -- ceph-users(a)ceph.io >>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >>>> >>>

Igor Fedotov

5:19 p.m.

Can't say nothing about "write_buffer_size" tuning.. Never tried that. But I presume that these are *"max_bytes_for_level_base*" and *"**max_bytes_for_level_multiplier*" params which rather should be tuned to modify RocksDB level granularity. But I have no ideas how safe this is in a production environment. Thanks, Igor On 8/21/2020 12:51 AM, Seena Fallah wrote:

...

Ok thanks. And also as you mentioned in the doc you shared from cloudferro, It's not good to change `write_buffer_size` for bluestore rocksdb to fit our db? On Fri, Aug 21, 2020 at 1:46 AM Igor Fedotov <ifedotov(a)suse.de <mailto:ifedotov@suse.de>> wrote: Honestly I don't have any perfect solution for now. If this is urgent you probably better to proceed with enabling the new DB space management feature. But please do that eventually, modify 1-2 OSDs at the first stage and test them for some period (may be a week or two). Thanks, Igor On 8/20/2020 5:36 PM, Seena Fallah wrote: > So what do you suggest for a short term solution? (I think you > won't backport it to nautilus at least about 6 month) > > Changing db size is too expensive because I should buy new NVME > devices with double size and also redeploy all my OSDs. > Manual compaction will still have an impact on performance and > doing it for a month doesn't look very good! > > On Thu, Aug 20, 2020 at 6:52 PM Igor Fedotov <ifedotov(a)suse.de > <mailto:ifedotov@suse.de>> wrote: > > Correct. > > On 8/20/2020 5:15 PM, Seena Fallah wrote: >> So you won't backport it to nautilus until it gets >> default to master for a while? >> >> On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov >> <ifedotov(a)suse.de <mailto:ifedotov@suse.de>> wrote: >> >> From technical/developer's point of view I don't see any >> issues with tuning this option. But since now I wouldn't >> recommend to enable it in production as it partially >> bypassed our regular development cycle. Being enabled in >> master for a while by default allows more develpers to >> use/try the feature before release. This can be >> considered as an additional implicit QA process. But as >> we just discovered this hasn't happened. >> >> Hence you can definitely try it but this exposes your >> cluster(s) to some risk as for any new (and incompletely >> tested) feature.... >> >> >> Thanks, >> >> Igor >> >> >> On 8/20/2020 4:06 PM, Seena Fallah wrote: >>> Greate, thanks. >>> >>> Is it safe to change it manually in ceph.conf >>> until next nautilus release or should I wait for the >>> next nautilus release for this change? I mean does qa >>> run on this value for this config that we could trust >>> and change it or should we wait until the next nautilus >>> release that qa ran on this value? >>> >>> On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov >>> <ifedotov(a)suse.de <mailto:ifedotov@suse.de>> wrote: >>> >>> Hi Seena, >>> >>> this parameter isn't intended to be adjusted in >>> production environments - it's supposed that >>> default behavior covers all regular customers' needs. >>> >>> The issue though is that default setting is >>> invalid. It should be 'use_some_extra'. Gonna fix >>> that shortly... >>> >>> >>> Thanks, >>> >>> Igor >>> >>> >>> >>> >>> On 8/20/2020 1:44 PM, Seena Fallah wrote: >>>> Hi Igor. >>>> >>>> Could you please tell why this config is in >>>> LEVEL_DEV >>>> (https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe66… >>>> As it is documented in Ceph we can't use LEVEL_DEV >>>> in production environments! >>>> >>>> Thanks >>>> >>>> On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov >>>> <ifedotov(a)suse.de <mailto:ifedotov@suse.de>> wrote: >>>> >>>> Hi Simon, >>>> >>>> >>>> starting Nautlus v14.2.10 Bluestore is able to >>>> use 'wasted' space at DB >>>> volume. >>>> >>>> see this PR: >>>> https://github.com/ceph/ceph/pull/29687 >>>> >>>> Nice overview on the overall BlueFS/RocksDB >>>> design can be find here: >>>> >>>> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b… >>>> >>>> Which also includes some overview (as well as >>>> additional concerns) for >>>> changes brought by the above-mentioned PR. >>>> >>>> >>>> Thanks, >>>> >>>> Igor >>>> >>>> >>>> On 8/20/2020 11:39 AM, Simon Oosthoek wrote: >>>> > Hi Michael, >>>> > >>>> > thanks for the explanation! So if I >>>> understand correctly, we waste 93 >>>> > GB per OSD on unused NVME space, because >>>> only 30GB is actually used...? >>>> > >>>> > And to improve the space for rocksdb, we >>>> need to plan for 300GB per >>>> > rocksdb partition in order to benefit from >>>> this advantage.... >>>> > >>>> > Reducing the number of small files is >>>> something we always ask of our >>>> > users, but reality is what it is ;-) >>>> > >>>> > I'll have to look into how I can get an >>>> informative view on these >>>> > metrics... It's pretty overwhelming the >>>> amount of information coming >>>> > out of the ceph cluster, even when you look >>>> only superficially... >>>> > >>>> > Cheers, >>>> > >>>> > /Simon >>>> > >>>> > On 20/08/2020 10:16, Michael Bisig wrote: >>>> >> Hi Simon >>>> >> >>>> >> As far as I know, RocksDB only uses >>>> "leveled" space on the NVME >>>> >> partition. The values are set to be 300MB, >>>> 3GB, 30GB and 300GB. Every >>>> >> DB space above such a limit will >>>> automatically end up on slow devices. >>>> >> In your setup where you have 123GB per OSD >>>> that means you only use >>>> >> 30GB of fast device. The DB which spills >>>> over this limit will be >>>> >> offloaded to the HDD and accordingly, it >>>> slows down requests and >>>> >> compactions. >>>> >> >>>> >> You can proof what your OSD currently >>>> consumes with: >>>> >> ceph daemon osd.X perf dump >>>> >> >>>> >> Informative values are `db_total_bytes`, >>>> `db_used_bytes` and >>>> >> `slow_used_bytes`. This changes regularly >>>> because of the ongoing >>>> >> compactions but Prometheus mgr module >>>> exports these values such that >>>> >> you can track it. >>>> >> >>>> >> Small files generally leads to bigger >>>> RocksDB, especially when you >>>> >> use EC, but this depends on the actual >>>> amount and file sizes. >>>> >> >>>> >> I hope this helps. >>>> >> Regards, >>>> >> Michael >>>> >> >>>> >> On 20.08.20, 09:10, "Simon Oosthoek" >>>> <s.oosthoek(a)science.ru.nl >>>> <mailto:s.oosthoek@science.ru.nl>> wrote: >>>> >> >>>> >> Hi >>>> >> >>>> >> Recently our ceph cluster (nautilus) >>>> is experiencing bluefs >>>> >> spillovers, >>>> >> just 2 osd's and I disabled the >>>> warning for these osds. >>>> >> (ceph config set osd.125 >>>> bluestore_warn_on_bluefs_spillover false) >>>> >> >>>> >> I'm wondering what causes this and how >>>> this can be prevented. >>>> >> >>>> >> As I understand it the rocksdb for the >>>> OSD needs to store more >>>> >> than fits >>>> >> on the NVME logical volume (123G for >>>> 12T OSD). A way to fix it >>>> >> could be >>>> >> to increase the logical volume on the >>>> nvme (if there was space >>>> >> on the >>>> >> nvme, which there isn't at the moment). >>>> >> >>>> >> This is the current size of the >>>> cluster and how much is free: >>>> >> >>>> >> [root@cephmon1 ~]# ceph df >>>> >> RAW STORAGE: >>>> >> CLASS SIZE AVAIL >>>> USED RAW USED >>>> >> %RAW USED >>>> >> hdd 1.8 PiB 842 TiB 974 >>>> TiB 974 >>>> >> TiB 53.63 >>>> >> TOTAL 1.8 PiB 842 TiB 974 >>>> TiB 974 >>>> >> TiB 53.63 >>>> >> >>>> >> POOLS: >>>> >> POOL ID STORED >>>> OBJECTS USED >>>> >> %USED MAX AVAIL >>>> >> cephfs_data 1 572 MiB >>>> 121.26M 2.4 GiB >>>> >> 0 167 TiB >>>> >> cephfs_metadata 2 56 GiB 5.15M 57 GiB >>>> >> 0 167 TiB >>>> >> cephfs_data_3copy 8 201 GiB >>>> 51.68k 602 GiB >>>> >> 0.09 222 TiB >>>> >> cephfs_data_ec83 13 643 TiB >>>> 279.75M 953 TiB >>>> >> 58.86 485 TiB >>>> >> rbd 14 21 GiB 5.66k 64 GiB >>>> >> 0 222 TiB >>>> >> .rgw.root 15 1.2 KiB 4 >>>> 1 MiB >>>> >> 0 167 TiB >>>> >> default.rgw.control 16 0 B >>>> 8 0 B >>>> >> 0 167 TiB >>>> >> default.rgw.meta 17 765 B 4 1 MiB >>>> >> 0 167 TiB >>>> >> default.rgw.log 18 0 B 207 0 B >>>> >> 0 167 TiB >>>> >> cephfs_data_ec57 20 433 MiB >>>> 230 1.2 GiB >>>> >> 0 278 TiB >>>> >> >>>> >> The amount used can still grow a bit >>>> before we need to add >>>> >> nodes, but >>>> >> apparently we are running into the >>>> limits of our rocskdb >>>> >> partitions. >>>> >> >>>> >> Did we choose a parameter (e.g. >>>> minimal object size) too small, >>>> >> so we >>>> >> have too much objects on these >>>> spillover OSDs? Or is it that too >>>> >> many >>>> >> small files are stored on the cephfs >>>> filesystems? >>>> >> >>>> >> When we expand the cluster, we can >>>> choose larger nvme devices to >>>> >> allow >>>> >> larger rocksdb partitions, but is that >>>> the right way to deal >>>> >> with this, >>>> >> or should we adjust some parameters on >>>> the cluster that will >>>> >> reduce the >>>> >> rocksdb size? >>>> >> >>>> >> Cheers >>>> >> >>>> >> /Simon >>>> >> _______________________________________________ >>>> >> ceph-users mailing list -- >>>> ceph-users(a)ceph.io <mailto:ceph-users@ceph.io> >>>> >> To unsubscribe send an email to >>>> ceph-users-leave(a)ceph.io >>>> <mailto:ceph-users-leave@ceph.io> >>>> >> >>>> > _______________________________________________ >>>> > ceph-users mailing list -- >>>> ceph-users(a)ceph.io <mailto:ceph-users@ceph.io> >>>> > To unsubscribe send an email to >>>> ceph-users-leave(a)ceph.io >>>> <mailto:ceph-users-leave@ceph.io> >>>> _______________________________________________ >>>> ceph-users mailing list -- ceph-users(a)ceph.io >>>> <mailto:ceph-users@ceph.io> >>>> To unsubscribe send an email to >>>> ceph-users-leave(a)ceph.io >>>> <mailto:ceph-users-leave@ceph.io> >>>>

1338

days inactive

1339

days old

ceph-users@ceph.io

Manage subscription

15 comments

4 participants

tags (0)

participants (4)

Igor Fedotov
Michael Bisig
Seena Fallah
Simon Oosthoek