From a technical/developer's point of view I don't see any issues with
tuning this option. But for now I wouldn't recommend enabling it in
production, as it has partially bypassed our regular development cycle.
Being enabled in master by default for a while lets more developers
use/try a feature before release, which can be considered an
additional, implicit QA process. But as we just discovered, that hasn't
happened here.
Hence you can definitely try it, but doing so exposes your cluster(s)
to some risk, as with any new (and incompletely tested) feature...
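
If you do decide to try it, a minimal sketch of the change (assuming
the option in question is bluestore_volume_selection_policy from the
PR referenced below - do double-check the name) - the OSDs need a
restart to pick it up:

    [osd]
    bluestore_volume_selection_policy = use_some_extra
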
Thanks,
Igor
On 8/20/2020 4:06 PM, Seena Fallah wrote:
Great, thanks.
Is it safe to change it manually in ceph.conf until the next Nautilus
release, or should I wait for the next Nautilus release that includes
this change? In other words, has QA been run with this value for this
config, so that we can trust it and change it now, or should we wait
for the release where QA has run with this value?
On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov <ifedotov@suse.de> wrote:
Hi Seena,
this parameter isn't intended to be adjusted in production
environments - the assumption is that the default behavior covers all
regular customers' needs.
The issue, though, is that the default setting is invalid. It should be
'use_some_extra'. Gonna fix that shortly...
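
In the meantime you can check the level and the current default for
yourself (again assuming the option name,
bluestore_volume_selection_policy):

    ceph config help bluestore_volume_selection_policy
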
Thanks,
Igor
On 8/20/2020 1:44 PM, Seena Fallah wrote:
> Hi Igor.
>
> Could you please tell me why this config is at LEVEL_DEV
> (https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe66…)?
> As documented in Ceph, we can't use LEVEL_DEV options in production
> environments!
>
> Thanks
>
> On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov@suse.de> wrote:
>
> Hi Simon,
>
>
> starting with Nautilus v14.2.10, BlueStore is able to use 'wasted'
> space at the DB volume.
>
> see this PR: https://github.com/ceph/ceph/pull/29687
>
> A nice overview of the overall BlueFS/RocksDB design can be
> found here:
>
> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b…
>
> It also includes an overview of (as well as some additional
> concerns about) the changes brought by the above-mentioned PR.
>
>
> Thanks,
>
> Igor
>
>
> On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
> > Hi Michael,
> >
> > thanks for the explanation! So if I understand correctly, we waste
> > 93 GB per OSD on unused NVMe space, because only 30 GB is actually
> > used...?
> >
> > And to improve the space for RocksDB, we need to plan for 300 GB
> > per RocksDB partition in order to benefit from this advantage....
> >
> > Reducing the number of small files is something we always ask of
> > our users, but reality is what it is ;-)
> >
> > I'll have to look into how I can get an informative view of these
> > metrics... The amount of information coming out of the ceph
> > cluster is pretty overwhelming, even when you only look
> > superficially...
> >
> > Cheers,
> >
> > /Simon
> >
> > On 20/08/2020 10:16, Michael Bisig wrote:
> >> Hi Simon
> >>
> >> As far as I know, RocksDB only uses "leveled" space on the NVMe
> >> partition. The level sizes are set to be 300 MB, 3 GB, 30 GB and
> >> 300 GB. Any DB space above such a limit will automatically end up
> >> on the slow device. In your setup, where you have 123 GB per OSD,
> >> that means you only use 30 GB of the fast device. The DB which
> >> spills over this limit will be offloaded to the HDD and,
> >> accordingly, it slows down requests and compactions.
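> >>
> >> A quick way to spot OSDs that have already spilled over (assuming
> >> the spillover warning hasn't been muted) is:
> >>
> >>     ceph health detail | grep -i spillover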
> >>
> >> You can check what your OSD currently consumes with:
> >>     ceph daemon osd.X perf dump
> >>
> >> Informative values are `db_total_bytes`, `db_used_bytes` and
> >> `slow_used_bytes`. These change regularly because of the ongoing
> >> compactions, but the Prometheus mgr module exports these values
> >> so that you can track them over time.
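> >>
> >> For example, a rough sketch (assuming jq is installed and that
> >> these counters live in the "bluefs" section of the dump):
> >>
> >>     ceph daemon osd.X perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'
> >>
> >> A non-zero slow_used_bytes means that OSD is already spilling
> >> onto the slow device.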
> >>
> >> Small files generally lead to a bigger RocksDB, especially when
> >> you use EC, but this depends on the actual number and sizes of
> >> the files.
> >>
> >> I hope this helps.
> >> Regards,
> >> Michael
> >>
> >> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek@science.ru.nl> wrote:
> >>
> >> Hi
> >>
> >> Recently our ceph cluster (nautilus) has been experiencing BlueFS
> >> spillovers on just 2 OSDs, and I disabled the warning for these
> >> OSDs:
> >>     (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
> >>
> >> I'm wondering what causes this and how it can be prevented.
> >>
> >> As I understand it, the RocksDB for the OSD needs to store more
> >> than fits on the NVMe logical volume (123 G for a 12 T OSD). A
> >> way to fix it could be to increase the logical volume on the NVMe
> >> (if there were space on the NVMe, which there isn't at the moment).
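> >>
> >> If there were space, I believe the procedure would be roughly the
> >> following - a sketch with hypothetical VG/LV names, untested here:
> >>
> >>     systemctl stop ceph-osd@125
> >>     lvextend -L +100G /dev/ceph-db/db-125
> >>     ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-125
> >>     systemctl start ceph-osd@125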
> >>
> >> This is the current size of the cluster and how much is free:
> >>
> >> [root@cephmon1 ~]# ceph df
> >> RAW STORAGE:
> >>     CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
> >>     hdd       1.8 PiB     842 TiB     974 TiB      974 TiB         53.63
> >>     TOTAL     1.8 PiB     842 TiB     974 TiB      974 TiB         53.63
> >>
> >> POOLS:
> >>     POOL                  ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
> >>     cephfs_data            1     572 MiB     121.26M     2.4 GiB         0       167 TiB
> >>     cephfs_metadata        2      56 GiB       5.15M      57 GiB         0       167 TiB
> >>     cephfs_data_3copy      8     201 GiB      51.68k     602 GiB      0.09       222 TiB
> >>     cephfs_data_ec83      13     643 TiB     279.75M     953 TiB     58.86       485 TiB
> >>     rbd                   14      21 GiB       5.66k      64 GiB         0       222 TiB
> >>     .rgw.root             15     1.2 KiB           4       1 MiB         0       167 TiB
> >>     default.rgw.control   16         0 B           8         0 B         0       167 TiB
> >>     default.rgw.meta      17       765 B           4       1 MiB         0       167 TiB
> >>     default.rgw.log       18         0 B         207         0 B         0       167 TiB
> >>     cephfs_data_ec57      20     433 MiB         230     1.2 GiB         0       278 TiB
> >>
> >> The amount used can still grow a bit before we need to add nodes,
> >> but apparently we are already running into the limits of our
> >> RocksDB partitions.
> >>
> >> Did we choose a parameter (e.g. minimum object size) that is too
> >> small, so that we have too many objects on these spillover OSDs?
> >> Or is it that too many small files are stored on the cephfs
> >> filesystems?
> >>
> >> When we expand the cluster we can choose larger NVMe devices to
> >> allow for larger RocksDB partitions, but is that the right way to
> >> deal with this, or should we adjust some parameters on the
> >> cluster that will reduce the RocksDB size?
> >>
> >> Cheers
> >>
> >> /Simon