I can't say anything about "write_buffer_size" tuning, never tried that.
But I presume it's rather the "max_bytes_for_level_base" and
"max_bytes_for_level_multiplier" params that should be tuned to modify
RocksDB level granularity.
But I have no idea how safe this is in a production environment.
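
If you do want to experiment, those options are passed through
bluestore_rocksdb_options in ceph.conf. Purely as an illustration (the
values below are invented for the example, not a recommendation):

    [osd]
    # note: this replaces the built-in default option string, so in
    # practice you would copy the current default and adjust only
    # these two fields
    bluestore_rocksdb_options = "max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8"

OSDs need a restart (and ideally a manual compaction) before the new
level sizing fully takes effect.
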
Thanks,
Igor
On 8/21/2020 12:51 AM, Seena Fallah wrote:
OK, thanks. Also, regarding the doc you shared from CloudFerro: is it a
bad idea to change `write_buffer_size` for the BlueStore RocksDB to fit
our DB?
On Fri, Aug 21, 2020 at 1:46 AM Igor Fedotov <ifedotov@suse.de> wrote:
Honestly, I don't have any perfect solution for now.
If this is urgent, you'd probably better proceed with enabling the
new DB space management feature.
But please do that gradually: modify 1-2 OSDs at the first stage
and test them for some period (maybe a week or two).
Thanks,
Igor
On 8/20/2020 5:36 PM, Seena Fallah wrote:
> So what do you suggest as a short-term solution? (I think you
> won't backport it to Nautilus for at least about 6 months.)
>
> Changing the DB size is too expensive, because I would have to buy new
> NVMe devices with double the size and also redeploy all my OSDs.
> Manual compaction will still have an impact on performance, and
> doing it for a month doesn't look very good!
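>
> (By manual compaction I mean something like the OSD admin socket
> command below - as far as I know this is how to trigger it, but
> please correct me if I have the command wrong:
>
>     ceph daemon osd.0 compact
>
> run against each OSD from its host.)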
>
> On Thu, Aug 20, 2020 at 6:52 PM Igor Fedotov <ifedotov@suse.de> wrote:
>
> Correct.
>
> On 8/20/2020 5:15 PM, Seena Fallah wrote:
>> So you won't backport it to Nautilus until it has been the default
>> in master for a while?
>>
>> On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov <ifedotov@suse.de> wrote:
>>
>> From a technical/developer's point of view I don't see any
>> issues with tuning this option. But for now I wouldn't
>> recommend enabling it in production, as it partially
>> bypassed our regular development cycle. Being enabled in
>> master by default for a while allows more developers to
>> use/try the feature before release. This can be
>> considered an additional, implicit QA process. But as
>> we just discovered, this hasn't happened.
>>
>> Hence you can definitely try it, but this exposes your
>> cluster(s) to some risk, as with any new (and incompletely
>> tested) feature....
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>> On 8/20/2020 4:06 PM, Seena Fallah wrote:
>>> Great, thanks.
>>>
>>> Is it safe to change it manually in ceph.conf now, or should I
>>> wait for the next Nautilus release? I mean, has QA been run with
>>> this value for this config, so that we can trust it and change it,
>>> or should we wait until the next Nautilus release where QA has
>>> run with it?
>>>
>>> On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov <ifedotov@suse.de> wrote:
>>>
>>> Hi Seena,
>>>
>>> this parameter isn't intended to be adjusted in
>>> production environments - the assumption is that the
>>> default behavior covers all regular customers' needs.
>>>
>>> The issue, though, is that the default setting is
>>> invalid. It should be 'use_some_extra'. I'm going to fix
>>> that shortly...
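>>>
>>> (If I read the PR right, the option in question is
>>> bluestore_volume_selection_policy; once the fix lands, switching
>>> an OSD over would look something like
>>>
>>>     ceph config set osd.0 bluestore_volume_selection_policy use_some_extra
>>>
>>> followed by an OSD restart - but treat this as a sketch, not a
>>> tested recipe.)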
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>>
>>>
>>> On 8/20/2020 1:44 PM, Seena Fallah wrote:
>>>> Hi Igor.
>>>>
>>>> Could you please tell us why this config is set to LEVEL_DEV
>>>> (https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe66…)?
>>>> As documented in Ceph, we can't use LEVEL_DEV
>>>> in production environments!
>>>>
>>>> Thanks
>>>>
>>>> On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov@suse.de> wrote:
>>>>
>>>> Hi Simon,
>>>>
>>>>
>>>> starting with Nautilus v14.2.10, BlueStore is able to
>>>> use the 'wasted' space at the DB volume.
>>>>
>>>> See this PR:
>>>>
>>>> https://github.com/ceph/ceph/pull/29687
>>>>
>>>> A nice overview of the overall BlueFS/RocksDB
>>>> design can be found here:
>>>>
>>>> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b…
>>>>
>>>> It also includes some overview of (as well as
>>>> additional concerns about) the changes brought
>>>> by the above-mentioned PR.
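>>>>
>>>> (To quickly see which OSDs are currently spilling over, something
>>>> like
>>>>
>>>>     ceph health detail | grep -i spillover
>>>>
>>>> should list the affected OSDs, as long as the warning hasn't been
>>>> disabled for them.)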
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>>
>>>>
>>>> On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
>>>> > Hi Michael,
>>>> >
>>>> > thanks for the explanation! So if I understand correctly, we
>>>> > waste 93 GB per OSD on unused NVMe space, because only 30 GB is
>>>> > actually used...?
>>>> >
>>>> > And to improve the space for RocksDB, we need to plan for 300 GB
>>>> > per RocksDB partition in order to benefit from this advantage....
>>>> >
>>>> > Reducing the number of small files is something we always ask of
>>>> > our users, but reality is what it is ;-)
>>>> >
>>>> > I'll have to look into how I can get an informative view of these
>>>> > metrics... The amount of information coming out of the Ceph
>>>> > cluster is pretty overwhelming, even when you only look at it
>>>> > superficially...
>>>> >
>>>> > Cheers,
>>>> >
>>>> > /Simon
>>>> >
>>>> > On 20/08/2020 10:16, Michael Bisig wrote:
>>>> >> Hi Simon
>>>> >>
>>>> >> As far as I know, RocksDB only uses "leveled" space on the NVMe
>>>> >> partition. The level sizes are 300 MB, 3 GB, 30 GB and 300 GB.
>>>> >> Any DB space above such a limit will automatically end up on the
>>>> >> slow devices. In your setup, where you have 123 GB per OSD, that
>>>> >> means you only use 30 GB of the fast device. The part of the DB
>>>> >> that spills over this limit will be offloaded to the HDD, which
>>>> >> accordingly slows down requests and compactions.
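>>>> >>
>>>> >> (The level sizing follows roughly
>>>> >>
>>>> >>     size(L_n) = max_bytes_for_level_base * max_bytes_for_level_multiplier^(n-1)
>>>> >>
>>>> >> and with the common defaults of base ~= 256 MB and multiplier =
>>>> >> 10 that gives roughly 256 MB, 2.5 GB, 25 GB and 250 GB per
>>>> >> level, usually rounded to the 300 MB / 3 GB / 30 GB / 300 GB
>>>> >> figures above. A DB partition only helps up to the largest
>>>> >> level boundary it can fully hold - hence the 30 GB used out of
>>>> >> 123 GB in your case.)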
>>>> >>
>>>> >> You can check what your OSDs currently consume with:
>>>> >> ceph daemon osd.X perf dump
>>>> >>
>>>> >> Informative values are `db_total_bytes`, `db_used_bytes` and
>>>> >> `slow_used_bytes`. These change regularly because of the ongoing
>>>> >> compactions, but the Prometheus mgr module exports these values
>>>> >> so that you can track them over time.
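>>>> >>
>>>> >> (If you have jq available, something along these lines pulls
>>>> >> out just those counters from the perf dump - a convenience
>>>> >> sketch, adjust the OSD id to taste:
>>>> >>
>>>> >>     ceph daemon osd.0 perf dump \
>>>> >>         | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'
>>>> >> )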
>>>> >>
>>>> >> Small files generally lead to a bigger RocksDB, especially when
>>>> >> you use EC, but this depends on the actual amount and file sizes.
>>>> >>
>>>> >> I hope this helps.
>>>> >> Regards,
>>>> >> Michael
>>>> >>
>>>> >> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek@science.ru.nl> wrote:
>>>> >>
>>>> >> Hi
>>>> >>
>>>> >> Recently our Ceph cluster (Nautilus) has been experiencing
>>>> >> BlueFS spillovers, on just 2 OSDs so far, and I disabled the
>>>> >> warning for these OSDs:
>>>> >> (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
>>>> >>
>>>> >> I'm wondering what causes this and how it can be prevented.
>>>> >>
>>>> >> As I understand it, the RocksDB for the OSD needs to store more
>>>> >> than fits on the NVMe logical volume (123 GB for a 12 TB OSD).
>>>> >> A way to fix it could be to increase the logical volume on the
>>>> >> NVMe (if there were space on the NVMe, which there isn't at the
>>>> >> moment).
>>>> >>
>>>> >> This is the current size of the cluster and how much is free:
>>>> >>
>>>> >> [root@cephmon1 ~]# ceph df
>>>> >> RAW STORAGE:
>>>> >>     CLASS    SIZE       AVAIL      USED       RAW USED    %RAW USED
>>>> >>     hdd      1.8 PiB    842 TiB    974 TiB    974 TiB     53.63
>>>> >>     TOTAL    1.8 PiB    842 TiB    974 TiB    974 TiB     53.63
>>>> >>
>>>> >> POOLS:
>>>> >>     POOL                  ID    STORED     OBJECTS    USED       %USED    MAX AVAIL
>>>> >>     cephfs_data            1    572 MiB    121.26M    2.4 GiB     0       167 TiB
>>>> >>     cephfs_metadata        2    56 GiB     5.15M      57 GiB      0       167 TiB
>>>> >>     cephfs_data_3copy      8    201 GiB    51.68k     602 GiB     0.09    222 TiB
>>>> >>     cephfs_data_ec83      13    643 TiB    279.75M    953 TiB    58.86    485 TiB
>>>> >>     rbd                   14    21 GiB     5.66k      64 GiB      0       222 TiB
>>>> >>     .rgw.root             15    1.2 KiB    4          1 MiB       0       167 TiB
>>>> >>     default.rgw.control   16    0 B        8          0 B         0       167 TiB
>>>> >>     default.rgw.meta      17    765 B      4          1 MiB       0       167 TiB
>>>> >>     default.rgw.log       18    0 B        207        0 B         0       167 TiB
>>>> >>     cephfs_data_ec57      20    433 MiB    230        1.2 GiB     0       278 TiB
>>>> >>
>>>> >> The amount used can still grow a bit before we need to add
>>>> >> nodes, but apparently we are running into the limits of our
>>>> >> RocksDB partitions.
>>>> >>
>>>> >> Did we choose a parameter (e.g. minimal object size) too small,
>>>> >> so that we have too many objects on these spillover OSDs? Or is
>>>> >> it that too many small files are stored on the CephFS
>>>> >> filesystems?
>>>> >>
>>>> >> When we expand the cluster, we can choose larger NVMe devices
>>>> >> to allow larger RocksDB partitions, but is that the right way
>>>> >> to deal with this, or should we adjust some parameters on the
>>>> >> cluster to reduce the RocksDB size?
>>>> >>
>>>> >> Cheers
>>>> >>
>>>> >> /Simon