Honestly I don't have a perfect solution for now.
If this is urgent, your best option is probably to proceed with enabling
the new DB space management feature.
But please do that gradually: modify 1-2 OSDs at the first stage and
test them for some period (maybe a week or two).
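
For example, on two OSDs only (the option name here is the one
introduced by the PR referenced below - please double-check it against
your build before relying on it):

  # enable the new space management policy on a couple of OSDs only
  ceph config set osd.0 bluestore_volume_selection_policy use_some_extra
  ceph config set osd.1 bluestore_volume_selection_policy use_some_extra
  # restart those OSDs so the new policy takes effect

Then watch their perf counters and logs for a week or two before
rolling it out further.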
Thanks,
Igor
On 8/20/2020 5:36 PM, Seena Fallah wrote:
So what do you suggest for a short-term solution? (I think you won't
backport it to nautilus for at least about 6 months.)
Changing the DB size is too expensive, because I would have to buy new
NVMe devices with double the size and also redeploy all my OSDs.
Manual compaction will still have an impact on performance, and doing
it for months doesn't look very good!
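
(By manual compaction I mean something like the following - the OSD id
and path are just examples:)

  # online, via the admin socket
  ceph daemon osd.0 compact
  # or offline, with the OSD stopped
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 compact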
On Thu, Aug 20, 2020 at 6:52 PM Igor Fedotov <ifedotov@suse.de> wrote:
Correct.
On 8/20/2020 5:15 PM, Seena Fallah wrote:
> So you won't backport it to nautilus until it has been the default in
> master for a while?
>
> On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov <ifedotov@suse.de> wrote:
>
> From a technical/developer's point of view I don't see any issues
> with tuning this option. But for now I wouldn't recommend enabling it
> in production, as it has partially bypassed our regular development
> cycle. Being enabled in master by default for a while lets more
> developers use/try the feature before release, which can be
> considered an additional, implicit QA process. But as we just
> discovered, this hasn't happened.
>
> Hence you can definitely try it, but this exposes your cluster(s) to
> some risk, as with any new (and incompletely tested) feature....
>
>
> Thanks,
>
> Igor
>
>
> On 8/20/2020 4:06 PM, Seena Fallah wrote:
>> Great, thanks.
>>
>> Is it safe to change it manually in ceph.conf now, or should I wait
>> for the next nautilus release for this change? I mean, has QA been
>> run against this value for this config, so that we can trust it and
>> change it, or should we wait until the next nautilus release, on
>> which QA will have run with this value?
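>>
>> (To be explicit, the manual change I mean is something like this in
>> ceph.conf - the option name is the one from the PR diff I link
>> below:)
>>
>> [osd]
>> bluestore_volume_selection_policy = use_some_extra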
>>
>> On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov <ifedotov@suse.de> wrote:
>>
>> Hi Seena,
>>
>> this parameter isn't intended to be adjusted in production
>> environments - the assumption is that the default behavior covers
>> all regular customers' needs.
>>
>> The issue, though, is that the default setting is invalid. It should
>> be 'use_some_extra'. Gonna fix that shortly...
>>
>>
>> Thanks,
>>
>> Igor
>>
>> On 8/20/2020 1:44 PM, Seena Fallah wrote:
>>> Hi Igor.
>>>
>>> Could you please tell me why this config is at LEVEL_DEV
>>> (https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe66…)?
>>> As documented in Ceph, we can't use LEVEL_DEV in production
>>> environments!
>>> Thanks
>>>
>>> On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov@suse.de> wrote:
>>>
>>> Hi Simon,
>>>
>>>
>>> starting with Nautilus v14.2.10, BlueStore is able to use the
>>> 'wasted' space on the DB volume.
>>>
>>> See this PR: https://github.com/ceph/ceph/pull/29687
>>>
>>> A nice overview of the overall BlueFS/RocksDB design can be found
>>> here:
>>>
>>> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b…
>>>
>>> It also includes an overview of (as well as some additional
>>> concerns about) the changes brought by the above-mentioned PR.
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>> On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
>>> > Hi Michael,
>>> >
>>> > thanks for the explanation! So if I understand correctly, we
>>> > waste 93 GB per OSD on unused NVMe space, because only 30 GB is
>>> > actually used...?
>>> >
>>> > And to improve the space for rocksdb, we need to plan for 300 GB
>>> > per rocksdb partition in order to benefit from this advantage....
>>> >
>>> > Reducing the number of small files is something we always ask of
>>> > our users, but reality is what it is ;-)
>>> >
>>> > I'll have to look into how I can get an informative view of these
>>> > metrics... The amount of information coming out of the ceph
>>> > cluster is pretty overwhelming, even when you only look at it
>>> > superficially...
>>> >
>>> > Cheers,
>>> >
>>> > /Simon
>>> >
>>> > On 20/08/2020 10:16, Michael Bisig wrote:
>>> >> Hi Simon
>>> >>
>>> >> As far as I know, RocksDB only uses "leveled" space on the NVMe
>>> >> partition. The level sizes are set to 300MB, 3GB, 30GB and
>>> >> 300GB. Any DB space above such a limit will automatically end up
>>> >> on the slow device. In your setup, where you have 123GB per OSD,
>>> >> that means you only use 30GB of the fast device. The DB which
>>> >> spills over this limit will be offloaded to the HDD, and
>>> >> accordingly it slows down requests and compactions.
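>>> >>
>>> >> (Rough arithmetic: the levels that fit within a 123GB partition
>>> >> are 300MB + 3GB + 30GB, i.e. about 33GB; the next level would
>>> >> need ~300GB more, so roughly 90GB of the NVMe partition is never
>>> >> used by RocksDB.)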
>>> >>
>>> >> You can check what your OSD currently consumes with:
>>> >> ceph daemon osd.X perf dump
>>> >>
>>> >> Informative values are `db_total_bytes`, `db_used_bytes` and
>>> >> `slow_used_bytes`. These change regularly because of the ongoing
>>> >> compactions, but the Prometheus mgr module exports these values
>>> >> so that you can track them.
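>>> >>
>>> >> For a quick one-off check, something like this should pull out
>>> >> just those counters (the OSD id is an example, and the jq filter
>>> >> assumes the counters live under the "bluefs" section of the perf
>>> >> dump output):
>>> >>
>>> >> ceph daemon osd.0 perf dump | \
>>> >>   jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'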
>>> >>
>>> >> Small files generally lead to a bigger RocksDB, especially when
>>> >> you use EC, but this depends on the actual number of files and
>>> >> their sizes.
>>> >>
>>> >> I hope this helps.
>>> >> Regards,
>>> >> Michael
>>> >>
>>> >> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek@science.ru.nl> wrote:
>>> >>
>>> >> Hi
>>> >>
>>> >> Recently our ceph cluster (nautilus) has been experiencing
>>> >> bluefs spillovers on just 2 OSDs, and I disabled the warning for
>>> >> these OSDs:
>>> >> (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
>>> >>
>>> >> I'm wondering what causes this and how it can be prevented.
>>> >>
>>> >> As I understand it, the rocksdb for the OSD needs to store more
>>> >> than fits on the NVMe logical volume (123G for a 12T OSD). A way
>>> >> to fix it could be to increase the logical volume on the NVMe
>>> >> (if there were space on the NVMe, which there isn't at the
>>> >> moment).
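>>> >>
>>> >> (If there were space, I assume the procedure would be something
>>> >> like the following - the LV and OSD names are made up, and I
>>> >> haven't tested this:)
>>> >>
>>> >> # with the OSD stopped, grow the LV backing its DB device
>>> >> lvextend -L 300G /dev/ceph-db-vg/db-osd125
>>> >> # let BlueFS pick up the larger device
>>> >> ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-125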
>>> >>
>>> >> This is the current size of the cluster and how much is free:
>>> >>
>>> >> [root@cephmon1 ~]# ceph df
>>> >> RAW STORAGE:
>>> >>     CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
>>> >>     hdd       1.8 PiB     842 TiB     974 TiB      974 TiB         53.63
>>> >>     TOTAL     1.8 PiB     842 TiB     974 TiB      974 TiB         53.63
>>> >>
>>> >> POOLS:
>>> >>     POOL                  ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
>>> >>     cephfs_data            1     572 MiB     121.26M     2.4 GiB         0       167 TiB
>>> >>     cephfs_metadata        2      56 GiB       5.15M      57 GiB         0       167 TiB
>>> >>     cephfs_data_3copy      8     201 GiB      51.68k     602 GiB      0.09       222 TiB
>>> >>     cephfs_data_ec83      13     643 TiB     279.75M     953 TiB     58.86       485 TiB
>>> >>     rbd                   14      21 GiB       5.66k      64 GiB         0       222 TiB
>>> >>     .rgw.root             15     1.2 KiB           4       1 MiB         0       167 TiB
>>> >>     default.rgw.control   16         0 B           8         0 B         0       167 TiB
>>> >>     default.rgw.meta      17       765 B           4       1 MiB         0       167 TiB
>>> >>     default.rgw.log       18         0 B         207         0 B         0       167 TiB
>>> >>     cephfs_data_ec57      20     433 MiB         230     1.2 GiB         0       278 TiB
>>> >>
>>> >> The amount used can still grow a bit before we need to add
>>> >> nodes, but apparently we are running into the limits of our
>>> >> rocksdb partitions.
>>> >>
>>> >> Did we choose a parameter (e.g. minimal object size) too small,
>>> >> so that we have too many objects on these spillover OSDs? Or is
>>> >> it that too many small files are stored on the cephfs
>>> >> filesystems?
>>> >>
>>> >> When we expand the cluster, we can choose larger NVMe devices to
>>> >> allow larger rocksdb partitions, but is that the right way to
>>> >> deal with this, or should we adjust some parameters on the
>>> >> cluster that will reduce the rocksdb size?
>>> >>
>>> >> Cheers
>>> >>
>>> >> /Simon