[ceph-users] Re: BlueFS spillover detected, why, what?

20 Aug 2020

Hi Simon,

Starting with Nautilus v14.2.10, BlueStore is able to use 'wasted' space on
the DB volume.

See this PR: https://github.com/ceph/ceph/pull/29687

A nice overview of the overall BlueFS/RocksDB design can be found here:

https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b…

It also includes an overview of (and some additional concerns about) the
changes brought by the above-mentioned PR.

Thanks,

Igor

On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
> Hi Michael,
>
> thanks for the explanation! So if I understand correctly, we waste 93 
> GB per OSD on unused NVME space, because only 30GB is actually used...?
>
> And to improve the space for rocksdb, we need to plan for 300GB per 
> rocksdb partition in order to benefit from this advantage....
>
> Reducing the number of small files is something we always ask of our 
> users, but reality is what it is ;-)
>
> I'll have to look into how I can get an informative view on these 
> metrics... It's pretty overwhelming the amount of information coming 
> out of the ceph cluster, even when you look only superficially...
>
> Cheers,
>
> /Simon
>
> On 20/08/2020 10:16, Michael Bisig wrote:
>> Hi Simon
>>
>> As far as I know, RocksDB only uses "leveled" space on the NVMe
>> partition. The level sizes are 300MB, 3GB, 30GB and 300GB. Any DB
>> data above such a limit automatically ends up on the slow device.
>> In your setup, with 123GB per OSD, that means you only use 30GB of
>> the fast device. The part of the DB that spills over this limit is
>> offloaded to the HDD and, accordingly, slows down requests and
>> compactions.
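
The arithmetic behind the 30GB figure can be sketched in a few lines of Python. This is only a simplified model of the rule described in the message above (the largest level size that still fits on the partition is what gets used); the function names are made up for illustration, and real RocksDB/BlueFS behaviour depends on version and configuration.

```python
# Simplified model: level sizes in GB as quoted in the message above.
LEVEL_SIZES_GB = [0.3, 3, 30, 300]


def usable_db_space_gb(partition_gb):
    """Return the largest level size that fits on the partition (0 if none fits)."""
    fitting = [size for size in LEVEL_SIZES_GB if size <= partition_gb]
    return max(fitting) if fitting else 0


def unused_db_space_gb(partition_gb):
    """Return the part of the partition that this model leaves unused."""
    return partition_gb - usable_db_space_gb(partition_gb)


if __name__ == "__main__":
    # 123 GB is the DB volume size from the original question.
    for partition in (123, 300):
        print(partition, usable_db_space_gb(partition), unused_db_space_gb(partition))
```

Under this model a 123GB partition yields 30GB of usable DB space and 93GB unused, which matches the numbers discussed in the thread.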
>>
>> You can check what your OSD currently consumes with:
>>    ceph daemon osd.X perf dump
>>
>> Informative values are `db_total_bytes`, `db_used_bytes` and
>> `slow_used_bytes`. These change regularly because of ongoing
>> compactions, but the Prometheus mgr module exports them so that
>> you can track them.
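
A minimal sketch of how those three counters could be pulled out of the JSON that `ceph daemon osd.X perf dump` prints. It assumes the counters sit under a top-level "bluefs" section; the helper name and the sample numbers are invented for illustration and are not real cluster output.

```python
import json


def bluefs_usage(perf_dump_text):
    """Extract the BlueFS counters mentioned above from perf dump JSON text."""
    perf = json.loads(perf_dump_text)
    # Assumption: the counters live in a "bluefs" section of the dump.
    bluefs = perf.get("bluefs", {})
    db_total = bluefs.get("db_total_bytes", 0)
    db_used = bluefs.get("db_used_bytes", 0)
    slow_used = bluefs.get("slow_used_bytes", 0)
    return {
        "db_total_bytes": db_total,
        "db_used_bytes": db_used,
        "slow_used_bytes": slow_used,
        # Fraction of the fast DB volume currently in use.
        "db_used_fraction": db_used / db_total if db_total else 0.0,
        # Any bytes on the slow device mean the DB has spilled over.
        "spilled_over": slow_used > 0,
    }


if __name__ == "__main__":
    # Made-up sample values, not taken from a real OSD.
    sample = '{"bluefs": {"db_total_bytes": 1000, "db_used_bytes": 250, "slow_used_bytes": 50}}'
    print(bluefs_usage(sample))
```

The output of the real command could be fed in by saving it to a file first and passing the file contents to `bluefs_usage`.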
>>
>> Small files generally lead to a bigger RocksDB, especially when you
>> use EC, but this depends on the actual number of files and their sizes.
>>
>> I hope this helps.
>> Regards,
>> Michael
>>
>> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek(a)science.ru.nl> wrote:
>>
>>      Hi
>>
>>      Recently our ceph cluster (nautilus) is experiencing bluefs spillovers,
>>      just 2 osd's and I disabled the warning for these osds.
>>      (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
>>
>>      I'm wondering what causes this and how this can be prevented.
>>
>>      As I understand it the rocksdb for the OSD needs to store more than fits
>>      on the NVME logical volume (123G for 12T OSD). A way to fix it could be
>>      to increase the logical volume on the nvme (if there was space on the
>>      nvme, which there isn't at the moment).
>>
>>      This is the current size of the cluster and how much is free:
>>
>>      [root@cephmon1 ~]# ceph df
>>      RAW STORAGE:
>>          CLASS           SIZE       AVAIL        USED    RAW USED    %RAW USED
>>          hdd          1.8 PiB     842 TiB     974 TiB     974 TiB        53.63
>>          TOTAL        1.8 PiB     842 TiB     974 TiB     974 TiB        53.63
>>
>>      POOLS:
>>          POOL                    ID      STORED     OBJECTS        USED     %USED     MAX AVAIL
>>          cephfs_data              1     572 MiB     121.26M     2.4 GiB         0       167 TiB
>>          cephfs_metadata          2      56 GiB       5.15M      57 GiB         0       167 TiB
>>          cephfs_data_3copy        8     201 GiB      51.68k     602 GiB      0.09       222 TiB
>>          cephfs_data_ec83        13     643 TiB     279.75M     953 TiB     58.86       485 TiB
>>          rbd                     14      21 GiB       5.66k      64 GiB         0       222 TiB
>>          .rgw.root               15     1.2 KiB           4       1 MiB         0       167 TiB
>>          default.rgw.control     16         0 B           8         0 B         0       167 TiB
>>          default.rgw.meta        17       765 B           4       1 MiB         0       167 TiB
>>          default.rgw.log         18         0 B         207         0 B         0       167 TiB
>>          cephfs_data_ec57        20     433 MiB         230     1.2 GiB         0       278 TiB
>>
>>      The amount used can still grow a bit before we need to add nodes, but
>>      apparently we are running into the limits of our rocksdb partitions.
>>
>>      Did we choose a parameter (e.g. minimal object size) too small, so we
>>      have too many objects on these spillover OSDs? Or is it that too many
>>      small files are stored on the cephfs filesystems?
>>
>>      When we expand the cluster, we can choose larger nvme devices to allow
>>      larger rocksdb partitions, but is that the right way to deal with this,
>>      or should we adjust some parameters on the cluster that will reduce the
>>      rocksdb size?
>>
>>      Cheers
>>
>>      /Simon
>>      _______________________________________________
>>      ceph-users mailing list -- ceph-users(a)ceph.io
>>      To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>
