[ceph-users] Re: BlueFS spillover detected, why, what?

21 Aug 2020

OK, thanks. Also, as mentioned in the doc you shared from CloudFerro, is it
not a good idea to change `write_buffer_size` for the BlueStore RocksDB to
fit our DB?

On Fri, Aug 21, 2020 at 1:46 AM Igor Fedotov <ifedotov(a)suse.de> wrote:

...
 Honestly, I don't have a perfect solution for now.

 If this is urgent, you had probably better proceed with enabling the new DB
 space management feature.

 But please do that gradually: modify 1-2 OSDs at the first stage and test
 them for some period (maybe a week or two).

 Thanks,

 Igor

 On 8/20/2020 5:36 PM, Seena Fallah wrote:

 So what do you suggest as a short-term solution? (I think you won't
 backport it to nautilus for at least about 6 months.)

 Changing the DB size is too expensive, because I would have to buy new NVMe
 devices of double the size and also redeploy all my OSDs.
 Manual compaction will still have an impact on performance, and doing it
 for a month doesn't look very good!

 On Thu, Aug 20, 2020 at 6:52 PM Igor Fedotov <ifedotov(a)suse.de> wrote:

> Correct.
> On 8/20/2020 5:15 PM, Seena Fallah wrote:
>
> So you won't backport it to nautilus until it has been the default in
> master for a while?
>
> On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov <ifedotov(a)suse.de> wrote:
>
>> From a technical/developer's point of view I don't see any issues with
>> tuning this option. But for now I wouldn't recommend enabling it in
>> production, as it partially bypassed our regular development cycle. Being
>> enabled by default in master for a while allows more developers to
>> use/try the feature before release. This can be considered an additional
>> implicit QA process. But as we just discovered, this hasn't happened.
>>
>> Hence you can definitely try it, but this exposes your cluster(s) to some
>> risk, as with any new (and incompletely tested) feature....
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>> On 8/20/2020 4:06 PM, Seena Fallah wrote:
>>
>> Great, thanks.
>>
>> Is it safe to change it manually in ceph.conf now, or should I wait for
>> the next nautilus release that includes this change? I mean, has QA been
>> run with this value for this config, so that we can trust it and change
>> it, or should we wait until the next nautilus release, for which QA will
>> have run with this value?
>>
>> On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov <ifedotov(a)suse.de> wrote:
>>
>>> Hi Seena,
>>>
>>> this parameter isn't intended to be adjusted in production environments;
>>> the default behavior is supposed to cover all regular customers' needs.
>>>
>>> The issue, though, is that the default setting is invalid. It should be
>>> 'use_some_extra'. Gonna fix that shortly...
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>>
>>>
>>> On 8/20/2020 1:44 PM, Seena Fallah wrote:
>>>
>>> Hi Igor.
>>>
>>> Could you please tell me why this config is in LEVEL_DEV?
>>>
>>> https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe66…
>>>
>>> As documented in Ceph, we can't use LEVEL_DEV options in production
>>> environments!
>>>
>>> Thanks
>>>
>>> On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov(a)suse.de> wrote:
>>>
>>>> Hi Simon,
>>>>
>>>>
>>>> Starting with Nautilus v14.2.10, BlueStore is able to use 'wasted' space
>>>> on the DB volume.
>>>>
>>>> See this PR: https://github.com/ceph/ceph/pull/29687
>>>>
>>>> A nice overview of the overall BlueFS/RocksDB design can be found here:
>>>>
>>>> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b…
>>>>
>>>> It also includes an overview of (as well as some additional concerns
>>>> about) the changes brought by the above-mentioned PR.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>>
>>>>
>>>> On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
>>>> > Hi Michael,
>>>> >
>>>> > thanks for the explanation! So if I understand correctly, we waste 93
>>>> > GB per OSD on unused NVME space, because only 30GB is actually used...?
>>>> >
>>>> > And to improve the space for rocksdb, we need to plan for 300GB per
>>>> > rocksdb partition in order to benefit from this advantage....
>>>> >
>>>> > Reducing the number of small files is something we always ask of our
>>>> > users, but reality is what it is ;-)
>>>> >
>>>> > I'll have to look into how I can get an informative view on these
>>>> > metrics... The amount of information coming out of the ceph cluster is
>>>> > pretty overwhelming, even when you look only superficially...
>>>> >
>>>> > Cheers,
>>>> >
>>>> > /Simon
>>>> >
>>>> > On 20/08/2020 10:16, Michael Bisig wrote:
>>>> >> Hi Simon
>>>> >>
>>>> >> As far as I know, RocksDB only uses "leveled" space on the NVMe
>>>> >> partition. The level limits are set to 300MB, 3GB, 30GB and 300GB.
>>>> >> Any DB space above the largest limit that fits on the partition
>>>> >> automatically ends up on the slow device. In your setup, with 123GB
>>>> >> per OSD, that means you only use 30GB of the fast device. The part of
>>>> >> the DB which spills over this limit is offloaded to the HDD and,
>>>> >> accordingly, slows down requests and compactions.
>>>> >>
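The sizing rule described above can be sketched in a few lines of Python. This is a simplified model that follows the description in this message (only the largest level limit that fits on the partition counts as usable), not the actual BlueFS/RocksDB allocation logic. The function names are hypothetical; the limits are the ones quoted above.

```python
# Level limits as quoted in the thread, in GB (0.3 GB = 300 MB).
LEVEL_LIMITS_GB = [0.3, 3, 30, 300]


def usable_db_space_gb(partition_gb, limits=LEVEL_LIMITS_GB):
    """Return the largest level limit that fits on the partition (0 if none).

    Simplified model of the rule described in the thread, not Ceph code.
    """
    fitting = [limit for limit in limits if limit <= partition_gb]
    return max(fitting) if fitting else 0


def unused_db_space_gb(partition_gb, limits=LEVEL_LIMITS_GB):
    """Return the part of the partition that this model leaves unused."""
    return partition_gb - usable_db_space_gb(partition_gb, limits)


if __name__ == "__main__":
    # 123 GB is the partition size from the thread; 300 GB is the next limit.
    for size_gb in (123, 300):
        print(size_gb, usable_db_space_gb(size_gb), unused_db_space_gb(size_gb))
```

Under this model a 123GB partition gives 30GB of usable DB space and leaves 93GB unused, which matches the figures discussed in the replies.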
>>>> >> You can check what your OSD currently consumes with:
>>>> >>    ceph daemon osd.X perf dump
>>>> >>
>>>> >> Informative values are `db_total_bytes`, `db_used_bytes` and
>>>> >> `slow_used_bytes`. These change regularly because of the ongoing
>>>> >> compactions, but the Prometheus mgr module exports these values so
>>>> >> that you can track them.
>>>> >>
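As a rough illustration of how the three counters named above could be read, here is a minimal Python sketch that summarises them from the JSON text of a perf dump. It assumes the counters sit under a top-level `bluefs` section of the output; the function name `spillover_summary` and the sample numbers are made up for illustration and are not taken from the thread.

```python
import json


def spillover_summary(perf_dump_json):
    """Summarise BlueFS DB usage from the JSON text of a perf dump.

    Assumes the counters live under a top-level "bluefs" section.
    """
    bluefs = json.loads(perf_dump_json)["bluefs"]
    total = bluefs["db_total_bytes"]
    used = bluefs["db_used_bytes"]
    slow = bluefs["slow_used_bytes"]
    return {
        # Fraction of the fast DB volume currently in use.
        "db_used_fraction": used / total if total else 0.0,
        # Bytes of DB data currently placed on the slow device.
        "slow_used_bytes": slow,
        # Any DB data on the slow device means the OSD has spilled over.
        "spilled_over": slow > 0,
    }


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    sample = json.dumps(
        {"bluefs": {"db_total_bytes": 1000,
                    "db_used_bytes": 250,
                    "slow_used_bytes": 10}}
    )
    print(spillover_summary(sample))
```

The input could be the output of the `ceph daemon osd.X perf dump` command shown above, saved to a file and read back as text.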
>>>> >> Small files generally lead to a bigger RocksDB, especially when you
>>>> >> use EC, but this depends on the actual number of files and their sizes.
>>>> >>
>>>> >> I hope this helps.
>>>> >> Regards,
>>>> >> Michael
>>>> >>
>>>> >> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek(a)science.ru.nl> wrote:
>>>> >>
>>>> >>      Hi
>>>> >>
>>>> >>      Recently our ceph cluster (nautilus) has been experiencing bluefs
>>>> >>      spillovers on just 2 OSDs, and I disabled the warning for these OSDs.
>>>> >>      (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
>>>> >>
>>>> >>      I'm wondering what causes this and how this can be prevented.
>>>> >>
>>>> >>      As I understand it, the rocksdb for the OSD needs to store more than
>>>> >>      fits on the NVMe logical volume (123G for a 12T OSD). A way to fix it
>>>> >>      could be to increase the logical volume on the NVMe (if there were
>>>> >>      space on the NVMe, which there isn't at the moment).
>>>> >>
>>>> >>      This is the current size of the cluster and how much is free:
>>>> >>
>>>> >>      [root@cephmon1 ~]# ceph df
>>>> >>      RAW STORAGE:
>>>> >>           CLASS        SIZE       AVAIL        USED     RAW USED     %RAW USED
>>>> >>           hdd       1.8 PiB     842 TiB     974 TiB      974 TiB         53.63
>>>> >>           TOTAL     1.8 PiB     842 TiB     974 TiB      974 TiB         53.63
>>>> >>
>>>> >>      POOLS:
>>>> >>           POOL                    ID      STORED     OBJECTS        USED     %USED     MAX AVAIL
>>>> >>           cephfs_data              1     572 MiB     121.26M     2.4 GiB         0       167 TiB
>>>> >>           cephfs_metadata          2      56 GiB       5.15M      57 GiB         0       167 TiB
>>>> >>           cephfs_data_3copy        8     201 GiB      51.68k     602 GiB      0.09       222 TiB
>>>> >>           cephfs_data_ec83        13     643 TiB     279.75M     953 TiB     58.86       485 TiB
>>>> >>           rbd                     14      21 GiB       5.66k      64 GiB         0       222 TiB
>>>> >>           .rgw.root               15     1.2 KiB           4       1 MiB         0       167 TiB
>>>> >>           default.rgw.control     16         0 B           8         0 B         0       167 TiB
>>>> >>           default.rgw.meta        17       765 B           4       1 MiB         0       167 TiB
>>>> >>           default.rgw.log         18         0 B         207         0 B         0       167 TiB
>>>> >>           cephfs_data_ec57        20     433 MiB         230     1.2 GiB         0       278 TiB
>>>> >>
>>>> >>      The amount used can still grow a bit before we need to add nodes,
>>>> >>      but apparently we are running into the limits of our rocksdb
>>>> >>      partitions.
>>>> >>
>>>> >>      Did we choose a parameter (e.g. minimal object size) too small, so
>>>> >>      that we have too many objects on these spillover OSDs? Or is it that
>>>> >>      too many small files are stored on the cephfs filesystems?
>>>> >>
>>>> >>      When we expand the cluster, we can choose larger NVMe devices to
>>>> >>      allow larger rocksdb partitions, but is that the right way to deal
>>>> >>      with this, or should we adjust some parameters on the cluster that
>>>> >>      will reduce the rocksdb size?
>>>> >>
>>>> >>      Cheers
>>>> >>
>>>> >>      /Simon
>>>> >>      _______________________________________________
>>>> >>      ceph-users mailing list -- ceph-users(a)ceph.io
>>>> >>      To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>>> >>
>>>>
>>> 
