I can't say anything about "write_buffer_size" tuning, never tried that.
But I presume it's rather the "max_bytes_for_level_base" and
"max_bytes_for_level_multiplier" params that should be tuned to modify
RocksDB level granularity.
But I have no idea how safe this is in a production environment.
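
If you do want to experiment, those options are passed through
bluestore_rocksdb_options in ceph.conf. Purely as an illustration (the
values below are invented for the example, not a recommendation):

    [osd]
    # note: this replaces the built-in default option string, so in
    # practice you would copy the current default and adjust only
    # these two fields
    bluestore_rocksdb_options = "max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8"

OSDs need a restart (and ideally a manual compaction) before the new
level sizing fully takes effect.
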
Thanks,
Igor
On 8/21/2020 12:51 AM, Seena Fallah wrote:
OK, thanks. Also, regarding the doc you shared from CloudFerro: is it a
bad idea to change `write_buffer_size` for the BlueStore RocksDB to fit
our DB?
On Fri, Aug 21, 2020 at 1:46 AM Igor Fedotov <ifedotov@suse.de> wrote:
Honestly, I don't have any perfect solution for now.
If this is urgent, you'd probably better proceed with enabling the
new DB space management feature.
But please do that gradually: modify 1-2 OSDs at the first stage
and test them for some period (maybe a week or two).
Thanks,
Igor
On 8/20/2020 5:36 PM, Seena Fallah wrote:
> So what do you suggest as a short-term solution? (I think you
> won't backport it to Nautilus for at least about 6 months.)
>
> Changing the DB size is too expensive, because I would have to buy new
> NVMe devices with double the size and also redeploy all my OSDs.
> Manual compaction will still have an impact on performance, and
> doing it for a month doesn't look very good!
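>
> (By manual compaction I mean something like the OSD admin socket
> command below - as far as I know this is how to trigger it, but
> please correct me if I have the command wrong:
>
>     ceph daemon osd.0 compact
>
> run against each OSD from its host.)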
>
> On Thu, Aug 20, 2020 at 6:52 PM Igor Fedotov <ifedotov@suse.de> wrote:
>
> Correct.
>
> On 8/20/2020 5:15 PM, Seena Fallah wrote:
>> So you won't backport it to Nautilus until it has been the default
>> in master for a while?
>>
>> On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov <ifedotov@suse.de> wrote:
>>
>> From a technical/developer's point of view I don't see any
>> issues with tuning this option. But for now I wouldn't
>> recommend enabling it in production, as it partially
>> bypassed our regular development cycle. Being enabled in
>> master by default for a while allows more developers to
>> use/try the feature before release. This can be
>> considered an additional, implicit QA process. But as
>> we just discovered, this hasn't happened.
>>
>> Hence you can definitely try it, but this exposes your
>> cluster(s) to some risk, as with any new (and incompletely
>> tested) feature....
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>> On 8/20/2020 4:06 PM, Seena Fallah wrote:
>>> Great, thanks.
>>>
>>> Is it safe to change it manually in ceph.conf now, or should I
>>> wait for the next Nautilus release? I mean, has QA been run with
>>> this value for this config, so that we can trust it and change it,
>>> or should we wait until the next Nautilus release where QA has
>>> run with it?
>>>
>>> On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov <ifedotov@suse.de> wrote:
>>>
>>> Hi Seena,
>>>
>>> this parameter isn't intended to be adjusted in
>>> production environments - the assumption is that the
>>> default behavior covers all regular customers' needs.
>>>
>>> The issue, though, is that the default setting is
>>> invalid. It should be 'use_some_extra'. I'm going to fix
>>> that shortly...
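>>>
>>> (If I read the PR right, the option in question is
>>> bluestore_volume_selection_policy; once the fix lands, switching
>>> an OSD over would look something like
>>>
>>>     ceph config set osd.0 bluestore_volume_selection_policy use_some_extra
>>>
>>> followed by an OSD restart - but treat this as a sketch, not a
>>> tested recipe.)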
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>>
>>>
>>> On 8/20/2020 1:44 PM, Seena Fallah wrote:
>>>> Hi Igor.
>>>>
>>>> Could you please tell us why this config is set to LEVEL_DEV
>>>> (https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe66…)?
>>>> As documented in Ceph, we can't use LEVEL_DEV
>>>> in production environments!
>>>>
>>>> Thanks
>>>>
>>>> On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov@suse.de> wrote:
>>>>
>>>> Hi Simon,
>>>>
>>>>
>>>> starting with Nautilus v14.2.10, BlueStore is able to
>>>> use the 'wasted' space at the DB volume.
>>>>
>>>> See this PR:
>>>>
>>>> https://github.com/ceph/ceph/pull/29687
>>>>
>>>> A nice overview of the overall BlueFS/RocksDB
>>>> design can be found here:
>>>>
>>>> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b…
>>>>
>>>> It also includes some overview of (as well as
>>>> additional concerns about) the changes brought
>>>> by the above-mentioned PR.
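>>>>
>>>> (To quickly see which OSDs are currently spilling over, something
>>>> like
>>>>
>>>>     ceph health detail | grep -i spillover
>>>>
>>>> should list the affected OSDs, as long as the warning hasn't been
>>>> disabled for them.)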
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>>
>>>>
>>>> On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
>>>> > Hi Michael,
>>>> >
>>>> > thanks for the explanation! So if I understand correctly, we
>>>> > waste 93 GB per OSD on unused NVMe space, because only 30 GB is
>>>> > actually used...?
>>>> >
>>>> > And to improve the space for RocksDB, we need to plan for 300 GB
>>>> > per RocksDB partition in order to benefit from this advantage....
>>>> >
>>>> > Reducing the number of small files is something we always ask of
>>>> > our users, but reality is what it is ;-)
>>>> >
>>>> > I'll have to look into how I can get an informative view of these
>>>> > metrics... The amount of information coming out of the Ceph
>>>> > cluster is pretty overwhelming, even when you only look at it
>>>> > superficially...
>>>> >
>>>> > Cheers,
>>>> >
>>>> > /Simon
>>>> >
>>>> > On 20/08/2020 10:16, Michael Bisig wrote:
>>>> >> Hi Simon
>>>> >>
>>>> >> As far as I know, RocksDB only uses "leveled" space on the NVMe
>>>> >> partition. The level sizes are 300 MB, 3 GB, 30 GB and 300 GB.
>>>> >> Any DB space above such a limit will automatically end up on the
>>>> >> slow devices. In your setup, where you have 123 GB per OSD, that
>>>> >> means you only use 30 GB of the fast device. The part of the DB
>>>> >> that spills over this limit will be offloaded to the HDD, which
>>>> >> accordingly slows down requests and compactions.
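>>>> >>
>>>> >> (The level sizing follows roughly
>>>> >>
>>>> >>     size(L_n) = max_bytes_for_level_base * max_bytes_for_level_multiplier^(n-1)
>>>> >>
>>>> >> and with the common defaults of base ~= 256 MB and multiplier =
>>>> >> 10 that gives roughly 256 MB, 2.5 GB, 25 GB and 250 GB per
>>>> >> level, usually rounded to the 300 MB / 3 GB / 30 GB / 300 GB
>>>> >> figures above. A DB partition only helps up to the largest
>>>> >> level boundary it can fully hold - hence the 30 GB used out of
>>>> >> 123 GB in your case.)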
>>>> >>
>>>> >> You can check what your OSDs currently consume with:
>>>> >> ceph daemon osd.X perf dump
>>>> >>
>>>> >> Informative values are `db_total_bytes`, `db_used_bytes` and
>>>> >> `slow_used_bytes`. These change regularly because of the ongoing
>>>> >> compactions, but the Prometheus mgr module exports these values
>>>> >> so that you can track them over time.
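>>>> >>
>>>> >> (If you have jq available, something along these lines pulls
>>>> >> out just those counters from the perf dump - a convenience
>>>> >> sketch, adjust the OSD id to taste:
>>>> >>
>>>> >>     ceph daemon osd.0 perf dump \
>>>> >>         | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'
>>>> >> )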
>>>> >>
>>>> >> Small files generally lead to a bigger RocksDB, especially when
>>>> >> you use EC, but this depends on the actual amount and file sizes.
>>>> >>
>>>> >> I hope this helps.
>>>> >> Regards,
>>>> >> Michael
>>>> >>
>>>> >> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek@science.ru.nl> wrote:
>>>> >>
>>>> >> Hi
>>>> >>
>>>> >> Recently our Ceph cluster (Nautilus) has been experiencing
>>>> >> BlueFS spillovers, on just 2 OSDs so far, and I disabled the
>>>> >> warning for these OSDs:
>>>> >> (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
>>>> >>
>>>> >> I'm wondering what causes this and how it can be prevented.
>>>> >>
>>>> >> As I understand it, the RocksDB for the OSD needs to store more
>>>> >> than fits on the NVMe logical volume (123 GB for a 12 TB OSD).
>>>> >> A way to fix it could be to increase the logical volume on the
>>>> >> NVMe (if there were space on the NVMe, which there isn't at the
>>>> >> moment).
>>>> >>
>>>> >> This is the current size of the cluster and how much is free:
>>>> >>
>>>> >> [root@cephmon1 ~]# ceph df
>>>> >> RAW STORAGE:
>>>> >>     CLASS    SIZE       AVAIL      USED       RAW USED    %RAW USED
>>>> >>     hdd      1.8 PiB    842 TiB    974 TiB    974 TiB     53.63
>>>> >>     TOTAL    1.8 PiB    842 TiB    974 TiB    974 TiB     53.63
>>>> >>
>>>> >> POOLS:
>>>> >>     POOL                  ID    STORED     OBJECTS    USED       %USED    MAX AVAIL
>>>> >>     cephfs_data            1    572 MiB    121.26M    2.4 GiB     0       167 TiB
>>>> >>     cephfs_metadata        2    56 GiB     5.15M      57 GiB      0       167 TiB
>>>> >>     cephfs_data_3copy      8    201 GiB    51.68k     602 GiB     0.09    222 TiB
>>>> >>     cephfs_data_ec83      13    643 TiB    279.75M    953 TiB    58.86    485 TiB
>>>> >>     rbd                   14    21 GiB     5.66k      64 GiB      0       222 TiB
>>>> >>     .rgw.root             15    1.2 KiB    4          1 MiB       0       167 TiB
>>>> >>     default.rgw.control   16    0 B        8          0 B         0       167 TiB
>>>> >>     default.rgw.meta      17    765 B      4          1 MiB       0       167 TiB
>>>> >>     default.rgw.log       18    0 B        207        0 B         0       167 TiB
>>>> >>     cephfs_data_ec57      20    433 MiB    230        1.2 GiB     0       278 TiB
>>>> >>
>>>> >> The amount used can still grow a bit before we need to add
>>>> >> nodes, but apparently we are running into the limits of our
>>>> >> RocksDB partitions.
>>>> >>
>>>> >> Did we choose a parameter (e.g. minimal object size) too small,
>>>> >> so that we have too many objects on these spillover OSDs? Or is
>>>> >> it that too many small files are stored on the CephFS
>>>> >> filesystems?
>>>> >>
>>>> >> When we expand the cluster, we can choose larger NVMe devices
>>>> >> to allow larger RocksDB partitions, but is that the right way
>>>> >> to deal with this, or should we adjust some parameters on the
>>>> >> cluster to reduce the RocksDB size?
>>>> >>
>>>> >> Cheers
>>>> >>
>>>> >> /Simon