Honestly I don't have a perfect solution for now.
If this is urgent, your best option is probably to proceed with enabling
the new DB space management feature.
But please do that gradually: modify 1-2 OSDs at the first stage and
test them for some period (maybe a week or two).
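
For example, on two OSDs only (the option name here is the one
introduced by the PR referenced below - please double-check it against
your build before relying on it):

  # enable the new space management policy on a couple of OSDs only
  ceph config set osd.0 bluestore_volume_selection_policy use_some_extra
  ceph config set osd.1 bluestore_volume_selection_policy use_some_extra
  # restart those OSDs so the new policy takes effect

Then watch their perf counters and logs for a week or two before
rolling it out further.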
Thanks,
Igor
On 8/20/2020 5:36 PM, Seena Fallah wrote:
So what do you suggest for a short-term solution? (I think you won't
backport it to nautilus for at least about 6 months.)
Changing the DB size is too expensive, because I would have to buy new
NVMe devices with double the size and also redeploy all my OSDs.
Manual compaction will still have an impact on performance, and doing
it for months doesn't look very good!
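
(By manual compaction I mean something like the following - the OSD id
and path are just examples:)

  # online, via the admin socket
  ceph daemon osd.0 compact
  # or offline, with the OSD stopped
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 compact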
On Thu, Aug 20, 2020 at 6:52 PM Igor Fedotov <ifedotov@suse.de> wrote:
Correct.
On 8/20/2020 5:15 PM, Seena Fallah wrote:
> So you won't backport it to nautilus until it has been the default in
> master for a while?
>
> On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov <ifedotov@suse.de> wrote:
>
> From a technical/developer's point of view I don't see any issues
> with tuning this option. But for now I wouldn't recommend enabling it
> in production, as it has partially bypassed our regular development
> cycle. Being enabled in master by default for a while lets more
> developers use/try the feature before release, which can be
> considered an additional, implicit QA process. But as we just
> discovered, this hasn't happened.
>
> Hence you can definitely try it, but this exposes your cluster(s) to
> some risk, as with any new (and incompletely tested) feature....
>
>
> Thanks,
>
> Igor
>
>
> On 8/20/2020 4:06 PM, Seena Fallah wrote:
>> Great, thanks.
>>
>> Is it safe to change it manually in ceph.conf now, or should I wait
>> for the next nautilus release for this change? I mean, has QA been
>> run against this value for this config, so that we can trust it and
>> change it, or should we wait until the next nautilus release, on
>> which QA will have run with this value?
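>>
>> (To be explicit, the manual change I mean is something like this in
>> ceph.conf - the option name is the one from the PR diff I link
>> below:)
>>
>> [osd]
>> bluestore_volume_selection_policy = use_some_extra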
>>
>> On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov <ifedotov@suse.de> wrote:
>>
>> Hi Seena,
>>
>> this parameter isn't intended to be adjusted in production
>> environments - the assumption is that the default behavior covers
>> all regular customers' needs.
>>
>> The issue, though, is that the default setting is invalid. It should
>> be 'use_some_extra'. Gonna fix that shortly...
>>
>>
>> Thanks,
>>
>> Igor
>>
>> On 8/20/2020 1:44 PM, Seena Fallah wrote:
>>> Hi Igor.
>>>
>>> Could you please tell me why this config is at LEVEL_DEV
>>> (https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe66…)?
>>> As documented in Ceph, we can't use LEVEL_DEV in production
>>> environments!
>>> Thanks
>>>
>>> On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov@suse.de> wrote:
>>>
>>> Hi Simon,
>>>
>>>
>>> starting with Nautilus v14.2.10, BlueStore is able to use the
>>> 'wasted' space on the DB volume.
>>>
>>> See this PR: https://github.com/ceph/ceph/pull/29687
>>>
>>> A nice overview of the overall BlueFS/RocksDB design can be found
>>> here:
>>>
>>> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b…
>>>
>>> It also includes an overview of (as well as some additional
>>> concerns about) the changes brought by the above-mentioned PR.
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>> On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
>>> > Hi Michael,
>>> >
>>> > thanks for the explanation! So if I understand correctly, we
>>> > waste 93 GB per OSD on unused NVMe space, because only 30 GB is
>>> > actually used...?
>>> >
>>> > And to improve the space for rocksdb, we need to plan for 300 GB
>>> > per rocksdb partition in order to benefit from this advantage....
>>> >
>>> > Reducing the number of small files is something we always ask of
>>> > our users, but reality is what it is ;-)
>>> >
>>> > I'll have to look into how I can get an informative view of these
>>> > metrics... The amount of information coming out of the ceph
>>> > cluster is pretty overwhelming, even when you only look at it
>>> > superficially...
>>> >
>>> > Cheers,
>>> >
>>> > /Simon
>>> >
>>> > On 20/08/2020 10:16, Michael Bisig wrote:
>>> >> Hi Simon
>>> >>
>>> >> As far as I know, RocksDB only uses "leveled" space on the NVMe
>>> >> partition. The level sizes are set to 300MB, 3GB, 30GB and
>>> >> 300GB. Any DB space above such a limit will automatically end up
>>> >> on the slow device. In your setup, where you have 123GB per OSD,
>>> >> that means you only use 30GB of the fast device. The DB which
>>> >> spills over this limit will be offloaded to the HDD, and
>>> >> accordingly it slows down requests and compactions.
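>>> >>
>>> >> (Rough arithmetic: the levels that fit within a 123GB partition
>>> >> are 300MB + 3GB + 30GB, i.e. about 33GB; the next level would
>>> >> need ~300GB more, so roughly 90GB of the NVMe partition is never
>>> >> used by RocksDB.)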
>>> >>
>>> >> You can check what your OSD currently consumes with:
>>> >> ceph daemon osd.X perf dump
>>> >>
>>> >> Informative values are `db_total_bytes`, `db_used_bytes` and
>>> >> `slow_used_bytes`. These change regularly because of the ongoing
>>> >> compactions, but the Prometheus mgr module exports these values
>>> >> so that you can track them.
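>>> >>
>>> >> For a quick one-off check, something like this should pull out
>>> >> just those counters (the OSD id is an example, and the jq filter
>>> >> assumes the counters live under the "bluefs" section of the perf
>>> >> dump output):
>>> >>
>>> >> ceph daemon osd.0 perf dump | \
>>> >>   jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'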
>>> >>
>>> >> Small files generally lead to a bigger RocksDB, especially when
>>> >> you use EC, but this depends on the actual number of files and
>>> >> their sizes.
>>> >>
>>> >> I hope this helps.
>>> >> Regards,
>>> >> Michael
>>> >>
>>> >> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek@science.ru.nl> wrote:
>>> >>
>>> >> Hi
>>> >>
>>> >> Recently our ceph cluster (nautilus) has been experiencing
>>> >> bluefs spillovers on just 2 OSDs, and I disabled the warning for
>>> >> these OSDs:
>>> >> (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
>>> >>
>>> >> I'm wondering what causes this and how it can be prevented.
>>> >>
>>> >> As I understand it, the rocksdb for the OSD needs to store more
>>> >> than fits on the NVMe logical volume (123G for a 12T OSD). A way
>>> >> to fix it could be to increase the logical volume on the NVMe
>>> >> (if there were space on the NVMe, which there isn't at the
>>> >> moment).
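>>> >>
>>> >> (If there were space, I assume the procedure would be something
>>> >> like the following - the LV and OSD names are made up, and I
>>> >> haven't tested this:)
>>> >>
>>> >> # with the OSD stopped, grow the LV backing its DB device
>>> >> lvextend -L 300G /dev/ceph-db-vg/db-osd125
>>> >> # let BlueFS pick up the larger device
>>> >> ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-125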
>>> >>
>>> >> This is the current size of the cluster and how much is free:
>>> >>
>>> >> [root@cephmon1 ~]# ceph df
>>> >> RAW STORAGE:
>>> >>     CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
>>> >>     hdd       1.8 PiB     842 TiB     974 TiB      974 TiB         53.63
>>> >>     TOTAL     1.8 PiB     842 TiB     974 TiB      974 TiB         53.63
>>> >>
>>> >> POOLS:
>>> >>     POOL                  ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
>>> >>     cephfs_data            1     572 MiB     121.26M     2.4 GiB         0       167 TiB
>>> >>     cephfs_metadata        2      56 GiB       5.15M      57 GiB         0       167 TiB
>>> >>     cephfs_data_3copy      8     201 GiB      51.68k     602 GiB      0.09       222 TiB
>>> >>     cephfs_data_ec83      13     643 TiB     279.75M     953 TiB     58.86       485 TiB
>>> >>     rbd                   14      21 GiB       5.66k      64 GiB         0       222 TiB
>>> >>     .rgw.root             15     1.2 KiB           4       1 MiB         0       167 TiB
>>> >>     default.rgw.control   16         0 B           8         0 B         0       167 TiB
>>> >>     default.rgw.meta      17       765 B           4       1 MiB         0       167 TiB
>>> >>     default.rgw.log       18         0 B         207         0 B         0       167 TiB
>>> >>     cephfs_data_ec57      20     433 MiB         230     1.2 GiB         0       278 TiB
>>> >>
>>> >> The amount used can still grow a bit before we need to add
>>> >> nodes, but apparently we are running into the limits of our
>>> >> rocksdb partitions.
>>> >>
>>> >> Did we choose a parameter (e.g. minimal object size) too small,
>>> >> so that we have too many objects on these spillover OSDs? Or is
>>> >> it that too many small files are stored on the cephfs
>>> >> filesystems?
>>> >>
>>> >> When we expand the cluster, we can choose larger NVMe devices to
>>> >> allow larger rocksdb partitions, but is that the right way to
>>> >> deal with this, or should we adjust some parameters on the
>>> >> cluster that will reduce the rocksdb size?
>>> >>
>>> >> Cheers
>>> >>
>>> >> /Simon