You can use the extra container arguments I pointed out a few months
ago. Those work in my test clusters, although I haven't enabled them
in production yet. But it shouldn't make a difference whether it's a
test cluster or not. 😉
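
Something along these lines should do it, as a sketch only: the exact
spec syntax differs between cephadm releases, and whether daemon
arguments go into extra_entrypoint_args or extra_container_args needs
to be verified against the docs for your version:

# mon.yaml -- cephadm re-applies the spec on redeploys and upgrades,
# which is what makes the setting persist
service_type: mon
placement:
  count: 5
extra_entrypoint_args:
  - "--mon-rocksdb-options=write_buffer_size=33554432,compression=kLZ4Compression,bottommost_compression=kLZ4HCCompression"

Apply it with "ceph orch apply -i mon.yaml"; the write_buffer_size
value mirrors the documented default and should be double-checked for
your release.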
Quoting Zakhar Kirpichenko <zakhar(a)gmail.com>:
> Hi,
>
>> Did you notice any downsides with your compression settings so far?
>
> None, at least on our systems. Except the part that I haven't found a way
> to make the settings persist.
>
>> Do you have all mons now on compression?
>
> I have 3 out of 5 monitors with compression and 2 without it. The 2
> monitors with uncompressed RocksDB have much larger disks which do not
> suffer from writes as much as the other 3. I keep them uncompressed "just
> in case", i.e. for the unlikely event if the 3 monitors with compressed
> RocksDB fail or have any issues specifically because of the compression. I
> have to say that this hasn't happened yet, and this precaution may be
> unnecessary.
>
>> Did release updates go through without issues?
>
> In our case, container updates overwrite the monitors' configurations and
> reset RocksDB options, so each updated monitor runs without RocksDB
> compression until the setting is re-added manually. Other than that, I have not
> encountered any issues related to compression during the updates.
>
>> Do you know if this also works with Reef (we see massive writes there
>> as well)?
>
> Unfortunately, I can't comment on Reef as we're still using Pacific.
>
> /Z
>
> On Tue, 16 Apr 2024 at 18:08, Dietmar Rieder <dietmar.rieder(a)i-med.ac.at>
> wrote:
>
>> Hi Zakhar, hello List,
>>
>> I just wanted to follow up on this and ask a few questions:
>>
>> Did you notice any downsides with your compression settings so far?
>> Do you have all mons now on compression?
>> Did release updates go through without issues?
>> Do you know if this also works with Reef (we see massive writes there
>> as well)?
>>
>> Can you briefly tabulate the commands you used to persistently set the
>> compression options?
>>
>> Thanks so much,
>>
>> Dietmar
>>
>>
>> On 10/18/23 06:14, Zakhar Kirpichenko wrote:
>>> Many thanks for this, Eugen! I very much appreciate your and Mykola's
>>> efforts and insight!
>>>
>>> Another thing I noticed was a reduction in the RocksDB store size after
>>> reducing the total PG count by 30%, from 590-600 MB:
>>>
>>> 65M 3675511.sst
>>> 65M 3675512.sst
>>> 65M 3675513.sst
>>> 65M 3675514.sst
>>> 65M 3675515.sst
>>> 65M 3675516.sst
>>> 65M 3675517.sst
>>> 65M 3675518.sst
>>> 62M 3675519.sst
>>>
>>> to about half of the original size:
>>>
>>> -rw-r--r-- 1 167 167 7218886 Oct 13 16:16 3056869.log
>>> -rw-r--r-- 1 167 167 67250650 Oct 13 16:15 3056871.sst
>>> -rw-r--r-- 1 167 167 67367527 Oct 13 16:15 3056872.sst
>>> -rw-r--r-- 1 167 167 63268486 Oct 13 16:15 3056873.sst
>>>
>>> Then, when I restarted the monitors one by one before adding compression,
>>> the RocksDB store shrank even further. I am not sure why, or what exactly
>>> got automatically removed from the store:
>>>
>>> -rw-r--r-- 1 167 167 841960 Oct 18 03:31 018779.log
>>> -rw-r--r-- 1 167 167 67290532 Oct 18 03:31 018781.sst
>>> -rw-r--r-- 1 167 167 53287626 Oct 18 03:31 018782.sst
>>>
>>> Then I enabled LZ4 and LZ4HC compression in our small production
>>> cluster (6 nodes, 96 OSDs) on 3 out of 5 monitors:
>>> compression=kLZ4Compression,bottommost_compression=kLZ4HCCompression.
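>>>
>>> For reference, a minimal sketch of how this can be set via the local
>>> ceph.conf (an assumption on my part: mon_rocksdb_options seems to be read
>>> when the mon opens its store at startup, before it has quorum, so the
>>> central config store may not apply; the write_buffer_size value mirrors
>>> the documented default and should be verified for your release):
>>>
>>> [mon]
>>> mon_rocksdb_options = write_buffer_size=33554432,compression=kLZ4Compression,bottommost_compression=kLZ4HCCompression
>>>
>>> Each mon then needs a restart, one at a time, to pick this up.
>>>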
>>> I specifically went for LZ4 and LZ4HC because of their balance between
>>> compression/decompression speed and CPU usage. The compression
>>> doesn't seem to affect the cluster in any negative way; the 3 monitors
>>> with compression are operating normally. The effect of the compression on
>>> RocksDB store size and disk writes is quite noticeable:
>>>
>>> Compression disabled, 155 MB store.db, ~125 MB RocksDB sst, and ~530 MB
>>> writes over 5 minutes:
>>>
>>> -rw-r--r-- 1 167 167 4227337 Oct 18 03:58 3080868.log
>>> -rw-r--r-- 1 167 167 67253592 Oct 18 03:57 3080870.sst
>>> -rw-r--r-- 1 167 167 57783180 Oct 18 03:57 3080871.sst
>>>
>>> # du -hs
>>> /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/;
>>> iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon
>>> 155M
>>> /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/
>>> 2471602 be/4 167 6.05 M 473.24 M 0.00 % 0.16 % ceph-mon -n
>>> mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
>>> --default-mon-cluster-log-to-file=false
>>> --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
>>> 2471633 be/4 167 188.00 K 40.91 M 0.00 % 0.02 % ceph-mon -n
>>> mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
>>> --default-mon-cluster-log-to-file=false
>>> --default-mon-cluster-log-to-stderr=true [ms_dispatch]
>>> 2471603 be/4 167 16.00 K 24.16 M 0.00 % 0.01 % ceph-mon -n
>>> mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
>>> --default-mon-cluster-log-to-file=false
>>> --default-mon-cluster-log-to-stderr=true [rocksdb:high0]
>>>
>>> Compression enabled, 60 MB store.db, ~23 MB RocksDB sst, and ~130 MB of
>>> writes over 5 minutes:
>>>
>>> -rw-r--r-- 1 167 167 5766659 Oct 18 03:56 3723355.log
>>> -rw-r--r-- 1 167 167 22240390 Oct 18 03:56 3723357.sst
>>>
>>> # du -hs
>>> /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/;
>>> iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon
>>> 60M
>>> /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/
>>> 2052031 be/4 167 1040.00 K 83.48 M 0.00 % 0.01 % ceph-mon -n
>>> mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false
>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
>>> --default-mon-cluster-log-to-file=false
>>> --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
>>> 2052062 be/4 167 0.00 B 40.79 M 0.00 % 0.01 % ceph-mon -n
>>> mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false
>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
>>> --default-mon-cluster-log-to-file=false
>>> --default-mon-cluster-log-to-stderr=true [ms_dispatch]
>>> 2052032 be/4 167 16.00 K 4.68 M 0.00 % 0.00 % ceph-mon -n
>>> mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false
>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
>>> --default-mon-cluster-log-to-file=false
>>> --default-mon-cluster-log-to-stderr=true [rocksdb:high0]
>>> 2052052 be/4 167 44.00 K 0.00 B 0.00 % 0.00 % ceph-mon -n
>>> mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false
>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
>>> --default-mon-cluster-log-to-file=false
>>> --default-mon-cluster-log-to-stderr=true [msgr-worker-0]
>>>
>>> I haven't noticed a major CPU impact. Unfortunately I didn't
>>> specifically measure CPU time for the monitors, but overall the CPU
>>> impact of monitor store compression on our systems isn't noticeable.
>>> This may be different for larger clusters with larger RocksDB datasets.
>>> Perhaps compression=kLZ4Compression could be enabled by default and
>>> bottommost_compression=kLZ4HCCompression could be optional; in theory
>>> this should result in a somewhat lower compression ratio but much faster
>>> compression.
>>>
>>> I hope this helps. My plan is to keep the monitors with the current
>>> settings, i.e. 3 with compression + 2 without compression, until the next
>>> minor release of Pacific to see whether the monitors with compressed
>>> RocksDB store can be upgraded without issues.
>>>
>>> /Z
>>>
>>>
>>> On Tue, 17 Oct 2023 at 23:45, Eugen Block <eblock(a)nde.ag> wrote:
>>>
>>>> Hi Zakhar,
>>>>
>>>> I took a closer look at what the MONs really do (again with Mykola's
>>>> help) and why manual compaction is triggered so frequently. With
>>>> debug_paxos=20 I noticed that paxosservice and paxos triggered manual
>>>> compactions. So I played with these values:
>>>>
>>>> paxos_service_trim_max = 1000 (default 500)
>>>> paxos_service_trim_min = 500 (default 250)
>>>> paxos_trim_max = 1000 (default 500)
>>>> paxos_trim_min = 500 (default 250)
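>>>>
>>>> For example (a sketch; whether the mons pick these up at runtime or
>>>> need a restart should be verified for your release):
>>>>
>>>> ceph config set mon paxos_service_trim_max 1000
>>>> ceph config set mon paxos_service_trim_min 500
>>>> ceph config set mon paxos_trim_max 1000
>>>> ceph config set mon paxos_trim_min 500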
>>>>
>>>> This reduced the amount of writes by a factor of 3 or 4; the iotop
>>>> values fluctuate a bit, of course. As Mykola suggested, I created
>>>> a tracker issue [1] to increase the default values, since they don't
>>>> seem suitable for a production environment. Although I haven't tested
>>>> that in production yet, I'll ask one of our customers to do that
>>>> in their secondary cluster (for rbd mirroring), where they also suffer
>>>> from large mon stores and heavy writes to the mon store. Your findings
>>>> with the compaction were quite helpful as well; we'll test that too.
>>>> Igor mentioned that the default bluestore_rocksdb config for OSDs will
>>>> enable compression because of positive test results. If we can confirm
>>>> that compression works well for MONs too, it could be enabled
>>>> by default as well.
>>>>
>>>> Regards,
>>>> Eugen
>>>>
>>>>
>>>> [1] https://tracker.ceph.com/issues/63229
>>>>
>>>> Quoting Zakhar Kirpichenko <zakhar(a)gmail.com>:
>>>>
>>>>> With the help of community members, I managed to enable RocksDB
>>>>> compression for a test monitor, and it seems to be working well.
>>>>>
>>>>> Monitor w/o compression writes about 750 MB to disk in 5 minutes:
>>>>>
>>>>> 4854 be/4 167 4.97 M 755.02 M 0.00 % 0.24 % ceph-mon -n
>>>>> mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
>>>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
>>>>> --default-mon-cluster-log-to-file=false
>>>>> --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
>>>>>
>>>>> Monitor with LZ4 compression writes about 1/4 of that over the same
>>>>> time period:
>>>>>
>>>>> 2034728 be/4 167 172.00 K 199.27 M 0.00 % 0.06 % ceph-mon -n
>>>>> mon.ceph05 -f --setuser ceph --setgroup ceph --default-log-to-file=false
>>>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
>>>>> --default-mon-cluster-log-to-file=false
>>>>> --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
>>>>>
>>>>> This is caused by the apparent difference in store.db sizes.
>>>>>
>>>>> Mon store.db w/o compression:
>>>>>
>>>>> # ls -al
>>>>> /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db
>>>>> total 257196
>>>>> drwxr-xr-x 2 167 167 4096 Oct 16 14:00 .
>>>>> drwx------ 3 167 167 4096 Aug 31 05:22 ..
>>>>> -rw-r--r-- 1 167 167 1517623 Oct 16 14:00 3073035.log
>>>>> -rw-r--r-- 1 167 167 67285944 Oct 16 14:00 3073037.sst
>>>>> -rw-r--r-- 1 167 167 67402325 Oct 16 14:00 3073038.sst
>>>>> -rw-r--r-- 1 167 167 62364991 Oct 16 14:00 3073039.sst
>>>>>
>>>>> Mon store.db with compression:
>>>>>
>>>>> # ls -al
>>>>> /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph05/store.db
>>>>> total 91188
>>>>> drwxr-xr-x 2 167 167 4096 Oct 16 14:00 .
>>>>> drwx------ 3 167 167 4096 Oct 16 13:35 ..
>>>>> -rw-r--r-- 1 167 167 1760114 Oct 16 14:00 012693.log
>>>>> -rw-r--r-- 1 167 167 52236087 Oct 16 14:00 012695.sst
>>>>>
>>>>> There are no apparent downsides thus far. If everything works well, I
>>>>> will try adding compression to other monitors.
>>>>>
>>>>> /Z
>>>>>
>>>>> On Mon, 16 Oct 2023 at 14:57, Zakhar Kirpichenko <zakhar(a)gmail.com>
>>>>> wrote:
>>>>>
>>>>>> The issue persists, although to a lesser extent. Any comments from
>>>>>> the Ceph team, please?
>>>>>>
>>>>>> /Z
>>>>>>
>>>>>> On Fri, 13 Oct 2023 at 20:51, Zakhar Kirpichenko <zakhar(a)gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>>> Some of it is transferable to RocksDB on mons nonetheless.
>>>>>>>
>>>>>>> Please point me to relevant Ceph documentation, i.e. a description
>>>>>>> of how various Ceph monitor and RocksDB tunables affect the operation
>>>>>>> of monitors, and I'll gladly look into it.
>>>>>>>
>>>>>>>> Please point me to such recommendations, if they're on
>>>>>>>> docs.ceph.com I'll get them updated.
>>>>>>>
>>>>>>> These are the recommendations we used when we built our Pacific
>>>>>>> cluster:
>>>>>>> https://docs.ceph.com/en/pacific/start/hardware-recommendations/
>>>>>>>
>>>>>>> Our drives are 4 times larger than recommended by this guide. The
>>>>>>> drives are rated for < 0.5 DWPD, which is more than sufficient for
>>>>>>> boot drives and storage of rarely modified files. It is not documented
>>>>>>> or suggested anywhere that monitor processes write several hundred
>>>>>>> gigabytes of data per day, exceeding the amount of data written by
>>>>>>> OSDs, which is why I am not convinced that what we're observing is
>>>>>>> expected behavior, but it's not easy to get a definitive answer from
>>>>>>> the Ceph community.
>>>>>>>
>>>>>>> /Z
>>>>>>>
>>>>>>> On Fri, 13 Oct 2023 at 20:35, Anthony D'Atri <anthony.datri(a)gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Some of it is transferable to RocksDB on mons nonetheless.
>>>>>>>>
>>>>>>>> but their specs exceed Ceph hardware recommendations by a good
>>>>>>>> margin
>>>>>>>>
>>>>>>>>
>>>>>>>> Please point me to such recommendations, if they're on
>>>>>>>> docs.ceph.com I'll get them updated.
>>>>>>>>
>>>>>>>> On Oct 13, 2023, at 13:34, Zakhar Kirpichenko <zakhar(a)gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Thank you, Anthony. As I explained to you earlier, the article you
>>>>>>>> had sent is about RocksDB tuning for BlueStore OSDs, while the issue
>>>>>>>> at hand is not with OSDs but rather with monitors and their RocksDB
>>>>>>>> store. Indeed, the drives are not enterprise-grade, but their specs
>>>>>>>> exceed Ceph hardware recommendations by a good margin. They're being
>>>>>>>> used as boot drives only and aren't supposed to be written to
>>>>>>>> continuously at high rates, which is what unfortunately is happening.
>>>>>>>> I am trying to determine why it is happening and how the issue can
>>>>>>>> be alleviated or resolved; unfortunately, monitor RocksDB usage and
>>>>>>>> tunables appear not to be documented at all.
>>>>>>>>
>>>>>>>> /Z
>>>>>>>>
>>>>>>>> On Fri, 13 Oct 2023 at 20:11, Anthony D'Atri <anthony.datri(a)gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> cf. Mark's article I sent you re RocksDB tuning. I suspect that
>>>>>>>>> with Reef you would experience fewer writes. Universal compaction
>>>>>>>>> might also help, but in the end this SSD is a client SKU and really
>>>>>>>>> not suited for enterprise use. If you had the 1TB SKU you'd get much
>>>>>>>>> longer life, or you could change the overprovisioning on the ones
>>>>>>>>> you have.
>>>>>>>>>
>>>>>>>>> On Oct 13, 2023, at 12:30, Zakhar Kirpichenko <zakhar(a)gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I would very much appreciate it if someone with a better
>>>>>>>>> understanding of monitor internals and use of RocksDB could please
>>>>>>>>> chip in.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>
>> --
>> _________________________________________________________
>> D i e t m a r R i e d e r
>> Innsbruck Medical University
>> Biocenter - Institute of Bioinformatics
>> Innrain 80, 6020 Innsbruck
>> Phone: +43 512 9003 71402 | Mobile: +43 676 8716 72402
>> Email: dietmar.rieder(a)i-med.ac.at
>> Web:   http://www.icbi.at
>>
>>
>>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io