Hi Zakhar,
thanks so much for the information.
Best
D.Rieder
On 4/16/24 17:40, Zakhar Kirpichenko wrote:
Hi,
> Did you notice any downsides with your compression settings so far?
None, at least on our systems, except that I haven't found a way to make
the settings persist.
> Do you have all mons now on compression?
I have 3 out of 5 monitors with compression and 2 without it. The 2
monitors with uncompressed RocksDB have much larger disks which do not
suffer from writes as much as the other 3. I keep them uncompressed
"just in case", i.e. in the unlikely event that the 3 monitors with
compressed RocksDB fail or have issues specifically because of the
compression. I have to say that this hasn't happened yet, and this
precaution may be unnecessary.
> Did release updates go through without issues?
In our case, container updates overwrite the monitors' configurations and
reset the RocksDB options, so each updated monitor runs without RocksDB
compression until the options are added back manually. Other than that, I
have not encountered any issues related to compression during the updates.
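For reference, re-adding the options after an update looks roughly like this
in our Pacific/cephadm setup (a sketch rather than the literal procedure; the
path below is the monitor's local config file, the option string assumes the
Pacific defaults for everything other than the two compression settings, and
I haven't verified whether "ceph config set mon mon_rocksdb_options ..."
would work instead and survive a restart):

# add under [mon] in the monitor's local config, e.g.
# /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/config:
mon_rocksdb_options = write_buffer_size=33554432,compression=kLZ4Compression,bottommost_compression=kLZ4HCCompression,level_compaction_dynamic_level_bytes=true

# then restart that monitor (one at a time, to keep quorum), e.g.:
ceph orch daemon restart mon.ceph04
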
> Do you know if this also works with Reef (we see massive writes there as well)?
Unfortunately, I can't comment on Reef as we're still using Pacific.
/Z
On Tue, 16 Apr 2024 at 18:08, Dietmar Rieder <dietmar.rieder@i-med.ac.at> wrote:
Hi Zakhar, hello List,
I just wanted to follow up on this and ask a few questions:
Did you notice any downsides with your compression settings so far?
Do you have all mons now on compression?
Did release updates go through without issues?
Do you know if this also works with Reef (we see massive writes there as well)?
Can you briefly list the commands you used to persistently set the
compression options?
Thanks so much,
Dietmar
On 10/18/23 06:14, Zakhar Kirpichenko wrote:
Many thanks for this, Eugen! I very much appreciate yours and Mykola's
efforts and insight!
Another thing I noticed was a reduction of the RocksDB store after reducing
the total PG number by 30%, from 590-600 MB:
65M 3675511.sst
65M 3675512.sst
65M 3675513.sst
65M 3675514.sst
65M 3675515.sst
65M 3675516.sst
65M 3675517.sst
65M 3675518.sst
62M 3675519.sst
to about half of the original size:
-rw-r--r-- 1 167 167 7218886 Oct 13 16:16 3056869.log
-rw-r--r-- 1 167 167 67250650 Oct 13 16:15 3056871.sst
-rw-r--r-- 1 167 167 67367527 Oct 13 16:15 3056872.sst
-rw-r--r-- 1 167 167 63268486 Oct 13 16:15 3056873.sst
Then, when I restarted the monitors one by one before adding compression,
the RocksDB store shrank even further. I am not sure why, or what exactly
got automatically removed from the store:
-rw-r--r-- 1 167 167 841960 Oct 18 03:31 018779.log
-rw-r--r-- 1 167 167 67290532 Oct 18 03:31 018781.sst
-rw-r--r-- 1 167 167 53287626 Oct 18 03:31 018782.sst
Then I enabled LZ4 and LZ4HC compression in our small production cluster
(6 nodes, 96 OSDs) on 3 out of 5 monitors:
compression=kLZ4Compression,bottommost_compression=kLZ4HCCompression.
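To double-check that a monitor has actually picked the options up, one can
grep the OPTIONS file that RocksDB writes into the store directory (a sketch
from our setup; RocksDB should record the effective compression settings
there):

# grep 'compression=' /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/OPTIONS-*
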
I specifically went for LZ4 and LZ4HC because of the balance between
compression/decompression speed and CPU usage. The compression doesn't seem
to affect the cluster in any negative way; the 3 monitors with compression
are operating normally. The effect of the compression on RocksDB store size
and disk writes is quite noticeable:
Compression disabled: 155 MB store.db, ~125 MB RocksDB sst, and ~530 MB of
writes over 5 minutes:
-rw-r--r-- 1 167 167 4227337 Oct 18 03:58 3080868.log
-rw-r--r-- 1 167 167 67253592 Oct 18 03:57 3080870.sst
-rw-r--r-- 1 167 167 57783180 Oct 18 03:57 3080871.sst
# du -hs /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/; iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon
155M    /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/
2471602 be/4 167     6.05 M   473.24 M  0.00 %  0.16 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
2471633 be/4 167   188.00 K    40.91 M  0.00 %  0.02 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [ms_dispatch]
2471603 be/4 167    16.00 K    24.16 M  0.00 %  0.01 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:high0]
Compression enabled: 60 MB store.db, ~23 MB RocksDB sst, and ~130 MB of
writes over 5 minutes:
-rw-r--r-- 1 167 167 5766659 Oct 18 03:56 3723355.log
-rw-r--r-- 1 167 167 22240390 Oct 18 03:56 3723357.sst
# du -hs /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/; iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon
60M     /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/
2052031 be/4 167  1040.00 K    83.48 M  0.00 %  0.01 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
2052062 be/4 167     0.00 B    40.79 M  0.00 %  0.01 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [ms_dispatch]
2052032 be/4 167    16.00 K     4.68 M  0.00 %  0.00 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:high0]
2052052 be/4 167    44.00 K     0.00 B  0.00 %  0.00 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [msgr-worker-0]
I haven't noticed a major CPU impact. Unfortunately I didn't specifically
measure CPU time for the monitors, but overall the CPU impact of monitor
store compression on our systems isn't noticeable. This may be different
for larger clusters with larger RocksDB datasets; in that case perhaps
compression=kLZ4Compression could be enabled by default and
bottommost_compression=kLZ4HCCompression left optional, as in theory this
should result in lower but much faster compression.
I hope this helps. My plan is to keep the monitors with the current
settings, i.e. 3 with compression + 2 without compression, until the next
minor release of Pacific to see whether the monitors with compressed
RocksDB store can be upgraded without issues.
/Z
On Tue, 17 Oct 2023 at 23:45, Eugen Block <eblock@nde.ag> wrote:
> Hi Zakhar,
>
> I took a closer look into what the MONs really do (again with Mykola's
> help) and why manual compaction is triggered so frequently. With
> debug_paxos=20 I noticed that paxosservice and paxos triggered manual
> compactions. So I played with these values:
>
> paxos_service_trim_max = 1000 (default 500)
> paxos_service_trim_min = 500 (default 250)
> paxos_trim_max = 1000 (default 500)
> paxos_trim_min = 500 (default 250)
>
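> For reference, a sketch of how these can be changed at runtime (not
> necessarily the exact commands I used; they can also go into ceph.conf):
>
> ceph config set mon paxos_service_trim_max 1000
> ceph config set mon paxos_service_trim_min 500
> ceph config set mon paxos_trim_max 1000
> ceph config set mon paxos_trim_min 500
>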
> This reduced the amount of writes by a factor of 3 or 4; the iotop
> values are fluctuating a bit, of course. As Mykola suggested, I created
> a tracker issue [1] to increase the default values, since they don't
> seem suitable for a production environment. Although I haven't tested
> that in production yet, I'll ask one of our customers to do that
> in their secondary cluster (for rbd mirroring), where they also suffer
> from large mon stores and heavy writes to the mon store. Your findings
> with the compaction were quite helpful as well; we'll test that too.
> Igor mentioned that the default bluestore_rocksdb config for OSDs will
> enable compression because of positive test results. If we can confirm
> that compression works well for MONs too, compression could be enabled
> by default as well.
>
> Regards,
> Eugen
>
>
> [1] https://tracker.ceph.com/issues/63229
>
> Quoting Zakhar Kirpichenko <zakhar@gmail.com>:
>
>> With the help of community members, I managed to enable RocksDB compression
>> for a test monitor, and it seems to be working well.
>>
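>> A sketch of how the option can be scoped to a single monitor (not
>> necessarily the exact command I used, and I haven't verified whether the
>> monitor applies a centrally stored value at startup or needs it in its
>> local ceph.conf):
>>
>> ceph config set mon.ceph05 mon_rocksdb_options "compression=kLZ4Compression,bottommost_compression=kLZ4HCCompression"
>>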
>> Monitor w/o compression writes about 750 MB to disk in 5 minutes:
>>
>> 4854 be/4 167     4.97 M   755.02 M  0.00 %  0.24 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
>>
>> Monitor with LZ4 compression writes about 1/4 of that over the same time
>> period:
>>
>> 2034728 be/4 167   172.00 K   199.27 M  0.00 %  0.06 % ceph-mon -n mon.ceph05 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
>>
>> This is caused by the apparent difference in store.db sizes.
>>
>> Mon store.db w/o compression:
>>
>> # ls -al /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db
>> total 257196
>> drwxr-xr-x 2 167 167 4096 Oct 16 14:00 .
>> drwx------ 3 167 167 4096 Aug 31 05:22 ..
>> -rw-r--r-- 1 167 167 1517623 Oct 16 14:00 3073035.log
>> -rw-r--r-- 1 167 167 67285944 Oct 16 14:00 3073037.sst
>> -rw-r--r-- 1 167 167 67402325 Oct 16 14:00 3073038.sst
>> -rw-r--r-- 1 167 167 62364991 Oct 16 14:00 3073039.sst
>>
>> Mon store.db with compression:
>>
>> # ls -al /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph05/store.db
>> total 91188
>> drwxr-xr-x 2 167 167 4096 Oct 16 14:00 .
>> drwx------ 3 167 167 4096 Oct 16 13:35 ..
>> -rw-r--r-- 1 167 167 1760114 Oct 16 14:00 012693.log
>> -rw-r--r-- 1 167 167 52236087 Oct 16 14:00 012695.sst
>>
>> There are no apparent downsides thus far. If everything works well, I will
>> try adding compression to other monitors.
>>
>> /Z
>>
>> On Mon, 16 Oct 2023 at 14:57, Zakhar Kirpichenko <zakhar@gmail.com> wrote:
>>
>>> The issue persists, although to a lesser extent. Any comments from the
>>> Ceph team, please?
>>>
>>> /Z
>>>
>>> On Fri, 13 Oct 2023 at 20:51, Zakhar Kirpichenko <zakhar@gmail.com> wrote:
>>>
>>>>> Some of it is transferable to RocksDB on mons nonetheless.
>>>>
>>>> Please point me to relevant Ceph documentation, i.e. a description of how
>>>> various Ceph monitor and RocksDB tunables affect the operations of
>>>> monitors, and I'll gladly look into it.
>>>>
>>>>> Please point me to such recommendations, if they're on docs.ceph.com I'll
>>>>> get them updated.
>>>>
>>>> These are the recommendations we used when we built our Pacific cluster:
>>>>
>>>> https://docs.ceph.com/en/pacific/start/hardware-recommendations/
>>>>
>>>> Our drives are 4x larger than recommended by this guide. The drives
>>>> are rated for < 0.5 DWPD, which is more than sufficient for boot drives and
>>>> storage of rarely modified files. It is not documented or suggested
>>>> anywhere that monitor processes write several hundred gigabytes of data per
>>>> day, exceeding the amount of data written by OSDs. Which is why I am not
>>>> convinced that what we're observing is expected behavior, but it's not easy
>>>> to get a definitive answer from the Ceph community.
>>>>
>>>>> /Z
>>>>
>>>>> On Fri, 13 Oct 2023 at 20:35, Anthony D'Atri <anthony.datri@gmail.com> wrote:
>>>>
>>>>>> Some of it is transferable to RocksDB on mons nonetheless.
>>>>>>
>>>>>> but their specs exceed Ceph hardware recommendations by a good margin
>>>>>
>>>>>
>>>>> Please point me to such recommendations, if they're on docs.ceph.com I'll
>>>>> get them updated.
>>>>>
>>>>> On Oct 13, 2023, at 13:34, Zakhar Kirpichenko <zakhar@gmail.com> wrote:
>>>>>
>>>>> Thank you, Anthony. As I explained to you earlier, the article you had
>>>>> sent is about RocksDB tuning for Bluestore OSDs, while the issue at hand is
>>>>> not with OSDs but rather with monitors and their RocksDB store. Indeed, the
>>>>> drives are not enterprise-grade, but their specs exceed the Ceph hardware
>>>>> recommendations by a good margin, they're being used as boot drives only,
>>>>> and they aren't supposed to be written to continuously at high rates, which is
>>>>> what unfortunately is happening. I am trying to determine why it is
>>>>> happening and how the issue can be alleviated or resolved; unfortunately,
>>>>> monitor RocksDB usage and tunables appear not to be documented at all.
>>>>>
>>>>> /Z
>>>>>
>>>>> On Fri, 13 Oct 2023 at 20:11, Anthony D'Atri <anthony.datri@gmail.com> wrote:
>>>>>
>>>>>> cf. Mark's article I sent you re RocksDB tuning. I suspect that with
>>>>>> Reef you would experience fewer writes. Universal compaction might also
>>>>>> help, but in the end this SSD is a client SKU and really not suited for
>>>>>> enterprise use. If you had the 1TB SKU you'd get much longer life, or you
>>>>>> could change the overprovisioning on the ones you have.
>>>>>>
>>>>>> On Oct 13, 2023, at 12:30, Zakhar Kirpichenko
<zakhar(a)gmail.com <mailto:zakhar@gmail.com>>
> wrote:
>>>>>>
>>>>>> I would very much appreciate it if someone with a better understanding of
>>>>>> monitor internals and use of RocksDB could please chip in.
>>>>>>
>>>>>>
>>>>>>
>>>>>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io