[ceph-users] Re: Ceph 16.2.x mon compactions, disk writes

18 Oct 2023

Many thanks for this, Eugen! I very much appreciate your and Mykola's
efforts and insight!

Another thing I noticed was that the RocksDB store shrank after the total
PG count was reduced by 30%, from 590-600 MB:

65M     3675511.sst
65M     3675512.sst
65M     3675513.sst
65M     3675514.sst
65M     3675515.sst
65M     3675516.sst
65M     3675517.sst
65M     3675518.sst
62M     3675519.sst

to about half of the original size:

-rw-r--r-- 1 167 167  7218886 Oct 13 16:16 3056869.log
-rw-r--r-- 1 167 167 67250650 Oct 13 16:15 3056871.sst
-rw-r--r-- 1 167 167 67367527 Oct 13 16:15 3056872.sst
-rw-r--r-- 1 167 167 63268486 Oct 13 16:15 3056873.sst

Then, when I restarted the monitors one by one before adding compression,
the RocksDB store shrank even further. I am not sure why, or what exactly
was automatically removed from the store:

-rw-r--r-- 1 167 167   841960 Oct 18 03:31 018779.log
-rw-r--r-- 1 167 167 67290532 Oct 18 03:31 018781.sst
-rw-r--r-- 1 167 167 53287626 Oct 18 03:31 018782.sst

Then I enabled LZ4 and LZ4HC compression in our small production
cluster (6 nodes, 96 OSDs) on 3 out of 5
monitors: compression=kLZ4Compression,bottommost_compression=kLZ4HCCompression.
I specifically went for LZ4 and LZ4HC because of the balance between
compression/decompression speed and impact on CPU usage. The compression
doesn't seem to affect the cluster in any negative way; the 3 monitors with
compression are operating normally. The effect of the compression on
RocksDB store size and disk writes is quite noticeable:

Compression disabled, 155 MB store.db, ~125 MB RocksDB sst, and ~530 MB of
writes over 5 minutes:

-rw-r--r-- 1 167 167  4227337 Oct 18 03:58 3080868.log
-rw-r--r-- 1 167 167 67253592 Oct 18 03:57 3080870.sst
-rw-r--r-- 1 167 167 57783180 Oct 18 03:57 3080871.sst

# du -hs /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/; iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon
155M    /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/
2471602 be/4 167           6.05 M    473.24 M  0.00 %  0.16 % ceph-mon -n
mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
 --default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:low0]
2471633 be/4 167         188.00 K     40.91 M  0.00 %  0.02 % ceph-mon -n
mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
 --default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [ms_dispatch]
2471603 be/4 167          16.00 K     24.16 M  0.00 %  0.01 % ceph-mon -n
mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
 --default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:high0]

Compression enabled, 60 MB store.db, ~23 MB RocksDB sst, and ~130 MB of
writes over 5 minutes:

-rw-r--r-- 1 167 167  5766659 Oct 18 03:56 3723355.log
-rw-r--r-- 1 167 167 22240390 Oct 18 03:56 3723357.sst

# du -hs /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/; iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon
60M     /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/
2052031 be/4 167        1040.00 K     83.48 M  0.00 %  0.01 % ceph-mon -n
mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
 --default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:low0]
2052062 be/4 167           0.00 B     40.79 M  0.00 %  0.01 % ceph-mon -n
mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
 --default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [ms_dispatch]
2052032 be/4 167          16.00 K      4.68 M  0.00 %  0.00 % ceph-mon -n
mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
 --default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:high0]
2052052 be/4 167          44.00 K      0.00 B  0.00 %  0.00 % ceph-mon -n
mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
 --default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [msgr-worker-0]
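
For reference, the per-thread iotop figures above can be totalled and extrapolated to a full day. This is only a rough sketch: it assumes the 5-minute samples are representative of the whole day, treats iotop's "M" loosely as MB, and the variable and function names are mine.

```python
# Per-thread disk writes (MB) from the two iotop samples above (300 s each).
# mon.ceph04, no compression: rocksdb:low0, ms_dispatch, rocksdb:high0
uncompressed_mb = [473.24, 40.91, 24.16]
# mon.ceph03, LZ4/LZ4HC: rocksdb:low0, ms_dispatch, rocksdb:high0, msgr-worker-0
compressed_mb = [83.48, 40.79, 4.68, 0.00]

SAMPLE_SECONDS = 300
SAMPLES_PER_DAY = 24 * 60 * 60 // SAMPLE_SECONDS  # 288 five-minute windows


def daily_writes_gb(sample_mb):
    """Extrapolate one sample to a day, ASSUMING a constant write rate."""
    return sum(sample_mb) * SAMPLES_PER_DAY / 1000


total_plain = sum(uncompressed_mb)
total_lz4 = sum(compressed_mb)
reduction = total_plain / total_lz4

print(f"no compression: {total_plain:.2f} MB / 5 min, "
      f"~{daily_writes_gb(uncompressed_mb):.0f} GB/day")
print(f"LZ4 + LZ4HC:    {total_lz4:.2f} MB / 5 min, "
      f"~{daily_writes_gb(compressed_mb):.0f} GB/day")
print(f"write reduction: {reduction:.1f}x")
```

The totals match the rounded ~530 MB and ~130 MB figures quoted above, i.e. roughly a fourfold reduction in monitor writes on these two hosts.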

I haven't noticed a major CPU impact. Unfortunately I didn't specifically
measure CPU time for the monitors, but overall the CPU impact of monitor
store compression on our systems isn't noticeable. This may be different
for larger clusters with larger RocksDB datasets; in that case perhaps
compression=kLZ4Compression could be enabled by default and
bottommost_compression=kLZ4HCCompression could be optional. In theory this
should result in a lower compression ratio but much faster compression.
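
A minimal sketch of how the two settings could be merged into a comma-separated RocksDB option string, with and without the optional bottommost setting. I'm assuming here that the string is applied through the `mon_rocksdb_options` config option and that the base string below resembles the current value; neither is stated in this message, so check `ceph config get mon mon_rocksdb_options` on your own cluster first. The helper name is made up for illustration.

```python
def merge_rocksdb_options(base: str, overrides: dict) -> str:
    """Apply overrides to a comma-separated key=value option string.

    Keys already present in `base` keep their position; new keys are appended.
    """
    options = {}
    for item in base.split(","):
        if not item.strip():
            continue  # tolerate empty segments such as a trailing comma
        key, _, value = item.partition("=")
        options[key.strip()] = value.strip()
    options.update(overrides)
    return ",".join(f"{key}={value}" for key, value in options.items())


# ASSUMPTION: example base value only; use what your monitors actually report.
base = ("write_buffer_size=33554432,compression=kNoCompression,"
        "level_compaction_dynamic_level_bytes=true")

# LZ4 everywhere (the "could be enabled by default" variant).
lz4_only = merge_rocksdb_options(base, {"compression": "kLZ4Compression"})

# LZ4 plus LZ4HC on the bottommost level (the settings used on our 3 monitors).
lz4_and_hc = merge_rocksdb_options(base, {
    "compression": "kLZ4Compression",
    "bottommost_compression": "kLZ4HCCompression",
})

print(lz4_only)
print(lz4_and_hc)
```

The script only builds strings; applying the result and restarting the monitors one by one is left to the operator.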

I hope this helps. My plan is to keep the monitors with the current
settings, i.e. 3 with compression + 2 without compression, until the next
minor release of Pacific to see whether the monitors with compressed
RocksDB store can be upgraded without issues.

/Z

On Tue, 17 Oct 2023 at 23:45, Eugen Block <eblock(a)nde.ag> wrote:

  Hi Zakhar,

 I took a closer look into what the MONs really do (again with Mykola's
 help) and why manual compaction is triggered so frequently. With
 debug_paxos=20 I noticed that paxosservice and paxos triggered manual
 compactions. So I played with these values:

 paxos_service_trim_max = 1000 (default 500)
 paxos_service_trim_min = 500 (default 250)
 paxos_trim_max = 1000 (default 500)
 paxos_trim_min = 500 (default 250)

 This reduced the amount of writes by a factor of 3 or 4; the iotop
 values fluctuate a bit, of course. As Mykola suggested, I created
 a tracker issue [1] to increase the default values since they don't
 seem suitable for a production environment. Although I haven't
 tested that in production yet, I'll ask one of our customers to do that
 in their secondary cluster (for rbd mirroring), where they also suffer
 from large mon stores and heavy writes to the mon store. Your findings
 with the compaction were quite helpful too; we'll test that as well.
 Igor mentioned that the default bluestore_rocksdb config for OSDs will
 enable compression because of positive test results. If we can confirm
 that compression works well for MONs too, compression could be enabled
 by default as well.
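
For anyone wanting to try the same values, here is a small sketch that prints the corresponding `ceph config set` commands. It assumes the options are set at the `mon` level of the central config database, which the message above doesn't specify; it only prints the commands and doesn't run anything. The function name is mine.

```python
# Trim values tested above; comments give the defaults quoted in the message.
trim_settings = {
    "paxos_service_trim_max": 1000,  # default 500
    "paxos_service_trim_min": 500,   # default 250
    "paxos_trim_max": 1000,          # default 500
    "paxos_trim_min": 500,           # default 250
}


def config_commands(settings, who="mon"):
    """Build one `ceph config set` command line per option (ASSUMES `who` = mon)."""
    return [f"ceph config set {who} {key} {value}"
            for key, value in settings.items()]


for command in config_commands(trim_settings):
    print(command)
```

Printing instead of executing makes it easy to review the commands, or to try them on a test cluster first.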

 Regards,
 Eugen

 [1] https://tracker.ceph.com/issues/63229

 Quoting Zakhar Kirpichenko <zakhar(a)gmail.com>:

  With the help of community members, I managed to enable RocksDB compression
  for a test monitor, and it seems to be working well.

 Monitor w/o compression writes about 750 MB to disk in 5 minutes:

    4854 be/4 167           4.97 M    755.02 M  0.00 %  0.24 % ceph-mon -n
 mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
 --default-log-to-stderr=true --default-log-stderr-prefix=debug
  --default-mon-cluster-log-to-file=false
 --default-mon-cluster-log-to-stderr=true [rocksdb:low0]

 Monitor with LZ4 compression writes about 1/4 of that over the same time
 period:

 2034728 be/4 167         172.00 K    199.27 M  0.00 %  0.06 % ceph-mon -n
 mon.ceph05 -f --setuser ceph --setgroup ceph --default-log-to-file=false
 --default-log-to-stderr=true --default-log-stderr-prefix=debug
  --default-mon-cluster-log-to-file=false
 --default-mon-cluster-log-to-stderr=true [rocksdb:low0]

 This is caused by the apparent difference in store.db sizes.

 Mon store.db w/o compression:

 # ls -al /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db
 total 257196
 drwxr-xr-x 2 167 167     4096 Oct 16 14:00 .
 drwx------ 3 167 167     4096 Aug 31 05:22 ..
 -rw-r--r-- 1 167 167  1517623 Oct 16 14:00 3073035.log
 -rw-r--r-- 1 167 167 67285944 Oct 16 14:00 3073037.sst
 -rw-r--r-- 1 167 167 67402325 Oct 16 14:00 3073038.sst
 -rw-r--r-- 1 167 167 62364991 Oct 16 14:00 3073039.sst

 Mon store.db with compression:

 # ls -al /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph05/store.db
 total 91188
 drwxr-xr-x 2 167 167     4096 Oct 16 14:00 .
 drwx------ 3 167 167     4096 Oct 16 13:35 ..
 -rw-r--r-- 1 167 167  1760114 Oct 16 14:00 012693.log
 -rw-r--r-- 1 167 167 52236087 Oct 16 14:00 012695.sst

 There are no apparent downsides thus far. If everything works well, I will
 try adding compression to other monitors.

 /Z

 On Mon, 16 Oct 2023 at 14:57, Zakhar Kirpichenko <zakhar(a)gmail.com> wrote:

> The issue persists, although to a lesser extent. Any comments from the
> Ceph team please?
>
> /Z
>
> On Fri, 13 Oct 2023 at 20:51, Zakhar Kirpichenko <zakhar(a)gmail.com> wrote:
>
>> > Some of it is transferable to RocksDB on mons nonetheless.
>>
>> Please point me to relevant Ceph documentation, i.e. a description of how
>> various Ceph monitor and RocksDB tunables affect the operations of
>> monitors, I'll gladly look into it.
>>
>> > Please point me to such recommendations, if they're on docs.ceph.com I'll
>> > get them updated.
>>
>> These are the recommendations we used when we built our Pacific cluster:
>> https://docs.ceph.com/en/pacific/start/hardware-recommendations/
>>
>> Our drives are 4x larger than recommended by this guide. The drives
>> are rated for < 0.5 DWPD, which is more than sufficient for boot drives and
>> storage of rarely modified files. It is not documented or suggested
>> anywhere that monitor processes write several hundred gigabytes of data per
>> day, exceeding the amount of data written by OSDs. Which is why I am not
>> convinced that what we're observing is expected behavior, but it's not easy
>> to get a definitive answer from the Ceph community.
>>
>> /Z
>>
>> On Fri, 13 Oct 2023 at 20:35, Anthony D'Atri <anthony.datri(a)gmail.com>
>> wrote:
>>
>>> Some of it is transferable to RocksDB on mons nonetheless.
>>>
>>> but their specs exceed Ceph hardware recommendations by a good margin
>>>
>>>
>>> Please point me to such recommendations, if they're on docs.ceph.com I'll
>>> get them updated.
>>>
>>> On Oct 13, 2023, at 13:34, Zakhar Kirpichenko <zakhar(a)gmail.com> wrote:
>>>
>>> Thank you, Anthony. As I explained to you earlier, the article you had
>>> sent is about RocksDB tuning for Bluestore OSDs, while the issue at hand is
>>> not with OSDs but rather monitors and their RocksDB store. Indeed, the
>>> drives are not enterprise-grade, but their specs exceed Ceph hardware
>>> recommendations by a good margin, they're being used as boot drives only
>>> and aren't supposed to be written to continuously at high rates - which is
>>> what unfortunately is happening. I am trying to determine why it is
>>> happening and how the issue can be alleviated or resolved, unfortunately
>>> monitor RocksDB usage and tunables appear to be not documented at all.
>>>
>>> /Z
>>>
>>> On Fri, 13 Oct 2023 at 20:11, Anthony D'Atri <anthony.datri(a)gmail.com>
>>> wrote:
>>>
>>>> cf. Mark's article I sent you re RocksDB tuning.  I suspect that with
>>>> Reef you would experience fewer writes.  Universal compaction might also
>>>> help, but in the end this SSD is a client SKU and really not suited for
>>>> enterprise use.  If you had the 1TB SKU you'd get much longer life, or you
>>>> could change the overprovisioning on the ones you have.
>>>>
>>>> On Oct 13, 2023, at 12:30, Zakhar Kirpichenko <zakhar(a)gmail.com> wrote:
>>>>
>>>> I would very much appreciate it if someone with a better understanding of
>>>> monitor internals and use of RocksDB could please chip in.
>>>>
 _______________________________________________
 ceph-users mailing list -- ceph-users(a)ceph.io
 To unsubscribe send an email to ceph-users-leave(a)ceph.io 
