With the help of community members, I managed to enable RocksDB compression
for a test monitor, and it seems to be working well.
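For reference, a rough sketch of one way to enable it, via the
mon_rocksdb_options setting; the option string below is illustrative rather
than a recommendation, with compression=kLZ4Compression being the relevant
part. Since the monitor has to open its RocksDB store before it can read the
centralized config, the setting may need to live in the monitor's local
ceph.conf rather than being applied with "ceph config set":

[mon]
mon_rocksdb_options = write_buffer_size=33554432,compression=kLZ4Compression,level_compaction_dynamic_level_bytes=true

Then restart one monitor at a time (e.g. "ceph orch daemon restart
mon.ceph05" on a cephadm-managed cluster) and confirm quorum with "ceph -s"
before moving on to the next one.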
Monitor w/o compression writes about 750 MB to disk in 5 minutes:
4854 be/4 167 4.97 M 755.02 M 0.00 % 0.24 % ceph-mon -n
mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
--default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:low0]
Monitor with LZ4 compression writes about 200 MB, roughly 1/4 of that, over
the same time period:
2034728 be/4 167 172.00 K 199.27 M 0.00 % 0.06 % ceph-mon -n
mon.ceph05 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
--default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:low0]
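(For anyone who wants to reproduce the measurement: the figures above are
accumulated per-thread I/O totals of the kind iotop produces, e.g.:

# iotop -ao -d 10

where -a accumulates totals since iotop started, -o shows only threads that
have done I/O, and -d 10 refreshes every 10 seconds; after about 5 minutes
the DISK WRITE column gives the per-thread totals for the ceph-mon rocksdb
threads.)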
This is explained by the difference in store.db sizes, as smaller SST files
mean compaction rewrites less data.
Mon store.db w/o compression:
# ls -al
/var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db
total 257196
drwxr-xr-x 2 167 167 4096 Oct 16 14:00 .
drwx------ 3 167 167 4096 Aug 31 05:22 ..
-rw-r--r-- 1 167 167 1517623 Oct 16 14:00 3073035.log
-rw-r--r-- 1 167 167 67285944 Oct 16 14:00 3073037.sst
-rw-r--r-- 1 167 167 67402325 Oct 16 14:00 3073038.sst
-rw-r--r-- 1 167 167 62364991 Oct 16 14:00 3073039.sst
Mon store.db with compression:
# ls -al
/var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph05/store.db
total 91188
drwxr-xr-x 2 167 167 4096 Oct 16 14:00 .
drwx------ 3 167 167 4096 Oct 16 13:35 ..
-rw-r--r-- 1 167 167 1760114 Oct 16 14:00 012693.log
-rw-r--r-- 1 167 167 52236087 Oct 16 14:00 012695.sst
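To double-check that the new SST files are actually LZ4-compressed and not
just smaller, the table properties can be inspected with the sst_dump tool
shipped with RocksDB, assuming it is available inside the mon container or on
the host (the exact invocation may vary between RocksDB versions):

# sst_dump --show_properties \
    --file=/var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph05/store.db/012695.sst \
    | grep -i compression

SST files are immutable once written, so reading one from a running monitor
is harmless, although the file may already have been compacted away by the
time you look.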
There are no apparent downsides thus far. If everything works well, I will
try adding compression to other monitors.
/Z
On Mon, 16 Oct 2023 at 14:57, Zakhar Kirpichenko <zakhar(a)gmail.com> wrote:
The issue persists, although to a lesser extent. Any comments from the Ceph
team please?
/Z
On Fri, 13 Oct 2023 at 20:51, Zakhar Kirpichenko <zakhar(a)gmail.com> wrote:
> > Some of it is transferable to RocksDB on mons nonetheless.
>
> Please point me to relevant Ceph documentation, i.e. a description of how
> various Ceph monitor and RocksDB tunables affect the operations of
> monitors, and I'll gladly look into it.
>
> > Please point me to such recommendations; if they're on docs.ceph.com, I'll
> > get them updated.
>
> These are the recommendations we used when we built our Pacific cluster:
>
> https://docs.ceph.com/en/pacific/start/hardware-recommendations/
>
> Our drives are 4x larger than recommended by this guide. The drives are
> rated for < 0.5 DWPD, which is more than sufficient for boot drives and
> storage of rarely modified files. It is not documented or suggested anywhere
> that monitor processes write several hundred gigabytes of data per day,
> exceeding the amount of data written by OSDs, which is why I am not
> convinced that what we're observing is expected behavior, but it's not easy
> to get a definitive answer from the Ceph community.
>
> /Z
>
> On Fri, 13 Oct 2023 at 20:35, Anthony D'Atri <anthony.datri(a)gmail.com>
> wrote:
>
>> Some of it is transferable to RocksDB on mons nonetheless.
>>
>> but their specs exceed Ceph hardware recommendations by a good margin
>>
>>
>> Please point me to such recommendations; if they're on docs.ceph.com, I'll
>> get them updated.
>>
>> On Oct 13, 2023, at 13:34, Zakhar Kirpichenko <zakhar(a)gmail.com> wrote:
>>
>> Thank you, Anthony. As I explained to you earlier, the article you sent is
>> about RocksDB tuning for Bluestore OSDs, while the issue at hand is not
>> with OSDs but with monitors and their RocksDB store. Indeed, the drives are
>> not enterprise-grade, but their specs exceed the Ceph hardware
>> recommendations by a good margin; they're being used as boot drives only
>> and aren't supposed to be written to continuously at high rates - which is
>> unfortunately what is happening. I am trying to determine why it is
>> happening and how the issue can be alleviated or resolved; unfortunately,
>> monitor RocksDB usage and tunables appear not to be documented at all.
>>
>> /Z
>>
>> On Fri, 13 Oct 2023 at 20:11, Anthony D'Atri <anthony.datri(a)gmail.com>
>> wrote:
>>
>>> cf. Mark's article I sent you re RocksDB tuning. I suspect that with
>>> Reef you would experience fewer writes. Universal compaction might also
>>> help, but in the end this SSD is a client SKU and really not suited for
>>> enterprise use. If you had the 1TB SKU you'd get much longer life, or you
>>> could change the overprovisioning on the ones you have.
>>>
>>> On Oct 13, 2023, at 12:30, Zakhar Kirpichenko <zakhar(a)gmail.com> wrote:
>>>
>>> I would very much appreciate it if someone with a better understanding of
>>> monitor internals and use of RocksDB could please chip in.
>>>