With the help of community members, I managed to enable RocksDB compression
for a test monitor, and it seems to be working well.
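For reference, a rough sketch of one way to enable it, via the
mon_rocksdb_options setting; the option string below is illustrative rather
than a recommendation, with compression=kLZ4Compression being the relevant
part. Since the monitor has to open its RocksDB store before it can read the
centralized config, the setting may need to live in the monitor's local
ceph.conf rather than being applied with "ceph config set":

[mon]
mon_rocksdb_options = write_buffer_size=33554432,compression=kLZ4Compression,level_compaction_dynamic_level_bytes=true

Then restart one monitor at a time (e.g. "ceph orch daemon restart
mon.ceph05" on a cephadm-managed cluster) and confirm quorum with "ceph -s"
before moving on to the next one.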
Monitor w/o compression writes about 750 MB to disk in 5 minutes:
4854 be/4 167 4.97 M 755.02 M 0.00 % 0.24 % ceph-mon -n
mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
--default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:low0]
Monitor with LZ4 compression writes about 200 MB, roughly 1/4 of that, over
the same time period:
2034728 be/4 167 172.00 K 199.27 M 0.00 % 0.06 % ceph-mon -n
mon.ceph05 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
--default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:low0]
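(For anyone who wants to reproduce the measurement: the figures above are
accumulated per-thread I/O totals of the kind iotop produces, e.g.:

# iotop -ao -d 10

where -a accumulates totals since iotop started, -o shows only threads that
have done I/O, and -d 10 refreshes every 10 seconds; after about 5 minutes
the DISK WRITE column gives the per-thread totals for the ceph-mon rocksdb
threads.)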
This is explained by the difference in store.db sizes, as smaller SST files
mean compaction rewrites less data.
Mon store.db w/o compression:
# ls -al
/var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db
total 257196
drwxr-xr-x 2 167 167 4096 Oct 16 14:00 .
drwx------ 3 167 167 4096 Aug 31 05:22 ..
-rw-r--r-- 1 167 167 1517623 Oct 16 14:00 3073035.log
-rw-r--r-- 1 167 167 67285944 Oct 16 14:00 3073037.sst
-rw-r--r-- 1 167 167 67402325 Oct 16 14:00 3073038.sst
-rw-r--r-- 1 167 167 62364991 Oct 16 14:00 3073039.sst
Mon store.db with compression:
# ls -al
/var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph05/store.db
total 91188
drwxr-xr-x 2 167 167 4096 Oct 16 14:00 .
drwx------ 3 167 167 4096 Oct 16 13:35 ..
-rw-r--r-- 1 167 167 1760114 Oct 16 14:00 012693.log
-rw-r--r-- 1 167 167 52236087 Oct 16 14:00 012695.sst
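To double-check that the new SST files are actually LZ4-compressed and not
just smaller, the table properties can be inspected with the sst_dump tool
shipped with RocksDB, assuming it is available inside the mon container or on
the host (the exact invocation may vary between RocksDB versions):

# sst_dump --show_properties \
    --file=/var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph05/store.db/012695.sst \
    | grep -i compression

SST files are immutable once written, so reading one from a running monitor
is harmless, although the file may already have been compacted away by the
time you look.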
There are no apparent downsides thus far. If everything works well, I will
try adding compression to other monitors.
/Z
On Mon, 16 Oct 2023 at 14:57, Zakhar Kirpichenko <zakhar(a)gmail.com> wrote:
The issue persists, although to a lesser extent. Any comments from the Ceph
team please?
/Z
On Fri, 13 Oct 2023 at 20:51, Zakhar Kirpichenko <zakhar(a)gmail.com> wrote:
> > Some of it is transferable to RocksDB on mons nonetheless.
>
> Please point me to relevant Ceph documentation, i.e. a description of how
> various Ceph monitor and RocksDB tunables affect the operations of
> monitors, and I'll gladly look into it.
>
> > Please point me to such recommendations; if they're on docs.ceph.com, I'll
> > get them updated.
>
> These are the recommendations we used when we built our Pacific cluster:
>
> https://docs.ceph.com/en/pacific/start/hardware-recommendations/
>
> Our drives are 4x larger than recommended by this guide. The drives are
> rated for < 0.5 DWPD, which is more than sufficient for boot drives and
> storage of rarely modified files. It is not documented or suggested anywhere
> that monitor processes write several hundred gigabytes of data per day,
> exceeding the amount of data written by OSDs, which is why I am not
> convinced that what we're observing is expected behavior, but it's not easy
> to get a definitive answer from the Ceph community.
>
> /Z
>
> On Fri, 13 Oct 2023 at 20:35, Anthony D'Atri <anthony.datri(a)gmail.com>
> wrote:
>
>> Some of it is transferable to RocksDB on mons nonetheless.
>>
>> but their specs exceed Ceph hardware recommendations by a good margin
>>
>>
>> Please point me to such recommendations; if they're on docs.ceph.com, I'll
>> get them updated.
>>
>> On Oct 13, 2023, at 13:34, Zakhar Kirpichenko <zakhar(a)gmail.com> wrote:
>>
>> Thank you, Anthony. As I explained to you earlier, the article you sent is
>> about RocksDB tuning for Bluestore OSDs, while the issue at hand is not
>> with OSDs but with monitors and their RocksDB store. Indeed, the drives are
>> not enterprise-grade, but their specs exceed the Ceph hardware
>> recommendations by a good margin; they're being used as boot drives only
>> and aren't supposed to be written to continuously at high rates - which is
>> unfortunately what is happening. I am trying to determine why it is
>> happening and how the issue can be alleviated or resolved; unfortunately,
>> monitor RocksDB usage and tunables appear not to be documented at all.
>>
>> /Z
>>
>> On Fri, 13 Oct 2023 at 20:11, Anthony D'Atri <anthony.datri(a)gmail.com>
>> wrote:
>>
>>> cf. Mark's article I sent you re RocksDB tuning. I suspect that with
>>> Reef you would experience fewer writes. Universal compaction might also
>>> help, but in the end this SSD is a client SKU and really not suited for
>>> enterprise use. If you had the 1TB SKU you'd get much longer life, or you
>>> could change the overprovisioning on the ones you have.
>>>
>>> On Oct 13, 2023, at 12:30, Zakhar Kirpichenko <zakhar(a)gmail.com> wrote:
>>>
>>> I would very much appreciate it if someone with a better understanding of
>>> monitor internals and use of RocksDB could please chip in.
>>>