In my company, we currently have the following infrastructure:
- Ceph Luminous
- OpenStack Pike
We have a cluster of 3 OSD nodes with the following configuration:
- 1 x Xeon(R) D-2146NT CPU @ 2.30GHz
- 128GB RAM
- 128GB root disk
- 12 x 10TB SATA ST10000NM0146 (OSD)
- 1 x Intel Optane P4800X DC 375GB SSD (block.db / block.wal)
- Ubuntu 16.04
- 2 x 10Gb network interfaces bonded with LACP
The compute nodes have 4 x 10Gb network interfaces bonded with LACP.
We also have 4 monitor nodes:
- 4 x 10Gb LACP network interfaces
- approx. 90% CPU idle time, with 32GB of 256GB RAM in use
For each OSD disk we have created a 33GB partition on the Optane SSD for block.db and block.wal.
We have recently been facing performance issues: virtual machines created in OpenStack suffer from slow writes (approx. 50MB/s).
Monitoring on the OSD nodes shows an average of 20% CPU iowait and 70% CPU idle. Memory usage is around 30%. We see no latency problems (9ms average).
My question is whether what we are seeing could be related to the amount of space dedicated to DB/WAL. The Ceph documentation recommends that block.db be no smaller than 4% of the size of the block device.
By that rule, each 10TB disk in my environment would need a block.db of at least 400GB per OSD.
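A quick back-of-the-envelope check of that 4% guideline against our current layout (sizes in decimal units, 1TB = 1000GB, as quoted on the drive labels):

```python
# Rough sizing check for the BlueStore block.db 4% guideline.
# Sizes are decimal (1 TB = 1000 GB), matching drive vendor labelling.

osd_size_gb = 10_000       # one 10TB SATA OSD
db_partition_gb = 33       # our current block.db/block.wal partition

recommended_db_gb = 0.04 * osd_size_gb
print(f"recommended block.db: {recommended_db_gb:.0f} GB")  # 400 GB
print(f"current partition is {db_partition_gb} GB, "
      f"well below the recommendation")
```

So each 33GB partition is well under a tenth of the recommended 400GB, which is what prompted the question.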
Another question: if I reconfigured the OSDs to keep block.db/block.wal on the mechanical disks themselves, would that cause a performance degradation?