I am having trouble getting consistent RBD latency from our cluster to our KVM virtual machines, which attach images via the KRBD driver. When measuring with tools like rbd perf image iotop, we constantly see latency spike from around 1-2 ms to 100+ ms, which seems to kill SQL Server performance in our Windows VMs. I essentially have two questions:
1) Am I missing something in my configuration that should be applied to get consistently low latency to the VM guests?
2) When measuring the disks, sequential IO seems to show higher latency than random IO. Is this expected, or is there a way to tune it with the KRBD driver?
3 x MON/MGR nodes
12 x OSD nodes (24 x HDD, 2 x NVMe for DB and WAL)
KVM clients attaching the RBD images via KRBD
1 pool w/ 16384 PGs
Ceph version 14.2.1
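One thing that stands out from the numbers above: 16384 PGs replicated 3x across 288 OSDs works out to roughly 170 PG replicas per OSD, above the commonly cited target of ~100, which can add peering and memory overhead. A quick back-of-the-envelope check using only the counts from the setup above:

```python
# PG-per-OSD check for the cluster described above.
osd_nodes = 12
hdds_per_node = 24
replication = 3          # osd pool default size = 3
pgs = 16384              # the single pool's PG count

total_osds = osd_nodes * hdds_per_node   # 288 OSDs
pg_replicas = pgs * replication          # 49152 PG replicas cluster-wide
pgs_per_osd = pg_replicas / total_osds

print(f"{total_osds} OSDs, ~{pgs_per_osd:.0f} PG replicas per OSD")
# → 288 OSDs, ~171 PG replicas per OSD; the usual rule of thumb targets ~100.
```

This alone would not explain 100 ms spikes, but it is worth ruling out alongside the HDD backend.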
mon host = 10.97.11.17,10.97.11.27,10.97.11.37
public network = 10.97.11.0/24
cluster network = 10.97.12.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd journal size = 30720
osd pool default size = 3
osd pool default min size = 2
osd pool default pg num = 4096
osd pool default pgp num = 4096
osd crush chooseleaf type = 1
bluestore_default_buffered_write = false
bluestore_default_buffered_read = true
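On question 2, it may help to measure sequential vs. random latency reproducibly inside a guest rather than relying on rbd perf image iotop alone. A minimal fio job file along these lines (the device path, block size, and runtime are placeholders for your environment, not values from my setup) runs both patterns against the same disk and reports completion-latency percentiles:

```ini
; lat-compare.fio — hypothetical example; adjust filename/bs/runtime to suit
[global]
ioengine=libaio
direct=1
bs=4k
iodepth=1
runtime=60
time_based=1
filename=/dev/vdb

[seq-write]
rw=write
stonewall

[rand-write]
rw=randwrite
stonewall
```

With iodepth=1 the clat percentiles in fio's output approximate per-IO latency, so comparing the two jobs shows whether the sequential pattern really incurs higher latency end to end.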
Does anyone know a way I could get the actual disk usage of the objects
inside an RBD pool and a CephFS pool?
I am trying to understand whether object X occupies 10G of space and object Y
20G of space, and whether it is mounted by the Alpha client or the Beta client...
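For RBD images, rbd du <pool>/<image> (or rbd du <pool> for the whole pool) reports provisioned vs. actually used space, provided the images have the fast-diff feature enabled; for CephFS, the recursive statistics exposed via getfattr -n ceph.dir.rbytes <dir> give per-directory usage. As a sketch, assuming the JSON shape that rbd du --format json emits in Nautilus (an "images" list with "name" and "used_size" fields; verify against your cluster), per-image usage could be tabulated like this:

```python
import json
import subprocess

def rbd_usage(pool):
    """Return {image_name: used_bytes} for a pool via `rbd du --format json`.

    Assumes the Nautilus-era JSON layout with an "images" list containing
    "name" and "used_size" fields; check your cluster's actual output.
    """
    out = subprocess.check_output(["rbd", "du", "--format", "json", pool])
    data = json.loads(out)
    return {img["name"]: img["used_size"] for img in data.get("images", [])}

# Parsing a made-up sample in the assumed format (10 GiB and 20 GiB used):
sample = '''{"images": [
  {"name": "vm-alpha", "provisioned_size": 21474836480, "used_size": 10737418240},
  {"name": "vm-beta",  "provisioned_size": 42949672960, "used_size": 21474836480}
], "total_provisioned_size": 64424509440, "total_used_size": 32212254720}'''
usage = {img["name"]: img["used_size"] for img in json.loads(sample)["images"]}
print(usage)
```

This tells you per-image usage; mapping an image to the Alpha or Beta client still has to come from your own records of which host maps which image (e.g. rbd status <pool>/<image> lists watchers).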
Thank you very much for any input,
I have been experimenting with Ceph, mostly for KVM image storage, but I have been unable to get decent RBD performance with sequential writes. Reads are mostly fine, and random read/write is handled fine as well. I appear to be limited to around 30-40 MB/s write speed on 3 hard drives (pool size 2). I moved my tests over to a different single node and OSD performance remained the same. I have experimented with the BlueStore cache and the RBD cache, and the speed remains fairly constant apart from a burst when the write test begins.
Looking at iostat on the host, the LVM volumes backing osd-block appear close to maxed-out utilization (~80-100%), but the raw devices show anywhere between 10-40%, and the write speed of a single drive should be a lot higher. During my various tests I have run rados benchmarks, and those give results much closer to what I was expecting. Also, sometimes all the IO hits a single OSD, regardless of the stripe size on the RBD image.
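The "all IO hits a single OSD" symptom is consistent with RBD's default layout: an image is cut into 4 MB objects (object order 22), so a sequential stream writes to one object, and therefore one primary OSD, at a time until it crosses a 4 MB boundary. A small sketch of that default offset-to-object mapping (simplified; it ignores the --stripe-unit/--stripe-count options, which interleave objects to spread sequential load):

```python
# Map byte offsets of a sequential write to RBD object indices,
# assuming the default layout: 4 MB objects, no custom striping.
OBJECT_SIZE = 4 * 1024 * 1024  # default rbd object size (order 22)

def object_index(offset):
    """Which backing object a given byte offset falls into."""
    return offset // OBJECT_SIZE

# A 16 MB sequential write issued in 1 MB chunks:
chunks = [object_index(off) for off in range(0, 16 * 1024 * 1024, 1024 * 1024)]
print(chunks)  # → [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
```

Each object maps (via CRUSH) to a single placement group and primary OSD, so the first four chunks above land on the same OSD back to back, which would match the single-OSD hot spot seen during sequential writes.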
After various trial and error I tried putting the DB on an SSD, and while it is not my ideal situation (another point of failure), it appears to have helped slightly. So my theory that the RocksDB sync writes were the sole issue did not hold. Interestingly, however, I noticed that CephFS gives me greater performance than RBD, even just using a raw image file for a VM (random IO does suffer, however); it is still not what I would have hoped for from my setup, around ~80-120 MB/s on 3 OSDs, size 2.
This was also the case with Mimic, and was one of the reasons I upgraded and recreated all the BlueStore OSDs, but without any significant gains. I was previously using ZFS and was able to maintain over 200 MB/s with a similar setup.
I have been trying to crack this issue for many weeks, but I feel I have just been going in circles, and I want to use RBD since most of my data is VM storage. Using CephFS for the performance boost would mean I lose many of the advantages of RBD, such as snapshots and ease of use across multiple pools, plus CephFS random write IO is nowhere near as good.
Is there an easy way to make RBD perform as well as CephFS, or to use CephFS while keeping the above benefits?
ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)