Hello,
We are using the same environment: OpenNebula + Ceph.
Our Ceph cluster is composed of 5 Ceph OSD hosts with SSDs, 10k rpm and 7.2k rpm spinning
disks, and a 10 Gb/s fiber network.
Each spinning OSD has its DB and WAL devices on SSD.
Nearly all our Windows VM RBD images are in a 10k rpm pool with erasure coding.
For the moment we are hosting about 15 VMs (RDS and Exchange).
What we are noticing:
- VMs are far from responding as well as on our old 10k SAN (less than 30%)
- Average RBD latency oscillates between 50 ms and 250 ms, with some peaks that can
reach one second
- some tests (CrystalDiskMark) from inside the VM show performance up to 700 MB/s
on read and 170 MB/s on write, but a single file copy barely reaches 150 MB/s and stays at a
poor 25 MB/s most of the time
- tests on 4K random I/O show up to 4k IOPS read and 2k IOPS write, but
seen from the RBD side, the image can barely go over 500 IOPS
(read + write)
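A quick sanity check of these numbers, as a minimal sketch only: Little's law says that the average number of outstanding I/Os equals throughput multiplied by latency. The function names below are mine, and the inputs are just the figures quoted above (500 IOPS, 50 ms to 250 ms), so treat the result as an order of magnitude and not as a measurement.

```python
def implied_queue_depth(iops: float, latency_s: float) -> float:
    """Little's law: average outstanding I/Os = throughput * latency."""
    return iops * latency_s


def max_iops(queue_depth: float, latency_s: float) -> float:
    """Upper bound on IOPS for a given number of outstanding I/Os."""
    return queue_depth / latency_s


if __name__ == "__main__":
    observed_iops = 500  # per-image figure seen from the RBD side
    for latency_ms in (50, 250):
        latency_s = latency_ms / 1000
        qd = implied_queue_depth(observed_iops, latency_s)
        single = max_iops(1, latency_s)
        print(f"{latency_ms} ms: ~{qd:.0f} I/Os in flight for {observed_iops} IOPS, "
              f"~{single:.0f} IOPS with a single outstanding I/O")
```

If this reasoning holds, a workload that issues one I/O at a time (a simple file copy, for example) would be limited to a handful of IOPS at these latencies, while a benchmark running with a deep queue could still report much higher figures.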
Since we have to migrate our VMs from the old SAN to Ceph, I am really worried: there are
more than 150 VMs on it, and our Ceph cluster seems to have a hard time coping with 15 VMs.
I can't find accurate data and relevant calculation templates that would let me
evaluate what I can expect.
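To illustrate the kind of template I am looking for, here is a rough sketch under simplifying assumptions. It assumes that in a k+m erasure-coded pool every client write ends up as one write on each of the k+m shards, and every client read touches k shards. It ignores the SSD DB/WAL, caching, read-modify-write on partial stripes and recovery traffic. The disk count, per-disk IOPS and the 3+2 profile are hypothetical values, not our real configuration.

```python
def ec_iops_ceiling(n_disks: int, iops_per_disk: float, k: int, m: int) -> dict:
    """Very rough aggregate IOPS ceiling for an erasure-coded pool.

    Assumption (mine): one client write costs k+m backend writes,
    one client read costs k backend reads, spread evenly over all disks.
    """
    if k < 1 or m < 0:
        raise ValueError("k must be >= 1 and m must be >= 0")
    raw = n_disks * iops_per_disk          # total backend IOPS available
    return {
        "raw": raw,
        "write": raw / (k + m),            # each write hits k+m shards
        "read": raw / k,                   # each read hits k shards
    }


if __name__ == "__main__":
    # Hypothetical: 20 spinning disks at 150 IOPS each, EC profile k=3, m=2.
    estimate = ec_iops_ceiling(n_disks=20, iops_per_disk=150, k=3, m=2)
    for name, value in estimate.items():
        print(f"{name}: ~{value:.0f} IOPS")
```

Even as a crude ceiling, something like this would at least tell whether 150 VMs are in the same order of magnitude as what the spindles can deliver.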
All the documents I've read (and I have read a lot ;) ) only report empirical
observations such as "it's better" or "it's worse".
There are a lot of parameters we can tweak, like block size, striping, stripe size, stripe
count, ... but those are poorly documented, especially the relations between them.
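As far as I understand it, the relation between stripe unit, stripe count and object size can be sketched as below: consecutive stripe units are written round-robin across stripe_count objects, and a new set of objects starts once those are full. The function name and the example values are mine, and this is my reading of the striping scheme, not code taken from Ceph.

```python
def map_offset(offset: int, stripe_unit: int, stripe_count: int, object_size: int):
    """Map a byte offset in an image to (object number, offset inside that object)."""
    if object_size % stripe_unit != 0:
        raise ValueError("object_size must be a multiple of stripe_unit")
    units_per_object = object_size // stripe_unit
    block_no = offset // stripe_unit            # index of the stripe unit
    stripe_no = block_no // stripe_count        # which stripe (row) it is in
    stripe_pos = block_no % stripe_count        # which object of the set (column)
    object_set = stripe_no // units_per_object  # which group of stripe_count objects
    object_no = object_set * stripe_count + stripe_pos
    offset_in_object = (stripe_no % units_per_object) * stripe_unit + offset % stripe_unit
    return object_no, offset_in_object


if __name__ == "__main__":
    KiB, MiB = 1024, 1024 * 1024
    # No striping: stripe unit equal to the object size, stripe count of 1.
    print(map_offset(5 * MiB, 4 * MiB, 1, 4 * MiB))
    # Hypothetical striped layout: 64 KiB units across 4 objects of 4 MiB.
    for off in (0, 64 * KiB, 256 * KiB, 16 * MiB):
        print(off, map_offset(off, 64 * KiB, 4, 4 * MiB))
```

With a stripe count of 1, a sequential write stays on one object (and so on one placement group) for a full object size, while a larger stripe count spreads consecutive small units over several objects.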
I would be more than happy to work with people who are in the same situation, to try
to find solutions and methods that can help us be sure of our design, and to break the
"make the cluster, tweak it, and maybe it will be fine for you" approach. I feel that
each of us (as I read in forums and mailing lists) is a bit lonesome. Google is a real
friend, but I feel it has reached its limits ;)
Maybe my call will reach some volunteers.
Best regards
JC Passard
CTO Provectio
France