Hi,
I think you are being hit by two different problems at the same time. The second problem might
be the same one we also experience, namely that Windows VMs have very strange performance
characteristics with libvirt, the virtio (vd) driver and RBD. With copy operations on very large
files (>2GB) we see a sharp drop in bandwidth after ca. 1 to 1.5GB down to a measly 25MB/s, for
as yet unknown reasons. We cannot reproduce this behaviour with Linux VMs, so chances are rather
high that this is a Windows problem and not a ceph problem.
The first problem, however, has to do with how ceph uses disks. Bare spinning disks have
very poor performance characteristics, and much of the development since their invention has
gone into smart controllers (internal and external) with volatile and persistent caches, and
into OS file buffers, all of which attempt to translate typical user workloads into something
that works reasonably well with spinning drives. The main ideas are to re-order and merge
I/O, cache hot data and absorb I/O bursts for constant write-back. The SANs you are used
to are almost certainly high-end products with all the magic money can currently buy.
Ceph forcefully bypasses all of this logic, and a rule of thumb I follow is that
with ceph and current hardware, current-generation drives will give you the previous
generation's drive performance. With NVMes you can achieve SSD performance, with SSDs
you get good spinning SAS drive performance, and with SAS drives you get, well, floppy or
zip drive performance. I'm afraid that's what you are seeing: 15 VMs saturating
the available aggregate performance of the spindles.
If you want to stick with spindles as a data store, what you need is a fast, reliable
persistent cache. Reliable here means that the firmware is free of bugs with respect to
power outages, which is quite a requirement in itself. Some expensive disk controllers
claim to offer that in the form of a persistent NVMe cache; how much you want to trust the
firmware is a different story. Alternatively, you could consider a few TB-sized NVMe drives for
a ceph cache pool. People report that they are happy with that. As long as the cache pool
can hold all hot data plus write bursts, I would also expect this to work fine.
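Just as a rough sketch of what such a cache tier would look like (the pool names and the
~2TB threshold below are made up, please check the cache tiering documentation before
using any of this):

  # NVMe-backed pool "rbd-cache" as a writeback tier in front of HDD pool "rbd-hdd"
  ceph osd tier add rbd-hdd rbd-cache
  ceph osd tier cache-mode rbd-cache writeback
  ceph osd tier set-overlay rbd-hdd rbd-cache
  # keep the cache bounded so it can keep absorbing write bursts
  ceph osd pool set rbd-cache hit_set_type bloom
  ceph osd pool set rbd-cache target_max_bytes 2000000000000
  ceph osd pool set rbd-cache cache_target_dirty_ratio 0.4
  ceph osd pool set rbd-cache cache_target_full_ratio 0.8

The important part is sizing target_max_bytes and the dirty/full ratios such that flushing
to the HDDs never stalls the cache, otherwise you are back to spindle speed.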
Instead of caching, we decided to go for a split. We use low-cost, datacenter-grade SSDs for
a small all-flash pool holding the OS RBD disks, and a large HDD-only pool for data storage.
This works quite well, since the most annoying simultaneous I/O workload of Windows VMs happens
on the OS disks. For ordinary data access an EC HDD pool is perfectly fine, and we
provision machines with a second large data disk on HDD. Our users are quite happy with
that model.
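To illustrate what such a split might look like on the ceph side (pool names, PG counts and
the EC profile below are just placeholders, not our exact settings):

  # replicated all-flash pool for OS disks (assumes OSDs report device class "ssd")
  ceph osd crush rule create-replicated rule-ssd default host ssd
  ceph osd pool create rbd-os 128 128 replicated rule-ssd
  rbd pool init rbd-os

  # EC pool on HDDs for the large data disks; RBD on EC needs overwrites enabled
  # and a replicated pool to hold the image metadata
  ceph osd erasure-code-profile set ec-hdd k=4 m=2 crush-device-class=hdd
  ceph osd pool create rbd-data 256 256 erasure ec-hdd
  ceph osd pool set rbd-data allow_ec_overwrites true
  ceph osd crush rule create-replicated rule-hdd default host hdd
  ceph osd pool create rbd-meta 64 64 replicated rule-hdd
  rbd pool init rbd-meta
  rbd create --size 2T --data-pool rbd-data rbd-meta/vm01-data

The OS images then live entirely on the flash pool, while the big data images keep only their
metadata on the replicated pool and put all object data on the EC HDD pool.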
In any case, we are still stuck with the strange performance drop on Windows machines that
you also seem to observe, and we are still looking for help with that. If you manage to
figure out what is going on, I would like to hear about it. So far, we haven't found
a clue.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: jcharles(a)provectio.fr <jcharles(a)provectio.fr>
Sent: 11 June 2020 12:38:32
To: ceph-users(a)ceph.io
Subject: [ceph-users] Re: Poor Windows performance on ceph RBD.
Hello,
we are using the same environment, OpenNebula + Ceph.
Our ceph cluster is composed of 5 ceph OSD hosts with SSDs plus 10k rpm and 7.2k rpm spinning
drives, on a 10Gb/s fiber network.
Each spinning OSD has its DB and WAL devices on SSD.
Nearly all our Windows VM RBD images are in a 10k rpm pool with erasure coding.
For the moment we host about 15 VMs (RDS and Exchange).
What we are seeing:
- VMs are far from responding as well as on our old 10k SAN (less than 30%)
- average RBD latency oscillates between 50ms and 250ms, with peaks that can reach a full
second
- some tests (CrystalDiskMark) from inside the VM show performance of up to 700MB/s read
and 170MB/s write, but a single file copy barely reaches 150MB/s and stays at a poor
25MB/s most of the time
- 4K random tests show up to 4k IOPS read and 2k IOPS write, but seen from the RBD side the
image can barely go over 500 IOPS (read+write)
Since we have to migrate our VMs from the old SAN to Ceph, I am really worried: there are
more than 150 VMs on it, and our Ceph seems to have a hard time coping with just 15 VMs.
I can't find accurate data or relevant calculation templates that would let me evaluate
what I can expect.
All the documents I've read (and I read a lot ;) ) only report empirical observations
along the lines of "it's better" or "it's worse".
There are a lot of parameters we can tweak, like block size, striping, stripe size, stripe
count, ... but they are poorly documented, especially the relations between them.
I would be more than happy to work with some people who are in the same situation to try
to find solutions and methods that can help us be confident in our designs, and to break with
the "make the cluster, tweak it, and maybe it will be fine for you" approach. I feel that
each of us (as I read in forums and mailing lists) is a bit lonesome. Google is a real
friend, but I feel it has reached its limits ;)
Maybe my call will reach some volunteers.
Best regards
JC Passard
CTO Provectio
France
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io