Not related to the original topic, but the Micron case in that article is fascinating and a
little surprising.
With pretty much best-in-class hardware in a lab environment:
A potential 25,899,072 4KiB random write IOPS drops to 477K.
A potential 23,826,216 4KiB random read IOPS drops to 2,000,000.
477K write IOPS and 2M read IOPS isn't terrible, especially given that there is replication,
but when you look at the numbers, the overhead is still staggering.
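Put differently: 477K / 25.9M is only about 1.8% of the raw write capability, and 2M / 23.8M
is about 8.4% of the raw read capability, i.e. roughly a 54x and 12x reduction respectively.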
Thanks for sharing this article.
-Drew
-----Original Message-----
From: Vitaliy Filippov <vitalif(a)yourcmc.ru>
Sent: Thursday, October 24, 2019 6:32 PM
To: ceph-users(a)ceph.com; Hermann Himmelbauer <hermann(a)qwer.tk>
Subject: [ceph-users] Re: Choosing suitable SSD for Ceph cluster
It's easy:
Hi,
I am running a nice Ceph (Proxmox 4 / Debian 8 / Ceph 0.94.3) cluster
on
3 nodes (Supermicro X8DTT-HIBQF), 2 OSDs each (2TB SATA hard disks),
interconnected via 40Gbit/s InfiniBand.
The problem is that the Ceph performance is quite bad (approx. 30MiB/s
reading, 3-4 MiB/s writing), so I thought about plugging a PCIe-to-NVMe/M.2
adapter into each node and installing SSDs. The idea is
to have faster Ceph storage and also some storage extension.
The question is now which SSDs I should use. If I understand it right,
not every SSD is suitable for Ceph, as noted at the links below:
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
or here:
https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark
In the first link, the Samsung SSD 950 PRO 512GB NVMe is listed as a
fast SSD for Ceph. As the 950 is no longer available, I ordered a
Samsung 970 1TB for testing; unfortunately, the "EVO" variant instead of the PRO.
Before equipping all nodes with these SSDs, I did some tests with "fio"
as recommended, e.g. like this:
fio --filename=/dev/DEVICE --direct=1 --sync=1 --rw=write --bs=4k
--numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
--name=journal-test
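(For comparison, it might be worth running the identical command without
--sync=1, to isolate the flush penalty; same placeholder device, job name
arbitrary, and as above the run overwrites whatever is on the device:
fio --filename=/dev/DEVICE --direct=1 --rw=write --bs=4k
--numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
--name=cache-test
The difference between the two runs should be roughly the cost of the
drive honouring every flush.)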
The results are as the following:
-----------------------
1) Samsung 970 EVO NVMe M.2 with PCIe adapter
Jobs: 1:
read : io=26706MB, bw=445MiB/s, iops=113945, runt= 60001msec
write: io=252576KB, bw=4.1MiB/s, iops=1052, runt= 60001msec
Jobs: 4:
read : io=21805MB, bw=363.4MiB/s, iops=93034, runt= 60001msec
write: io=422204KB, bw=6.8MiB/s, iops=1759, runt= 60002msec
Jobs: 10:
read : io=26921MB, bw=448MiB/s, iops=114859, runt= 60001msec
write: io=435644KB, bw=7MiB/s, iops=1815, runt= 60004msec
-----------------------
So the read speed is impressive, but the write speed is really bad.
Therefore I ordered the Samsung 970 PRO (1TB), as it has faster NAND
chips (MLC instead of TLC). The results are, however, even worse for
writing:
-----------------------
2) Samsung 970 PRO NVMe M.2 with PCIe adapter
Jobs: 1:
read : io=15570MB, bw=259.4MiB/s, iops=66430, runt= 60001msec
write: io=199436KB, bw=3.2MiB/s, iops=830, runt= 60001msec
Jobs: 4:
read : io=48982MB, bw=816.3MiB/s, iops=208986, runt= 60001msec
write: io=327800KB, bw=5.3MiB/s, iops=1365, runt= 60002msec
Jobs: 10:
read : io=91753MB, bw=1529.3MiB/s, iops=391474, runt= 60001msec
write: io=343368KB, bw=5.6MiB/s, iops=1430, runt= 60005msec
-----------------------
I did some research and found out that the "--sync" flag sets the
"O_DSYNC" flag, which seems to disable the SSD's volatile write cache
and leads to these horrid write speeds.
It seems the write cache is only left enabled for SSDs that implement
some kind of battery/capacitor buffer which guarantees that cached data
is flushed to flash in case of a power loss.
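(A quick way to reproduce the effect without fio, in the spirit of the
first article above; /dev/DEVICE is again a placeholder and both runs
overwrite it:
dd if=/dev/zero of=/dev/DEVICE bs=4k count=100000 oflag=direct
dd if=/dev/zero of=/dev/DEVICE bs=4k count=100000 oflag=direct,dsync
The first run goes through the drive's volatile cache, the second forces
a sync per write.)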
However, it seems impossible to find out which SSDs have this
power-loss protection. Moreover, these enterprise SSDs are crazy
expensive compared to the SSDs above, and it's unclear whether
power-loss protection is even available in the NVMe form factor. So
building a 1 or 2TB cluster seems not really affordable/viable.
So, can anyone please give me hints on what to do? Is it possible to
ensure that the write cache is not disabled in some way (my server is
situated in a data center, so there will probably never be a loss of power)?
Or is the link above already outdated, because newer Ceph releases
somehow deal with this problem? Or maybe a later Debian release (10)
will handle the O_DSYNC flag differently?
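(For reference, the knobs I have found so far for inspecting/toggling the
volatile write cache; device names are placeholders, and this assumes
hdparm and nvme-cli are installed:
hdparm -W /dev/sdX                         # SATA: show write cache state
hdparm -W 0 /dev/sdX                       # SATA: disable the volatile cache
nvme get-feature /dev/nvme0 -f 0x06        # NVMe: feature 0x06 = volatile write cache
nvme set-feature /dev/nvme0 -f 0x06 -v 0   # NVMe: disable it
cat /sys/block/nvme0n1/queue/write_cache   # kernel view: "write back" or "write through"
I have not verified whether disabling the cache actually helps on these
Samsung drives.)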
Perhaps I should simply invest in faster (and bigger) hard disks and
forget the SSD cluster idea?
Thank you in advance for any help,
Best Regards,
Hermann
--
With best regards,
Vitaliy Filippov
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io