Try your tests again with the volatile write cache disabled ([s/h]dparm -W 0 DEVICE). If your
disks have super capacitors, you should then see spec performance (possibly starting with
iodepth=2 or 4) with your fio test. A good article is this one here:
The feature you are looking for is called "power loss protection". I would
expect Samsung PRO disks to have it.
The fio test with iodepth=1 will give you an indication of what you can expect from a
single OSD deployed on the disk. When choosing disks, also look for DWPD>=1.
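
DWPD (drive writes per day) is often not printed on consumer data sheets, but it can be estimated from the rated endurance (TBW), the capacity, and the warranty period. A minimal sketch, assuming the common definition DWPD = TBW / (capacity in TB x warranty days); the function name and the example figures are illustrative and not taken from any data sheet:

```python
def dwpd(tbw_terabytes, capacity_terabytes, warranty_years):
    """Estimate drive writes per day from the rated endurance (TBW)."""
    warranty_days = warranty_years * 365
    return tbw_terabytes / (capacity_terabytes * warranty_days)


if __name__ == "__main__":
    # Hypothetical drive: 1 TB capacity, 600 TBW rated, 5-year warranty.
    value = dwpd(600, 1.0, 5)
    print(f"estimated DWPD: {value:.2f}")
    print("meets DWPD>=1" if value >= 1 else "below DWPD>=1")
```

A drive that falls well below 1 by this estimate is probably aimed at desktop workloads rather than sustained OSD writes.
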
In addition, as Martin writes, consider upgrading and deploying all new disks with
From: Martin Verges <martin.verges(a)croit.io>
Sent: 24 October 2019 21:21
To: Hermann Himmelbauer
Subject: [ceph-users] Re: Choosing suitable SSD for Ceph cluster
Think about migrating to a much faster and better Ceph version and towards bluestore to
increase the performance with the existing hardware.
If you want to go with a PCIe card, the Samsung PM1725b can provide quite good speeds but at
much higher cost than the EVO. If you want to check drives, take a look at the uncached
write latency. The lower the value, the better the drive.
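
At queue depth 1 with a single job, each write has to finish before the next one starts, so the average write latency is roughly the reciprocal of the reported IOPS. A small sketch of that conversion; the function name is made up for illustration, and the 1052 IOPS figure is one of the sync-write results quoted further down in this thread, assuming that run used iodepth=1 and numjobs=1:

```python
def avg_latency_ms(iops):
    """Approximate average latency in ms for a queue-depth-1, single-job run."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return 1000.0 / iops


if __name__ == "__main__":
    # 1052 IOPS is taken from the sync-write results quoted in this thread.
    print(f"{avg_latency_ms(1052):.2f} ms per 4k sync write")
```
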
On Thu, 24 Oct 2019 at 21:09, Hermann Himmelbauer wrote:
I am running a nice ceph (proxmox 4 / debian-8 / ceph 0.94.3) cluster on
3 nodes (supermicro X8DTT-HIBQF), 2 OSD each (2TB SATA harddisks),
interconnected via Infiniband 40.
The problem is that the ceph performance is quite bad (approx. 30 MiB/s
reading, 3-4 MiB/s writing), so I thought about plugging a PCIe to
NVMe/M.2 adapter into each node and installing SSDs. The idea is to
have faster ceph storage and also some storage extension.
The question is now which SSDs I should use. If I understand it right,
not every SSD is suitable for ceph, as is denoted at the links below:
In the first link, the Samsung SSD 950 PRO 512GB NVMe is listed as a
fast SSD for ceph. As the 950 is not available anymore, I ordered a
Samsung 970 1TB for testing, unfortunately, the "EVO" instead of PRO.
Before equipping all nodes with these SSDs, I did some tests with "fio"
as recommended, e.g. like this:
fio --filename=/dev/DEVICE --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting \
    --name=sync-write-test
The results are as follows:
1) Samsung 970 EVO NVMe M.2 with PCIe adapter
read : io=26706MB, bw=445MiB/s, iops=113945, runt= 60001msec
write: io=252576KB, bw=4.1MiB/s, iops=1052, runt= 60001msec
read : io=21805MB, bw=432.7MiB/s, iops=93034, runt= 60001msec
write: io=422204KB, bw=6.8MiB/s, iops=1759, runt= 60002msec
read : io=26921MB, bw=448MiB/s, iops=114859, runt= 60001msec
write: io=435644KB, bw=7MiB/s, iops=1815, runt= 60004msec
So the read speed is impressive, but the write speed is really bad.
Therefore I ordered the Samsung 970 PRO (1TB) as it has faster NAND
chips (MLC instead of TLC). The results are, however, even worse for writing:
Samsung 970 PRO NVMe M.2 with PCIe adapter
read : io=15570MB, bw=259.4MiB/s, iops=66430, runt= 60001msec
write: io=199436KB, bw=3.2MiB/s, iops=830, runt= 60001msec
read : io=48982MB, bw=816.3MiB/s, iops=208986, runt= 60001msec
write: io=327800KB, bw=5.3MiB/s, iops=1365, runt= 60002msec
read : io=91753MB, bw=1529.3MiB/s, iops=391474, runt= 60001msec
write: io=343368KB, bw=5.6MiB/s, iops=1430, runt= 60005msec
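
The bandwidth and IOPS columns in fio output are tied together by the block size (bw is roughly iops x bs), which allows a quick plausibility check of figures like the ones above. A minimal sketch, assuming bs=4k means 4096 bytes and that MiB means 2^20 bytes; the helper name is illustrative:

```python
def bandwidth_mib_s(iops, block_size_bytes=4096):
    """Convert an IOPS figure into MiB/s for a fixed block size."""
    return iops * block_size_bytes / 2**20


if __name__ == "__main__":
    # (label, write IOPS) pairs copied from the first write line of each drive above.
    for label, iops in [("970 EVO", 1052), ("970 PRO", 830)]:
        print(f"{label}: {iops} IOPS at 4k is about {bandwidth_mib_s(iops):.1f} MiB/s")
```

If a reported bw value differs clearly from this product, the line is worth re-checking against the raw fio output.
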
I did some research and found out that the "--sync" flag sets the
"O_DSYNC" flag, which seems to disable the SSD cache, and that leads to
these horrid write speeds.
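
To get a feel for what O_DSYNC does outside of fio, the same flag can be set on an ordinary file. A rough sketch, assuming Linux or another POSIX system where Python exposes os.O_DSYNC; it writes through the filesystem rather than to the raw device, so the numbers are not comparable to the fio results, and on a RAM-backed filesystem the flag has little effect. The function name and the file name are illustrative:

```python
import os
import tempfile
import time


def timed_writes(path, count=200, block_size=4096, dsync=False):
    """Write `count` blocks to `path`; return the average seconds per write."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if dsync:
        # O_DSYNC: each write returns only once the data is reported stable.
        # Assumption: where the constant is missing, no extra flag is set.
        flags |= getattr(os, "O_DSYNC", 0)
    block = b"\0" * block_size
    fd = os.open(path, flags, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed / count


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "dsync-test.bin")
        buffered = timed_writes(target, dsync=False)
        synced = timed_writes(target, dsync=True)
        print(f"buffered: {buffered * 1e6:.1f} us per 4k write")
        print(f"O_DSYNC:  {synced * 1e6:.1f} us per 4k write")
```

On a disk without power loss protection, the O_DSYNC figure is usually the much larger of the two, which mirrors the gap seen in the fio runs.
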
It seems that this relates to the fact that the write cache is left
enabled only on SSDs which implement some kind of battery buffer that
guarantees a data flush to the flash in case of a power loss.
However, it seems impossible to find out which SSDs have this
powerloss protection. Moreover, these enterprise SSDs are crazy
expensive compared to the SSDs above, and it's unclear whether
powerloss protection is even available in the NVMe form factor. So
building a 1 or 2 TB cluster seems not really affordable/viable.
So, can anyone please give me hints on what to do? Is there some way to
ensure that the write cache is not disabled (my server is situated
in a data center, so there will probably never be a loss of power)?
Or is the link above already outdated as newer ceph releases somehow
deal with this problem? Or maybe a later Debian release (10) will handle
the O_DSYNC flag differently?
Perhaps I should simply invest in faster (and bigger) harddisks and
forget the SSD-cluster idea?
Thank you in advance for any help,