Hi,
I am running a Ceph cluster (Proxmox 4 / Debian 8 / Ceph 0.94.3) on
3 nodes (Supermicro X8DTT-HIBQF) with 2 OSDs each (2 TB SATA hard disks),
interconnected via 40 Gb InfiniBand.
The problem is that Ceph performance is quite bad (approx. 30 MiB/s
reading, 3-4 MiB/s writing), so I thought about plugging a PCIe-to-NVMe/M.2
adapter into each node and installing SSDs. The idea is to get faster
Ceph storage and also some additional capacity.
The question now is which SSDs I should use. If I understand it right,
not every SSD is suitable for Ceph, as described in the links below:
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-i…
or here:
https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark
In the first link, the Samsung SSD 950 PRO 512GB NVMe is listed as a
fast SSD for Ceph. As the 950 is no longer available, I ordered a
Samsung 970 1TB for testing, unfortunately the "EVO" rather than the PRO.
Before equipping all nodes with these SSDs, I did some tests with "fio"
as recommended, e.g. like this:
fio --filename=/dev/DEVICE --direct=1 --sync=1 --rw=write --bs=4k
--numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
--name=journal-test
The results are as follows:
-----------------------
1) Samsung 970 EVO NVMe M.2 with PCIe adapter
Jobs: 1:
read : io=26706MB, bw=445MiB/s, iops=113945, runt= 60001msec
write: io=252576KB, bw=4.1MiB/s, iops=1052, runt= 60001msec
Jobs: 4:
read : io=21805MB, bw=432.7MiB/s, iops=93034, runt= 60001msec
write: io=422204KB, bw=6.8MiB/s, iops=1759, runt= 60002msec
Jobs: 10:
read : io=26921MB, bw=448MiB/s, iops=114859, runt= 60001msec
write: io=435644KB, bw=7MiB/s, iops=1815, runt= 60004msec
-----------------------
So the read speed is impressive, but the write speed is really bad.
Therefore I ordered the Samsung 970 PRO (1TB), as it has faster NAND
chips (MLC instead of TLC). The write results are, however, even worse:
-----------------------
2) Samsung 970 PRO NVMe M.2 with PCIe adapter
Jobs: 1:
read : io=15570MB, bw=259.4MiB/s, iops=66430, runt= 60001msec
write: io=199436KB, bw=3.2MiB/s, iops=830, runt= 60001msec
Jobs: 4:
read : io=48982MB, bw=816.3MiB/s, iops=208986, runt= 60001msec
write: io=327800KB, bw=5.3MiB/s, iops=1365, runt= 60002msec
Jobs: 10:
read : io=91753MB, bw=1529.3MiB/s, iops=391474, runt= 60001msec
write: io=343368KB, bw=5.6MiB/s, iops=1430, runt= 60005msec
-----------------------
I did some research and found out that the "--sync" flag makes fio open
the device with O_SYNC/O_DSYNC, so every write must be persisted before it
is acknowledged; this effectively takes the SSD's volatile write cache out
of play and leads to these horrid write speeds.
It seems this comes down to the fact that the write cache only remains
effective for sync writes on SSDs that implement some kind of battery or
capacitor buffer guaranteeing that cached data is flushed to flash in case
of a power loss.
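As a rough check of what a given drive reports (a sketch assuming nvme-cli
is installed and the drive shows up as /dev/nvme0; adjust the device name),
the identify data shows whether a volatile write cache is present at all,
and NVMe feature 0x06 shows whether it is currently enabled. Power-loss
protection itself is usually only stated in the vendor data sheet, not in a
standard field:

nvme id-ctrl /dev/nvme0 -H | grep -i "volatile write cache"
nvme get-feature /dev/nvme0 -f 0x06 -H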
However, it seems hard to find out which SSDs actually have this
power-loss protection; moreover, such enterprise SSDs are very expensive
compared to the SSDs above, and it is unclear whether power-loss protection
is even available in the M.2 NVMe form factor. So building a 1 or 2 TB
cluster this way does not seem affordable/viable.
So, can anyone please give me hints on what to do? Is it possible to
ensure that the write cache stays enabled in some way? (My servers are in
a data center, so there will probably never be a loss of power.)
Or is the link above already outdated, with newer Ceph releases somehow
dealing with this problem? Or will a later Debian release (10) perhaps
handle the O_DSYNC flag differently?
Perhaps I should simply invest in faster (and bigger) hard disks and
forget the SSD cluster idea?
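Incidentally, a quick way to confirm that the sync writes (and not the
adapter or the drive itself) are the limiting factor would be to run the
same fio job once with and once without the sync flag and compare the two
(a sketch based on the command above; it writes to the raw device just
like the original test):

fio --filename=/dev/DEVICE --direct=1 --sync=1 --rw=write --bs=4k
--numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
--name=sync-write

fio --filename=/dev/DEVICE --direct=1 --sync=0 --rw=write --bs=4k
--numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
--name=buffered-write

If the second run is orders of magnitude faster, the drive's volatile
cache is doing the work, and a drive with power-loss protection is the
usual answer for Ceph journal/DB duty.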
Thank you in advance for any help,
Best Regards,
Hermann
--
hermann(a)qwer.tk
PGP/GPG: 299893C7 (on keyservers)
I'm trying to deploy a Ceph cluster with the cephadm tool. I've already completed all steps successfully except adding OSDs. My test setup consists of three hosts. Each host has SSD storage, where the OS is installed; on that SSD I created a partition which could be used as a Ceph block.db. Each host also has 2 additional HDDs (spinning drives) for OSD data. In the docs I couldn't find how to deploy such a configuration. Do you have any hints on how to do that?
Thanks for help!
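(Follow-up sketch, in case it helps: with the orchestrator, a drive-group
style OSD spec can describe "spinning disks for data, SSD for block.db".
The file name, service_id and filters below are just examples and need to
be adapted; note that ceph-volume may not consume an existing partition
for block.db, so pointing it at the whole SSD or an LVM volume is often
easier.)

ceph orch device ls        # check which devices cephadm sees as available

cat > osd-spec.yml <<'EOF'
service_type: osd
service_id: hdd-osd-with-ssd-db     # example name
placement:
  host_pattern: '*'                 # apply on all hosts
data_devices:
  rotational: 1                     # the two spinning drives per host
db_devices:
  rotational: 0                     # the SSD to place block.db on
EOF

ceph orch apply osd -i osd-spec.yml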
Hello everybody,
Can somebody add support for Debian Buster to ceph-deploy:
https://tracker.ceph.com/issues/42870
Highly appreciated,
Regards,
Jelle de Jong
Dear all,
After enabling "allow_standby_replay" on our cluster, we are getting
lots of identical errors in the client's /var/log/messages, like:
Apr 29 14:21:26 hal kernel: ceph: mdsmap_decode got incorrect
state(up:standby-replay)
We are using the ml kernel 5.6.4-1.el7 on Scientific Linux 7.8
Cluster and client are running Ceph v14.2.9
Setting was enabled with:
# ceph fs set cephfs allow_standby_replay true
[root@ceph-s1 ~]# ceph mds stat
cephfs:1 {0=ceph-s3=up:active} 1 up:standby-replay 2 up:standby
Is this something to worry about, or should we just disable
allow_standby_replay?
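(For reference, turning it off again should just be the inverse of the
command above:)

# ceph fs set cephfs allow_standby_replay false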
any advice appreciated,
many thanks
Jake
Note: I am working from home until further notice.
For help, contact unixadmin(a)mrc-lmb.cam.ac.uk
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Phone 01223 267019
Mobile 0776 9886539
We are seeking information on configuring Ceph to work with Noobaa and
NextCloud.
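(In case it helps as a starting point: both NooBaa and Nextcloud can talk
to Ceph through the S3 API of the RADOS Gateway, so the usual first step
is an RGW user whose access/secret keys are then configured in the
application. A minimal sketch, with made-up uid/display-name values:)

radosgw-admin user create --uid=nextcloud --display-name="Nextcloud S3 user"
radosgw-admin user create --uid=noobaa --display-name="NooBaa backing store"

The generated access_key/secret_key pairs then go into Nextcloud's object
storage configuration and NooBaa's backing-store settings, respectively.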
Randy
--
Randy Morgan
CSR
Department of Chemistry/BioChemistry
Brigham Young University
randym(a)chem.byu.edu
Hi list,
We're wondering if Ceph Nautilus packages will be provided for Ubuntu
Focal Fossa (20.04)?
You might wonder why one would not just use Ubuntu Bionic (18.04)
instead of using the latest LTS. Here is why: a glibc bug in Ubuntu
Bionic that *might* affect Open vSwitch (OVS) users [1].
We had quite a few issues with OVS deadlocks on hypervisors, and do not
want to risk experiencing the same issues on our Ceph cluster(s). I'm
not sure how many of you use OVS for bridging / bonding, but for those
who do, running Ceph (Nautilus / Octopus) on 20.04 would be preferred.
Gr. Stefan
[1]: https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1839592
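(One way to see what upstream currently publishes, assuming the usual
directory layout on download.ceph.com, is to list the dists of the release
repos and look for a focal entry:)

curl -s https://download.ceph.com/debian-nautilus/dists/ | grep -o focal
curl -s https://download.ceph.com/debian-octopus/dists/ | grep -o focal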
Hi,
I've read that Ceph has some InfluxDB reporting capability built in (https://docs.ceph.com/docs/master/mgr/influx/).
However, Telegraf, the metrics collection agent for InfluxDB, also has a Ceph plugin (https://github.com/influxdata/telegraf/tree/master/plugins/inputs/ceph).
Just curious what people's thoughts on the two are, or what they are using in production?
Which is easier to deploy/maintain, have you found? Or more useful for alerting, or tracking performance gremlins?
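For context, the built-in path is roughly the following (a sketch based on
the influx module docs linked above; hostname, database and credentials
are placeholders):

ceph mgr module enable influx
ceph config set mgr mgr/influx/hostname influxdb.example.com
ceph config set mgr mgr/influx/database ceph
ceph config set mgr mgr/influx/username ceph
ceph config set mgr mgr/influx/password secret

The Telegraf plugin instead reads the daemon admin sockets on each node,
so it needs the agent plus read access to /var/run/ceph, but no extra mgr
module. That push-from-the-mgr vs. collect-on-every-host difference is
probably the main deploy/maintain trade-off.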
Thanks,
Victor
Hi,
I was just checking on a few (13) IPv6-only Ceph clusters and I noticed
that they couldn't send their Telemetry data anymore:
telemetry.ceph.com has address 8.43.84.137
This server used to have dual-stack connectivity while it was still
hosted at OVH. It seems to have moved to Red Hat, but lost IPv6
connectivity in the process.
How can we get this back?
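(Not a fix for the lost IPv6 connectivity, but as a possible stop-gap: the
telemetry module can be pointed at an HTTP(S) proxy that still has IPv4
connectivity, if my reading of the module options is right. The proxy
address below is a placeholder:)

ceph config set mgr mgr/telemetry/proxy https://proxy.example.com:8080
ceph telemetry show | head    # sanity-check the report that would be sent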
Wido