I'm running a cluster with bluestore on raw devices (no LVM) and all
journals (i.e. WAL/DB) colocated on the same disk as the data. Disks are
spinning NL-SAS. Our goal was to build storage at the lowest cost, so all
data is on HDD only. I have a few SSDs that I'm using for CephFS and RBD
metadata. All large pools are EC on spinning disk.
I spent at least a month running detailed benchmarks (rbd bench) across
EC profiles, object sizes, write sizes, etc. Results varied a lot. My
advice would be to run benchmarks on your own hardware. If there were a
single perfect choice, there wouldn't be so many options. For example, my
results will not be valid for setups using separate fast disks for WAL
and DB.
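For illustration, this is roughly the kind of rbd bench run I mean; pool
and image names below are placeholders, not our actual configuration:

  # image whose data objects go to an EC pool (names are placeholders)
  rbd create --size 100G --data-pool ec_test rbd_meta/bench_img

  # single-threaded large writes (throughput)
  rbd bench --io-type write --io-size 1M --io-threads 1 --io-total 10G rbd_meta/bench_img

  # small random writes (IOP/s)
  rbd bench --io-type write --io-size 4K --io-threads 16 --io-pattern rand --io-total 1G rbd_meta/bench_img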
There are some results though that might be valid in general:
1) EC pools have high throughput but low IOP/s compared with replicated
pools
I see single-thread write speeds of up to 1.2 GB (gigabytes) per second,
which is probably the network limit rather than the disk limit. IOP/s get
better with more disks, but are way lower than what replicated pools can
provide. On a CephFS with an EC data pool, small-file IO will be
comparatively slow and eat a lot of resources.
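For reference, attaching an EC pool as a CephFS data pool looks roughly
like this (pool and path names are made up); note that RBD and CephFS on
EC require partial overwrites to be enabled:

  # EC pools need overwrites enabled before RBD/CephFS can use them
  ceph osd pool set ec_data allow_ec_overwrites true
  ceph fs add_data_pool cephfs ec_data
  # direct a directory's files to the EC pool via the file layout
  setfattr -n ceph.dir.layout.pool -v ec_data /mnt/cephfs/bulk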
2) I observe massive network traffic amplification for small IO sizes,
which is due to the way EC overwrites are handled. This is one bottleneck
for IOP/s. We have 10G infrastructure and use 2x10G for the client network
and 4x10G for the OSD network. OSD network bandwidth should be at least 2x
the client network, better 4x or more.
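As a sketch, separating client traffic from OSD replication/recovery/EC
traffic is done in ceph.conf; the subnets below are examples only:

  [global]
  # client traffic
  public network = 192.168.10.0/24
  # OSD replication/recovery/EC traffic
  cluster network = 192.168.20.0/24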
3) k should only have small prime factors, power of 2 if possible
I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All
other choices were poor. The value of m does not seem relevant for performance.
Larger k will require more failure domains (more hardware).
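For example, a k=8, m=2 profile would be created roughly like this
(profile name and failure domain are just examples for illustration):

  ceph osd erasure-code-profile set ec_8_2 \
      k=8 m=2 plugin=jerasure crush-failure-domain=host
  ceph osd erasure-code-profile get ec_8_2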
4) object size matters
The best throughput (1M write size) I see with object sizes of 4MB or
8MB; IOP/s get somewhat better with smaller object sizes, but throughput
drops fast. I use the default of 4MB in production. It works well for us.
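The object size is set per RBD image (or per CephFS file layout), not in
the EC profile; for RBD it would look roughly like this, with placeholder
names:

  # 4M is the default; shown here only to make the choice explicit
  rbd create --size 1T --object-size 4M --data-pool ec_8_2_data rbd_meta/vol01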
5) jerasure is quite good and seems most flexible
jerasure is quite CPU efficient and can handle smaller chunk sizes than
other plugins, which is preferable for IOP/s. However, CPU usage can
become a problem, and a plugin optimized for specific values of k and m
might help here. Under usual circumstances I see very low load on all OSD
hosts, even under rebalancing. However, I remember that once I needed to
rebuild something on all OSDs (I don't remember what it was, sorry). In
this situation, CPU load went up to 30-50% (meaning up to half the cores
were at 100%), which is really high considering that each server has only
16 disks at the moment and is sized to handle up to 100. CPU power could
become a bottleneck for us in the future.
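The plugin, technique, and chunk size are all chosen in the EC profile; a
rough example follows (reed_sol_van is the jerasure default technique, and
the stripe_unit value is only an illustration, not a recommendation):

  ceph osd erasure-code-profile set ec_8_2_small_chunks \
      k=8 m=2 plugin=jerasure technique=reed_sol_van \
      stripe_unit=4K crush-failure-domain=host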
These are some general observations and do not replace benchmarks for
specific use cases. I was hunting for a specific performance pattern, which
might not be what you want to optimize for. I would recommend running
extensive benchmarks if you have to live with a configuration for a long
time - EC profiles cannot be changed after pool creation.
We settled on 8+2 and 6+2 pools with jerasure and object size 4M (a sketch
of such a setup follows below). We also use bluestore compression. All
metadata pools are on SSD; only very little SSD space is required. This
choice works well for the majority of our use cases. We can still build
small, expensive pools to accommodate special performance requests.
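A rough sketch of such a setup, with placeholder names and PG counts that
would need to be sized for the actual cluster:

  # data pool: EC 8+2 with bluestore compression
  ceph osd erasure-code-profile set ec_8_2 k=8 m=2 plugin=jerasure crush-failure-domain=host
  ceph osd pool create data_ec 1024 1024 erasure ec_8_2
  ceph osd pool set data_ec allow_ec_overwrites true
  ceph osd pool set data_ec compression_mode aggressive
  ceph osd pool set data_ec compression_algorithm snappy

  # metadata pool: replicated, placed on SSD via a device-class rule
  ceph osd crush rule create-replicated rep_ssd default host ssd
  ceph osd pool create meta_ssd 64 64 replicated rep_ssd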
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: ceph-users <ceph-users-bounces(a)lists.ceph.com> on behalf of David <
xiaomajia.st(a)gmail.com>
Sent: 07 July 2019 20:01:18
To: ceph-users(a)lists.ceph.com
Subject: [ceph-users] What's the best practice for Erasure Coding
Hi Ceph-Users,
I'm working with a Ceph cluster (about 50TB, 28 OSDs, all Bluestore on
lvm).
Recently, I'm trying to use the Erasure Code pool.
My question is "what's the best practice for using EC pools ?".
More specifically, which plugin (jerasure, isa, lrc, shec or clay)
should I adopt, and how should I choose the combination of (k,m) (e.g.
(k=3,m=2), (k=6,m=3))?
Can anyone share some experience?
Thanks for any help.
Regards,
David
_______________________________________________
ceph-users mailing list
ceph-users(a)lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com