I'm running a cluster with bluestore on raw devices (no LVM) and all
journals (i.e. WAL/DB) colocated on the same disk as the data. Disks are
spinning NL-SAS. Our goal was to build storage at the lowest cost, so all
data is on HDD only. I have a few SSDs that I'm using for CephFS and RBD
metadata. All large pools are EC on spinning disk.
I spent at least a month running detailed benchmarks (rbd bench) across
EC profiles, object sizes, write sizes, etc. Results varied a lot. My
advice would be to run benchmarks on your own hardware. If there were a
single perfect choice, there wouldn't be so many options. For example, my
results will not be valid for setups using separate fast disks for WAL
and DB.
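For illustration, this is roughly the kind of rbd bench run I mean; pool
and image names below are placeholders, not our actual configuration:

  # image whose data objects go to an EC pool (names are placeholders)
  rbd create --size 100G --data-pool ec_test rbd_meta/bench_img

  # single-threaded large writes (throughput)
  rbd bench --io-type write --io-size 1M --io-threads 1 --io-total 10G rbd_meta/bench_img

  # small random writes (IOP/s)
  rbd bench --io-type write --io-size 4K --io-threads 16 --io-pattern rand --io-total 1G rbd_meta/bench_img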
There are some results though that might be valid in general:
1) EC pools have high throughput but low IOP/s compared with replicated
pools
I see single-thread write speeds of up to 1.2 GB (gigabytes) per second,
which is probably the network limit rather than the disk limit. IOP/s get
better with more disks, but are way lower than what replicated pools can
provide. On a CephFS with an EC data pool, small-file IO will be
comparatively slow and eat a lot of resources.
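For reference, attaching an EC pool as a CephFS data pool looks roughly
like this (pool and path names are made up); note that RBD and CephFS on
EC require partial overwrites to be enabled:

  # EC pools need overwrites enabled before RBD/CephFS can use them
  ceph osd pool set ec_data allow_ec_overwrites true
  ceph fs add_data_pool cephfs ec_data
  # direct a directory's files to the EC pool via the file layout
  setfattr -n ceph.dir.layout.pool -v ec_data /mnt/cephfs/bulk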
2) I observe massive network traffic amplification for small IO sizes,
which is due to the way EC overwrites are handled. This is one bottleneck
for IOP/s. We have 10G infrastructure and use 2x10G for the client network
and 4x10G for the OSD network. OSD network bandwidth should be at least 2x
the client network, better 4x or more.
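As a sketch, separating client traffic from OSD replication/recovery/EC
traffic is done in ceph.conf; the subnets below are examples only:

  [global]
  # client traffic
  public network = 192.168.10.0/24
  # OSD replication/recovery/EC traffic
  cluster network = 192.168.20.0/24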
3) k should only have small prime factors, power of 2 if possible
I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All
other choices were poor. The value of m does not seem relevant for performance.
Larger k will require more failure domains (more hardware).
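For example, a k=8, m=2 profile would be created roughly like this
(profile name and failure domain are just examples for illustration):

  ceph osd erasure-code-profile set ec_8_2 \
      k=8 m=2 plugin=jerasure crush-failure-domain=host
  ceph osd erasure-code-profile get ec_8_2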
4) object size matters
The best throughput (1M write size) I see with object sizes of 4MB or
8MB; IOP/s get somewhat better with smaller object sizes, but throughput
drops fast. I use the default of 4MB in production. It works well for us.
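The object size is set per RBD image (or per CephFS file layout), not in
the EC profile; for RBD it would look roughly like this, with placeholder
names:

  # 4M is the default; shown here only to make the choice explicit
  rbd create --size 1T --object-size 4M --data-pool ec_8_2_data rbd_meta/vol01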
5) jerasure is quite good and seems most flexible
jerasure is quite CPU efficient and can handle smaller chunk sizes than
other plugins, which is preferable for IOP/s. However, CPU usage can
become a problem, and a plugin optimized for specific values of k and m
might help here. Under usual circumstances I see very low load on all OSD
hosts, even under rebalancing. However, I remember that once I needed to
rebuild something on all OSDs (I don't remember what it was, sorry). In
this situation, CPU load went up to 30-50% (meaning up to half the cores
were at 100%), which is really high considering that each server has only
16 disks at the moment and is sized to handle up to 100. CPU power could
become a bottleneck for us in the future.
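The plugin, technique, and chunk size are all chosen in the EC profile; a
rough example follows (reed_sol_van is the jerasure default technique, and
the stripe_unit value is only an illustration, not a recommendation):

  ceph osd erasure-code-profile set ec_8_2_small_chunks \
      k=8 m=2 plugin=jerasure technique=reed_sol_van \
      stripe_unit=4K crush-failure-domain=host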
These are some general observations and do not replace benchmarks for
specific use cases. I was hunting for a specific performance pattern, which
might not be what you want to optimize for. I would recommend running
extensive benchmarks if you have to live with a configuration for a long
time - EC profiles cannot be changed after pool creation.
We settled on 8+2 and 6+2 pools with jerasure and object size 4M (a sketch
of such a setup follows below). We also use bluestore compression. All
metadata pools are on SSD; only very little SSD space is required. This
choice works well for the majority of our use cases. We can still build
small, expensive pools to accommodate special performance requests.
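A rough sketch of such a setup, with placeholder names and PG counts that
would need to be sized for the actual cluster:

  # data pool: EC 8+2 with bluestore compression
  ceph osd erasure-code-profile set ec_8_2 k=8 m=2 plugin=jerasure crush-failure-domain=host
  ceph osd pool create data_ec 1024 1024 erasure ec_8_2
  ceph osd pool set data_ec allow_ec_overwrites true
  ceph osd pool set data_ec compression_mode aggressive
  ceph osd pool set data_ec compression_algorithm snappy

  # metadata pool: replicated, placed on SSD via a device-class rule
  ceph osd crush rule create-replicated rep_ssd default host ssd
  ceph osd pool create meta_ssd 64 64 replicated rep_ssd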
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: ceph-users <ceph-users-bounces(a)lists.ceph.com> on behalf of David <
xiaomajia.st(a)gmail.com>
Sent: 07 July 2019 20:01:18
To: ceph-users(a)lists.ceph.com
Subject: [ceph-users] What's the best practice for Erasure Coding
Hi Ceph-Users,
I'm working with a Ceph cluster (about 50TB, 28 OSDs, all Bluestore on
lvm).
Recently, I'm trying to use the Erasure Code pool.
My question is "what's the best practice for using EC pools ?".
More specifically, which plugin (jerasure, isa, lrc, shec or clay)
should I adopt, and how should I choose the combination of (k,m) (e.g.
(k=3,m=2), (k=6,m=3))?
Can anyone share some experience?
Thanks for any help.
Regards,
David
_______________________________________________
ceph-users mailing list
ceph-users(a)lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com