Hi Roman,
It's always really interesting to read your messages :) Maybe you'll
join our Telegram chat @ceph_ru? One of your colleagues is there :)
Client-based replication is of course the fastest, but the problem is
that it's unclear how to provide consistency with it. Maybe it's
possible with some restrictions, but... I don't think it's possible in
Ceph :)
By the way, have you tested their Crimson OSD? Is it any faster than
the current implementation? (regarding iodepth=1 fsync=1 latency)
> Hi all,
>
> I would like to share the comparison of 3 replication models based on
> Pech OSD [1] cluster, which supports a sufficient minimum to replicate
> transactions from OSD to OSD and keeps all data mutations in memory
> (memstore).
>
> My goal was to compare the "primary-copy", "chain" and "client-based"
> replication models and answer the question: how each model affects
> network performance.
>
> For this estimation I chose to implement my own OSD with a bare
> minimum of features (laborious, but worth it). Its design is similar
> to Crimson OSD, but the core is based on sources from the kernel
> libceph implementation (i.e. messenger, osdmap, mon_client, etc),
> thus it is written in pure C.
>
> -- What Pech OSD supports and what does not --
>
> Comparison of the network response under different replication
> scenarios does not require fail-over (we assume that during testing,
> storing data in memory never fails, hosts never crash, etc), so to
> ease development Pech OSD does not support peering or fail-over in
> the current state of the code. Object modifications are still
> replicated on each mutation, but the cluster is not able to return to
> a consistent state after an error.
>
> Pech OSD supports RBD images, so an image can be accessed from
> userspace librbd or mapped by the kernel RBD block device. That is
> the bare minimum I need to run FIO loads and test network behavior.
>
> -- What I test --
>
> Originally my goal was to compare performance under the same loads
> but using different replication models: "client-based",
> "primary-copy" and "chain". I wanted to see in numbers what the
> different models can bring in terms of network bandwidth, latency and
> IOPS (and when we compare replication models, the network is the only
> factor that impacts the overall performance).
>
> Briefly, about the replication models:
>
> "client-based" - the client itself is responsible for sending
> requests to replicas. To test this model the OSD client code was
> modified on the userspace [2] and kernel [3] sides. Pros: savings on
> network hops, which reduces latency. Cons: complications in the
> replication algorithm when the PG is not healthy or when there is
> concurrent access to the same object from different clients; the
> client network should be fat enough.
>
> "chain" - the client sends a write request to the primary, the
> primary forwards it to the next secondary, and so on. The final ACK
> from the last replica in the chain reaches the primary or the client
> directly. Pros: each OSD sends a request only once, which reduces the
> network load on a particular node and spreads the load. Overall
> bandwidth should increase. Cons: sequential request processing, which
> should impact latency.
>
> "primary-copy" - the default and only model for Ceph: the client
> accesses the primary replica, and the primary fans out data to the
> secondaries. Pros: already implemented. Cons: higher latency compared
> to "client-based", lower bandwidth compared to "chain".
>
> What is said above is the theory, which motivated me to prove or
> disprove it with numbers on a real cluster.
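To make the hop-count intuition above concrete, here is a hedged Python sketch (function and key names are mine, not from Pech OSD) counting, for one replicated write, how many messages the client sends, how many an OSD forwards, and how many hops sit on the critical (serial) path:

```python
# Hypothetical sketch of the three write paths, assuming 3x replication
# (1 primary + 2 secondaries); names are illustrative only.

def write_path(model, replicas=3):
    if model == "client-based":
        # The client sends to all replicas itself, in parallel:
        # one serial hop, but triple the client egress traffic.
        return {"client_sends": replicas, "osd_forwards": 0, "serial_hops": 1}
    if model == "primary-copy":
        # client -> primary, then the primary fans out in parallel.
        return {"client_sends": 1, "osd_forwards": replicas - 1,
                "serial_hops": 2}
    if model == "chain":
        # client -> primary -> secondary -> ...: fully sequential.
        return {"client_sends": 1, "osd_forwards": replicas - 1,
                "serial_hops": replicas}
    raise ValueError(f"unknown model: {model}")
```

The sketch mirrors the theory: "chain" spreads the sending load at the cost of the longest serial path, while "client-based" has the shortest path at the cost of client egress bandwidth.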
>
> -- How I test --
>
> I have a cluster at my disposal: 5 hosts with a 100 Gbit/s network
> for OSDs, and 8 client hosts with a 25 Gbit/s network.
>
> Each OSD host has 24 CPUs, so for obvious reasons each host runs 24
> OSDs, i.e. (24x5) 120 Pech OSDs for the whole cluster setup.
>
> There is one fully declustered pool with 1024 PGs (I want to spread
> the load as much as possible). The pool is created with a 3x
> replication factor.
>
> Each client starts 16 FIO jobs doing random writes to 16 RBD images
> (userspace RBD client) with various block sizes, i.e. one FIO job per
> image and 128 (16x8) jobs in total. Each client host runs an FIO
> server; all data from all servers is aggregated by the FIO client and
> stored in JSON format. There is a convenient Python script [4] which
> generates and runs FIO jobs, parses the JSON results and outputs them
> in a human-readable pretty table.
>
> Major FIO options:
>
> ioengine=rbd
> clientname=admin
> pool=rbd
>
> rw=randwrite
> size=256m
>
> time_based=1
> runtime=10
> ramp_time=10
>
> iodepth=32
> numjobs=1
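For illustration, a minimal Python sketch of how such a job file could be assembled from the options above, one job section per RBD image (this is not the actual fio-runner.py script [4]; job and image names are assumed):

```python
# The major FIO options from the text; block size varies per run.
OPTIONS = {
    "ioengine": "rbd", "clientname": "admin", "pool": "rbd",
    "rw": "randwrite", "size": "256m",
    "time_based": 1, "runtime": 10, "ramp_time": 10,
    "iodepth": 32, "numjobs": 1,
}

def render_jobfile(num_jobs=16, bs="4k"):
    # Shared options go into [global]; each job targets its own image.
    lines = ["[global]", f"bs={bs}"]
    lines += [f"{k}={v}" for k, v in OPTIONS.items()]
    for i in range(num_jobs):
        lines += [f"[job{i}]", f"rbdname=image{i}"]
    return "\n".join(lines)
```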
>
> During all the tests I collected almost 1 GB of JSON results. Quite
> enough for good analysis.
>
> -- Results --
>
> Firstly I would like to start by comparing "primary-copy" and
> "chain" on Pech OSD:
>
> 120OSDS/pech/primary-copy
>
> write/iops write/bw write/clat_ns/mean
> 4k 365.89 K 1.40 GB/s 11.11 ms
> 8k 330.51 K 2.52 GB/s 12.22 ms
> 16k 274.06 K 4.19 GB/s 14.79 ms
> 32k 204.36 K 6.25 GB/s 19.95 ms
> 64k 141.78 K 8.68 GB/s 28.54 ms
> 128k 70.42 K 8.64 GB/s 58.99 ms
> 256k 37.75 K 9.30 GB/s 109.75 ms
> 512k 17.46 K 8.67 GB/s 216.53 ms
> 1m 8.56 K 8.65 GB/s 474.94 ms
>
>
> 120OSDS/pech/chain
>
> write/iops write/bw write/clat_ns/mean
> 4k 380.29 K 1.45 GB/s 10.72 ms
> 8k 339.10 K 2.59 GB/s 11.99 ms
> 16k 280.28 K 4.28 GB/s 14.34 ms
> 32k 206.84 K 6.32 GB/s 19.64 ms
> 64k 131.57 K 8.05 GB/s 30.54 ms
> 128k 74.78 K 9.18 GB/s 54.25 ms
> 256k 39.82 K 9.81 GB/s 103.27 ms
> 512k 18.47 K 9.17 GB/s 213.78 ms
> 1m 8.98 K 9.08 GB/s 461.12 ms
>
>
> There is a slight difference in the direction of a bandwidth increase
> for the "chain" model, but I would rather put it down to noise. Other
> runs with a similar configuration show almost the same results: there
> is a minor bandwidth improvement, but nothing solid.
>
> Client-based results are much more interesting:
>
> 120OSDS/pech/client-based
>
> write/iops write/bw write/clat_ns/mean
> 4k 534.08 K 2.04 GB/s 7.62 ms
> 8k 471.78 K 3.60 GB/s 8.64 ms
> 16k 367.12 K 5.61 GB/s 11.11 ms
> 32k 242.56 K 7.41 GB/s 16.82 ms
> 64k 124.54 K 7.63 GB/s 32.98 ms
> 128k 62.45 K 7.67 GB/s 66.71 ms
> 256k 31.10 K 7.69 GB/s 135.36 ms
> 512k 15.41 K 7.71 GB/s 282.41 ms
> 1m 7.63 K 7.82 GB/s 567.63 ms
>
> Small blocks show a significant improvement in latency and IOPS:
> almost 40%, from 380k IOPS to 534k IOPS. Starting from the 64k block
> size the 25 Gbit/s client network is saturated ("client-based"
> replication means the client is responsible for sending the data to
> all replicas, so with a 3x replication factor each byte is sent 3
> times from each client host; having ~8 GB/s for 8 clients we estimate
> each client sends ~1 GB/s, with the 3x replication factor this is
> ~3 GB/s, which is exactly the ~24 Gbit/s of the client network).
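The saturation estimate in the parenthesis can be checked with a few lines of arithmetic:

```python
# Back-of-the-envelope check of the client network saturation estimate:
# 8 clients, ~8 GB/s aggregate write bandwidth, 3x client-based replication.

def client_egress_gbit(aggregate_gb_s=8.0, clients=8, replication=3):
    per_client = aggregate_gb_s / clients   # ~1 GB/s written per client
    on_the_wire = per_client * replication  # each byte sent 3 times
    return on_the_wire * 8                  # GB/s -> Gbit/s
```

This gives ~24 Gbit/s, right at the 25 Gbit/s limit of the client NICs.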
>
> What is important to keep in mind about the Pech OSD design is that
> each OSD process has only 1 OS thread, so when a request is received
> and its handler is executed, no preemption happens and no other
> requests can be handled in parallel (unless a special scheduling
> routine is called, which it is not, at least in the current state of
> the code). So various PGs on a particular Pech OSD are handled
> sequentially.
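The single-threaded, run-to-completion model described above can be sketched roughly like this (a simplification in Python; Pech OSD itself is written in C, and these names are illustrative):

```python
from collections import deque

# One OS thread, one queue: each handler runs to completion, so requests
# for different PGs on the same OSD are serviced strictly one after another.

def handle(request):
    pg, op = request
    # No preemption: nothing else runs until this returns.
    return f"pg={pg}: {op} done"

def event_loop(requests):
    queue = deque(requests)
    completed = []
    while queue:
        completed.append(handle(queue.popleft()))
    return completed
```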
>
> The design is highly CPU bound, so one simple trick can be applied to
> increase bandwidth: pinning OSDs to CPUs. Since we have 24 OSDs and
> 24 CPUs, CPU affinity is easy to apply:
>
> 120OSDS-AFF/pech/primary-copy
>
> write/iops write/bw write/clat_ns/mean
> 4k 324.15 K 1.24 GB/s 12.35 ms
> 8k 293.52 K 2.24 GB/s 13.43 ms
> 16k 235.53 K 3.60 GB/s 16.46 ms
> 32k 187.31 K 5.73 GB/s 20.77 ms
> 64k 170.60 K 10.43 GB/s 23.10 ms
> 128k 92.54 K 11.33 GB/s 34.48 ms
> 256k 47.69 K 11.73 GB/s 97.32 ms
> 512k 18.52 K 9.19 GB/s 252.26 ms
> 1m 9.20 K 9.28 GB/s 507.33 ms
>
> Bandwidth looks better for bigger blocks.
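The pinning itself could look like the following launcher-side sketch (hypothetical names; the Linux-only os.sched_setaffinity call is one way to do it, `taskset -c` from a start script is another):

```python
import os

def osd_cpu(osd_index, ncpus=24):
    # Trivial 1:1 mapping: 24 OSDs on a 24-CPU host, one CPU each.
    return osd_index % ncpus

def pin_osd(pid, osd_index):
    # Restrict the OSD process to its dedicated CPU (Linux only).
    os.sched_setaffinity(pid, {osd_cpu(osd_index)})
```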
>
> In conclusion about the replication models: I did not notice any
> significant difference between "primary-copy" and "chain". Perhaps it
> makes sense to play with the replication factor.
>
> In turn, "client-based" replication can be very promising for loads
> in homogeneous networks where there is no concurrent access to
> images. A simple example is a cluster with compute and storage nodes
> in a private network, where VMs access their own images. For such
> setups latency is a factor which plays a huge role.
>
> --
> Roman
>
> [1] https://github.com/rouming/pech
> [2] https://github.com/rouming/ceph/tree/pech-osd
> [3] https://github.com/rouming/linux/tree/akpm--ceph-client-based-replication
> [4] https://github.com/rouming/pech/blob/master/scripts/fio-runner.py
> _______________________________________________
> Dev mailing list -- dev(a)ceph.io
> To unsubscribe send an email to dev-leave(a)ceph.io