Hi Roman,
It's always really interesting to read your messages :) Maybe you'll
join our Telegram chat @ceph_ru? One of your colleagues is there :)
Client-based replication is of course the fastest, but the problem is
that it's unclear how to provide consistency with it. Maybe it's
possible with some restrictions, but... I don't think it's possible in
Ceph :)
By the way, have you tested their Crimson OSD? Is it any faster than
the current implementation? (regarding iodepth=1 fsync=1 latency)
> Hi all,
>
> I would like to share the comparison of 3 replication models based on
> Pech OSD [1] cluster, which supports a sufficient minimum to replicate
> transactions from OSD to OSD and keeps all data mutations in memory
> (memstore).
>
> My goal was to compare the "primary-copy", "chain" and "client-based"
> replication models and answer the question: how each model affects
> network performance.
>
> For this estimation I chose to implement my own OSD with a bare
> minimum of features (laborious, but worth it). Its design is similar
> to Crimson OSD, but the core is based on sources from the kernel
> libceph implementation (i.e. messenger, osdmap, mon_client, etc),
> thus it is written in pure C.
>
> -- What Pech OSD supports and what does not --
>
> Comparison of the network response under different replication
> scenarios does not require fail-over (we assume that during testing,
> storing data in memory never fails, hosts never crash, etc), so to
> ease development Pech OSD does not support peering or fail-over in
> the current state of the code. Object modifications are still
> replicated on each mutation, but the cluster is not able to return to
> a consistent state after an error.
>
> Pech OSD supports RBD images, so an image can be accessed from
> userspace librbd or mapped by the kernel RBD block device. That is
> the bare minimum I need to run FIO loads and test network behavior.
>
> -- What I test --
>
> Originally my goal was to compare performance under the same loads
> but using different replication models: "client-based",
> "primary-copy" and "chain". I wanted to see in numbers what the
> different models can bring in terms of network bandwidth, latency and
> IOPS (and when we compare replication models, the network is the only
> factor that impacts the overall performance).
>
> Briefly, about the replication models:
>
> "client-based" - the client itself is responsible for sending
> requests to replicas. To test this model the OSD client code was
> modified on the userspace [2] and kernel [3] sides. Pros: savings on
> network hops, which reduces latency. Cons: complications in the
> replication algorithm when the PG is not healthy or when there is
> concurrent access to the same object from different clients; the
> client network should be fat enough.
>
> "chain" - the client sends a write request to the primary, the
> primary forwards it to the next secondary, and so on. The final ACK
> from the last replica in the chain reaches the primary or the client
> directly. Pros: each OSD sends a request only once, which reduces the
> network load on a particular node and spreads the load. Overall
> bandwidth should increase. Cons: sequential request processing, which
> should impact latency.
>
> "primary-copy" - the default and only model for Ceph: the client
> accesses the primary replica, and the primary fans out data to the
> secondaries. Pros: already implemented. Cons: higher latency compared
> to "client-based", lower bandwidth compared to "chain".
>
> What is said above is the theory, which motivated me to prove or
> disprove it with numbers on a real cluster.
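To make the hop-count intuition above concrete, here is a hedged Python sketch (function and key names are mine, not from Pech OSD) counting, for one replicated write, how many messages the client sends, how many an OSD forwards, and how many hops sit on the critical (serial) path:

```python
# Hypothetical sketch of the three write paths, assuming 3x replication
# (1 primary + 2 secondaries); names are illustrative only.

def write_path(model, replicas=3):
    if model == "client-based":
        # The client sends to all replicas itself, in parallel:
        # one serial hop, but triple the client egress traffic.
        return {"client_sends": replicas, "osd_forwards": 0, "serial_hops": 1}
    if model == "primary-copy":
        # client -> primary, then the primary fans out in parallel.
        return {"client_sends": 1, "osd_forwards": replicas - 1,
                "serial_hops": 2}
    if model == "chain":
        # client -> primary -> secondary -> ...: fully sequential.
        return {"client_sends": 1, "osd_forwards": replicas - 1,
                "serial_hops": replicas}
    raise ValueError(f"unknown model: {model}")
```

The sketch mirrors the theory: "chain" spreads the sending load at the cost of the longest serial path, while "client-based" has the shortest path at the cost of client egress bandwidth.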
>
> -- How I test --
>
> I have a cluster at my disposal: 5 hosts with a 100 Gbit/s network
> for OSDs, and 8 client hosts with a 25 Gbit/s network.
>
> Each OSD host has 24 CPUs, so for obvious reasons each host runs 24
> OSDs, i.e. (24x5) 120 Pech OSDs for the whole cluster setup.
>
> There is one fully declustered pool with 1024 PGs (I want to spread
> the load as much as possible). The pool is created with a 3x
> replication factor.
>
> Each client starts 16 FIO jobs doing random writes to 16 RBD images
> (userspace RBD client) with various block sizes, i.e. one FIO job per
> image and 128 (16x8) jobs in total. Each client host runs an FIO
> server; all data from all servers is aggregated by the FIO client and
> stored in JSON format. There is a convenient Python script [4] which
> generates and runs FIO jobs, parses the JSON results and outputs them
> in a human-readable pretty table.
>
> Major FIO options:
>
> ioengine=rbd
> clientname=admin
> pool=rbd
>
> rw=randwrite
> size=256m
>
> time_based=1
> runtime=10
> ramp_time=10
>
> iodepth=32
> numjobs=1
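For illustration, a minimal Python sketch of how such a job file could be assembled from the options above, one job section per RBD image (this is not the actual fio-runner.py script [4]; job and image names are assumed):

```python
# The major FIO options from the text; block size varies per run.
OPTIONS = {
    "ioengine": "rbd", "clientname": "admin", "pool": "rbd",
    "rw": "randwrite", "size": "256m",
    "time_based": 1, "runtime": 10, "ramp_time": 10,
    "iodepth": 32, "numjobs": 1,
}

def render_jobfile(num_jobs=16, bs="4k"):
    # Shared options go into [global]; each job targets its own image.
    lines = ["[global]", f"bs={bs}"]
    lines += [f"{k}={v}" for k, v in OPTIONS.items()]
    for i in range(num_jobs):
        lines += [f"[job{i}]", f"rbdname=image{i}"]
    return "\n".join(lines)
```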
>
> During all the tests I collected almost 1 GB of JSON results. Quite
> enough for good analysis.
>
> -- Results --
>
> Firstly I would like to start by comparing "primary-copy" and
> "chain" on Pech OSD:
>
> 120OSDS/pech/primary-copy
>
> write/iops write/bw write/clat_ns/mean
> 4k 365.89 K 1.40 GB/s 11.11 ms
> 8k 330.51 K 2.52 GB/s 12.22 ms
> 16k 274.06 K 4.19 GB/s 14.79 ms
> 32k 204.36 K 6.25 GB/s 19.95 ms
> 64k 141.78 K 8.68 GB/s 28.54 ms
> 128k 70.42 K 8.64 GB/s 58.99 ms
> 256k 37.75 K 9.30 GB/s 109.75 ms
> 512k 17.46 K 8.67 GB/s 216.53 ms
> 1m 8.56 K 8.65 GB/s 474.94 ms
>
>
> 120OSDS/pech/chain
>
> write/iops write/bw write/clat_ns/mean
> 4k 380.29 K 1.45 GB/s 10.72 ms
> 8k 339.10 K 2.59 GB/s 11.99 ms
> 16k 280.28 K 4.28 GB/s 14.34 ms
> 32k 206.84 K 6.32 GB/s 19.64 ms
> 64k 131.57 K 8.05 GB/s 30.54 ms
> 128k 74.78 K 9.18 GB/s 54.25 ms
> 256k 39.82 K 9.81 GB/s 103.27 ms
> 512k 18.47 K 9.17 GB/s 213.78 ms
> 1m 8.98 K 9.08 GB/s 461.12 ms
>
>
> There is a slight difference in the direction of a bandwidth increase
> for the "chain" model, but I would rather put it down to noise. Other
> runs with a similar configuration show almost the same results: there
> is a minor bandwidth improvement, but nothing solid.
>
> Client-based results are much more interesting:
>
> 120OSDS/pech/client-based
>
> write/iops write/bw write/clat_ns/mean
> 4k 534.08 K 2.04 GB/s 7.62 ms
> 8k 471.78 K 3.60 GB/s 8.64 ms
> 16k 367.12 K 5.61 GB/s 11.11 ms
> 32k 242.56 K 7.41 GB/s 16.82 ms
> 64k 124.54 K 7.63 GB/s 32.98 ms
> 128k 62.45 K 7.67 GB/s 66.71 ms
> 256k 31.10 K 7.69 GB/s 135.36 ms
> 512k 15.41 K 7.71 GB/s 282.41 ms
> 1m 7.63 K 7.82 GB/s 567.63 ms
>
> Small blocks show a significant improvement in latency and IOPS:
> almost 40%, from 380k IOPS to 534k IOPS. Starting from the 64k block
> size the 25 Gbit/s client network is saturated ("client-based"
> replication means the client is responsible for sending the data to
> all replicas, so with a 3x replication factor each byte is sent 3
> times from each client host; having ~8 GB/s for 8 clients we estimate
> each client sends ~1 GB/s, with the 3x replication factor this is
> ~3 GB/s, which is exactly the ~24 Gbit/s of the client network).
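The saturation estimate in the parenthesis can be checked with a few lines of arithmetic:

```python
# Back-of-the-envelope check of the client network saturation estimate:
# 8 clients, ~8 GB/s aggregate write bandwidth, 3x client-based replication.

def client_egress_gbit(aggregate_gb_s=8.0, clients=8, replication=3):
    per_client = aggregate_gb_s / clients   # ~1 GB/s written per client
    on_the_wire = per_client * replication  # each byte sent 3 times
    return on_the_wire * 8                  # GB/s -> Gbit/s
```

This gives ~24 Gbit/s, right at the 25 Gbit/s limit of the client NICs.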
>
> What is important to keep in mind about the Pech OSD design is that
> each OSD process has only 1 OS thread, so when a request is received
> and its handler is executed, no preemption happens and no other
> requests can be handled in parallel (unless a special scheduling
> routine is called, which it is not, at least in the current state of
> the code). So various PGs on a particular Pech OSD are handled
> sequentially.
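The single-threaded, run-to-completion model described above can be sketched roughly like this (a simplification in Python; Pech OSD itself is written in C, and these names are illustrative):

```python
from collections import deque

# One OS thread, one queue: each handler runs to completion, so requests
# for different PGs on the same OSD are serviced strictly one after another.

def handle(request):
    pg, op = request
    # No preemption: nothing else runs until this returns.
    return f"pg={pg}: {op} done"

def event_loop(requests):
    queue = deque(requests)
    completed = []
    while queue:
        completed.append(handle(queue.popleft()))
    return completed
```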
>
> The design is highly CPU bound, so one simple trick can be applied to
> increase bandwidth: pinning OSDs to CPUs. Since we have 24 OSDs and
> 24 CPUs, CPU affinity is easy to apply:
>
> 120OSDS-AFF/pech/primary-copy
>
> write/iops write/bw write/clat_ns/mean
> 4k 324.15 K 1.24 GB/s 12.35 ms
> 8k 293.52 K 2.24 GB/s 13.43 ms
> 16k 235.53 K 3.60 GB/s 16.46 ms
> 32k 187.31 K 5.73 GB/s 20.77 ms
> 64k 170.60 K 10.43 GB/s 23.10 ms
> 128k 92.54 K 11.33 GB/s 34.48 ms
> 256k 47.69 K 11.73 GB/s 97.32 ms
> 512k 18.52 K 9.19 GB/s 252.26 ms
> 1m 9.20 K 9.28 GB/s 507.33 ms
>
> Bandwidth looks better for bigger blocks.
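The pinning itself could look like the following launcher-side sketch (hypothetical names; the Linux-only os.sched_setaffinity call is one way to do it, `taskset -c` from a start script is another):

```python
import os

def osd_cpu(osd_index, ncpus=24):
    # Trivial 1:1 mapping: 24 OSDs on a 24-CPU host, one CPU each.
    return osd_index % ncpus

def pin_osd(pid, osd_index):
    # Restrict the OSD process to its dedicated CPU (Linux only).
    os.sched_setaffinity(pid, {osd_cpu(osd_index)})
```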
>
> In conclusion about the replication models: I did not notice any
> significant difference between "primary-copy" and "chain". Perhaps it
> makes sense to play with the replication factor.
>
> In turn, "client-based" replication can be very promising for loads
> in homogeneous networks where there is no concurrent access to
> images. A simple example is a cluster with compute and storage nodes
> in a private network, where VMs access their own images. For such
> setups latency is a factor which plays a huge role.
>
> --
> Roman
>
> [1] https://github.com/rouming/pech
> [2] https://github.com/rouming/ceph/tree/pech-osd
> [3] https://github.com/rouming/linux/tree/akpm--ceph-client-based-replication
> [4] https://github.com/rouming/pech/blob/master/scripts/fio-runner.py
> _______________________________________________
> Dev mailing list -- dev(a)ceph.io
> To unsubscribe send an email to dev-leave(a)ceph.io