Hi Wido,
Long time no see! Yeah, this kind of use case is painful for Ceph.
It really hurts having to send off synchronous replica writes to other
OSDs and having to wait to ack the client until *all* have completed.
We're only as fast as the slowest replica write for each IO. We can't
parallelize things at all since it's literally a single client doing a
single IO at a time. I don't doubt your ZFS results; they sound about
right given the difference in the IO path.

FWIW, I've been doing quite a bit of work lately poking at crimson with
the "alienized" bluestore implementation using fio+librbd against a
single NVMe-backed OSD running on localhost. We're not as fast as the
classic OSD for large parallel workloads because we only have a single
reactor thread right now. We'll need multiple reactors before we can
match classic at high queue depths. At lower queue depths crimson is
actually faster though, despite not even using a seastar-native
objectstore yet.

For fun I just did a quick 30s QD1 test on our newer test hardware
that Intel donated for the community lab (Xeon Platinum CPUs, P4510
NVMe drives, etc). This is "almost master" with a couple of
additional crimson PRs using a single 16GB pre-allocated RBD volume
and all of the system-level optimizations you can imagine (no C/P
state transitions, ceph-osd pinned to a specific set of cores, fio on
localhost, no replication, etc), so it's pretty much best case in terms
of squeezing out performance. First, here's the classic OSD with
bluestore on master:
[nhm@o01 initial]$ sudo /home/nhm/src/fio/fio --ioengine=rbd \
    --direct=1 --bs=4096B --iodepth=1 --end_fsync=1 --rw=randwrite \
    --norandommap --size=16384M --numjobs=1 --runtime=30 --time_based \
    --clientname=admin --pool=cbt-librbd --rbdname=`hostname -f`-0 \
    --invalidate=0 --name=cbt-librbd/`hostname -f`-0-0
cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=14.5MiB/s][w=3723 IOPS][eta 00m:00s]
cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (groupid=0, jobs=1): err= 0: pid=2085951: Sun Feb 7 12:55:19 2021
  write: IOPS=3804, BW=14.9MiB/s (15.6MB/s)(446MiB/30001msec); 0 zone resets
    slat (usec): min=2, max=474, avg= 6.85, stdev= 2.73
    clat (usec): min=187, max=2658, avg=255.05, stdev=81.21
     lat (usec): min=192, max=2704, avg=261.90, stdev=81.75
    clat percentiles (usec):
     |  1.00th=[  200],  5.00th=[  204], 10.00th=[  210], 20.00th=[  219],
     | 30.00th=[  225], 40.00th=[  231], 50.00th=[  235], 60.00th=[  239],
     | 70.00th=[  245], 80.00th=[  253], 90.00th=[  293], 95.00th=[  502],
     | 99.00th=[  570], 99.50th=[  611], 99.90th=[  963], 99.95th=[  988],
     | 99.99th=[ 1401]
   bw (  KiB/s): min=14704, max=16104, per=100.00%, avg=15234.44, stdev=260.42, samples=59
   iops        : min= 3676, max= 4026, avg=3808.61, stdev=65.10, samples=59
  lat (usec)   : 250=76.75%, 500=18.23%, 750=4.86%, 1000=0.12%
  lat (msec)   : 2=0.03%, 4=0.01%
  cpu          : usr=3.12%, sys=2.55%, ctx=114166, majf=0, minf=184
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,114148,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
  WRITE: bw=14.9MiB/s (15.6MB/s), 14.9MiB/s-14.9MiB/s (15.6MB/s-15.6MB/s), io=446MiB (468MB), run=30001-30001msec
Disk stats (read/write):
    dm-2: ios=16/103, merge=0/0, ticks=1/24, in_queue=25, util=0.03%, aggrios=31/426, aggrmerge=0/115, aggrticks=4/68, aggrin_queue=76, aggrutil=0.30%
  sda: ios=31/426, merge=0/115, ticks=4/68, in_queue=76, util=0.30%
IMHO this is about as high as we can get right now on classic with
everything stacked in our favor. Latency increases quickly once you
involve remote network clients, multiple OSDs, and replication, etc.
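To illustrate why replication hurts so much at QD1, here is a rough toy model (not Ceph code) of the point made earlier: the client is only acked once all replica writes have completed, so every IO costs as much as its slowest replica write. The per-replica latency distribution, the function names, and all of the numbers are made-up assumptions for illustration, and the model ignores the extra network hop between the primary and the other OSDs.

```python
import random

def ack_latency_us(replica_latencies_us):
    """Client-visible latency when the ack waits for *all* replica writes."""
    return max(replica_latencies_us)

def mean_qd1_latency_us(num_replicas, samples=20000, seed=1):
    """Average ack latency under a hypothetical per-replica latency distribution."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        # Assumed distribution: 200 us floor plus an exponential tail (mean 50 us).
        writes = [200.0 + rng.expovariate(1 / 50.0) for _ in range(num_replicas)]
        total += ack_latency_us(writes)
    return total / samples

if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} replica(s): mean ack latency ~{mean_qd1_latency_us(n):.0f} us")
```

Even with identical hardware on every replica, the mean of the maximum grows with the replica count, because each IO inherits the worst tail among its replica writes.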
Here's the same test with crimson using alienized bluestore:
[nhm@o01 initial]$ sudo /home/nhm/src/fio/fio --ioengine=rbd \
    --direct=1 --bs=4096B --iodepth=1 --end_fsync=1 --rw=randwrite \
    --norandommap --size=16384M --numjobs=1 --runtime=30 --time_based \
    --clientname=admin --pool=cbt-librbd --rbdname=`hostname -f`-0 \
    --invalidate=0 --name=cbt-librbd/`hostname -f`-0-0
cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=22.0MiB/s][w=5886 IOPS][eta 00m:00s]
cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (groupid=0, jobs=1): err= 0: pid=2075392: Sun Feb 7 12:44:32 2021
  write: IOPS=5647, BW=22.1MiB/s (23.1MB/s)(662MiB/30001msec); 0 zone resets
    slat (usec): min=2, max=527, avg= 2.93, stdev= 1.34
    clat (usec): min=138, max=41367, avg=173.81, stdev=232.51
     lat (usec): min=141, max=41370, avg=176.74, stdev=232.52
    clat percentiles (usec):
     |  1.00th=[  153],  5.00th=[  155], 10.00th=[  155], 20.00th=[  157],
     | 30.00th=[  159], 40.00th=[  161], 50.00th=[  163], 60.00th=[  163],
     | 70.00th=[  165], 80.00th=[  169], 90.00th=[  176], 95.00th=[  210],
     | 99.00th=[  644], 99.50th=[  725], 99.90th=[  832], 99.95th=[  881],
     | 99.99th=[ 1221]
   bw (  KiB/s): min=19024, max=23792, per=100.00%, avg=22632.32, stdev=964.77, samples=59
   iops        : min= 4756, max= 5948, avg=5658.07, stdev=241.20, samples=59
  lat (usec)   : 250=97.23%, 500=1.64%, 750=0.77%, 1000=0.33%
  lat (msec)   : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=1.91%, sys=1.78%, ctx=169457, majf=1, minf=336
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,169444,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
  WRITE: bw=22.1MiB/s (23.1MB/s), 22.1MiB/s-22.1MiB/s (23.1MB/s-23.1MB/s), io=662MiB (694MB), run=30001-30001msec
Disk stats (read/write):
    dm-2: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=1/308, aggrmerge=0/11, aggrticks=0/47, aggrin_queue=51, aggrutil=0.27%
  sda: ios=1/308, merge=0/11, ticks=0/47, in_queue=51, util=0.27%
About 50% faster with lower latency, and we haven't even done much to
optimize it yet. It's not as fast as your ZFS results below, but at
least for a single OSD without replication I'm glad to see crimson is
getting us close to being in the same ballpark. I will note that the
crimson-osd process was using 23GB (!) in this test so it's still very
alpha code (and take the test result with a grain of salt since we're
only doing minimal QA right now). At least we have a target to
maintain and hopefully improve though as we continue to work toward
stabilizing it.
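For anyone who wants to check the "about 50%" figure, the comparison can be redone from the two fio summaries above. This is only a small sketch: the dictionary layout and the helper name pct_change are my own, and the numbers are copied from the IOPS, average clat, and 99.00th percentile lines of each run.

```python
# Figures copied from the two fio summaries (classic first, then crimson).
classic = {"iops": 3804, "clat_avg_us": 255.05, "p99_us": 570}
crimson = {"iops": 5647, "clat_avg_us": 173.81, "p99_us": 644}

def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

iops_gain = pct_change(classic["iops"], crimson["iops"])
clat_change = pct_change(classic["clat_avg_us"], crimson["clat_avg_us"])
p99_change = pct_change(classic["p99_us"], crimson["p99_us"])

print(f"IOPS:     {iops_gain:+.1f}%")
print(f"avg clat: {clat_change:+.1f}%")
print(f"p99 clat: {p99_change:+.1f}%")
```

This works out to roughly +48% IOPS and about -32% average completion latency. The 99th percentile is actually around 13% higher in the crimson run (644 us vs 570 us), so the gain is in the common case rather than in the tail.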
Mark
On 2/5/21 7:51 AM, Wido den Hollander wrote:
(Sending it to dev list as people might know it there)
Hi,
There are many talks and presentations out there about Ceph's
performance. Ceph is great when it comes to parallel I/O, large queue
depths and many applications sending I/O towards Ceph.
One thing where Ceph isn't the fastest is 4k blocks written at queue
depth 1.
Some applications benefit very much from high-performance, low-latency
I/O at qd=1, for example single-threaded applications which are writing
small files inside a VM running on RBD.
With some tuning you can get to a ~700us latency for a 4k write with
qd=1 (replication, size=3).
I benchmark this using fio:
$ fio --ioengine=rbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..
700us latency means the result will be roughly 1,400 IOPS (1000 / 0.7 is
about 1,430).
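The arithmetic behind that estimate, as a minimal sketch: at queue depth 1 only one IO is ever in flight, so IOPS is simply the inverse of the per-IO latency. The function names are hypothetical; the latency and IOPS values are the ones mentioned in this message.

```python
def qd1_iops(latency_us):
    """At queue depth 1 only one IO is in flight, so IOPS = 1 / latency."""
    return 1_000_000 / latency_us

def latency_us_for(iops):
    """Inverse: the per-IO latency needed to sustain a given QD1 IOPS."""
    return 1_000_000 / iops

for lat in (700, 500):
    print(f"{lat} us -> {qd1_iops(lat):.0f} IOPS")
for target in (7_000, 10_000):
    print(f"{target} IOPS -> {latency_us_for(target):.0f} us")
```

So 700 us gives about 1,429 IOPS and 500 us gives 2,000 IOPS, while 7,000 to 10,000 IOPS at qd=1 requires a per-write latency of roughly 143 us down to 100 us.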
When comparing this to, let's say, a BSD machine running ZFS, that's on
the low side. With ZFS+NVMe you'll be able to reach somewhere between
7,000 and 10,000 IOPS; the latency is simply much lower.
My benchmarking / test setup for this:
- Ceph Nautilus/Octopus (doesn't make a big difference)
- 3x SuperMicro 1U with:
  - AMD Epyc 7302P 16-core CPU
  - 128GB DDR4
  - 10x Samsung PM983 3.84TB
  - 10Gbit Base-T networking
Things to configure/tune:
- C-State pinning to 1
- CPU governor to performance
- Turn off all logging in Ceph (debug_osd, debug_ms, debug_bluestore=0)
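One way to verify the governor setting from the list above is to read it back from sysfs. This is a minimal sketch that assumes a Linux host exposing the usual cpufreq files under /sys/devices/system/cpu; the function name is made up, and the sysfs_root parameter exists only so the check can be pointed at a different directory.

```python
from pathlib import Path

def cpus_not_in_performance(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for every CPU whose governor is not 'performance'."""
    offenders = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(sysfs_root).glob(pattern)):
        governor = gov_file.read_text().strip()
        if governor != "performance":
            # gov_file is .../cpuN/cpufreq/scaling_governor, so go up two levels.
            offenders[gov_file.parent.parent.name] = governor
    return offenders

if __name__ == "__main__":
    bad = cpus_not_in_performance()
    if bad:
        print("CPUs not using the performance governor:", bad)
    else:
        print("No CPU reports a governor other than performance.")
```

An empty result can also mean the host exposes no cpufreq interface at all (for example inside some VMs), so it is worth checking that the files exist before trusting it.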
Higher clock speeds (New AMD Epyc coming in March!) help to reduce the
latency and going towards 25Gbit/100Gbit might help as well.
These are however only very small increments and might help to reduce
the latency by another 15% or so.
It doesn't bring us anywhere near the 10k IOps other applications can
do.
And I totally understand that replication over a TCP/IP network takes
time and thus increases latency.
The Crimson project [0] is aiming to lower the latency with many things
like DPDK and SPDK, but this is far from finished and not yet production
ready.
In the meantime, am I overlooking some things here? Can we further
reduce the latency of the current OSDs?
Reaching a ~500us latency would already be great!
Thanks,
Wido
[0]: https://docs.ceph.com/en/latest/dev/crimson/crimson/
_______________________________________________
Dev mailing list -- dev(a)ceph.io
To unsubscribe send an email to dev-leave(a)ceph.io
Hi Mark/Wido,
Yes, this is definitely an important use case: getting high performance
without having a very large number of parallel clients/ops (as compared
to the number of OSDs). I am happy to hear about the progress on Crimson
and can't wait to see it mature. You mention adding more threads to
Crimson in the future to match classic at high queue depths. This
sounds great, but in case it leads to locks being added that impact the
single queue depth case, even slightly, I would recommend having an OSD
config value that enables a single-thread mode, in case that gives the
best performance at queue depth 1.
I ran some QD1 tests on a test Octopus/BlueStore system using 1 OSD on a
RAM disk, only 1 replica, and the client on localhost, which gives
approx 0.28 ms latency:
rbd bench --io-type write rbd/image-01 --io-threads=1 --io-size 4K \
    --io-pattern rand --rbd_cache=false
  SEC       OPS   OPS/SEC   BYTES/SEC
    1      3584   3599.33    14 MiB/s
    2      7241   3628.19    14 MiB/s
    3     10833   3616.09    14 MiB/s
fio --ioengine=rbd --name=xx --pool=rbd --rbdname=image-01 --iodepth=1 \
    --rw=randwrite --bs=4k --direct=1 --runtime=10 --time_based
Run status group 0 (all jobs):
  WRITE: bw=12.8MiB/s (13.5MB/s), 12.8MiB/s-12.8MiB/s (13.5MB/s-13.5MB/s), io=128MiB (135MB), run=10001-10001msec
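As a cross-check of the "approx 0.28 ms" figure, the two results above can be converted to a per-IO latency. This is a small sketch with hypothetical helper names; it assumes queue depth 1 and 4 KiB IOs throughout, and takes ~3600 ops/s as a round figure for the rbd bench rows.

```python
KIB = 1024
MIB = 1024 * KIB

def qd1_latency_ms_from_ops(ops_per_sec):
    """Per-IO latency implied by a QD1 ops/s figure."""
    return 1000.0 / ops_per_sec

def qd1_latency_ms_from_bw(bw_mib_s, block_bytes=4 * KIB):
    """Per-IO latency implied by a QD1 bandwidth figure at a fixed block size."""
    iops = bw_mib_s * MIB / block_bytes
    return 1000.0 / iops

print(f"rbd bench, ~3600 ops/s: {qd1_latency_ms_from_ops(3600):.3f} ms")
print(f"fio, 12.8 MiB/s at 4k:  {qd1_latency_ms_from_bw(12.8):.3f} ms")
```

The rbd bench numbers correspond to about 0.278 ms per write, and the fio run (12.8 MiB/s, i.e. about 3,277 IOPS) to about 0.305 ms.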
I think the Ceph messenger (msgr) contributes a large part of the latency:
ceph_perf_msgr_server 127.0.0.1:9000 64 10
ceph_perf_msgr_client 127.0.0.1:9000 1 1 1000 10 4096
103154 us (count = 1000) -> 103 us (count = 1)
This 0.1 ms latency is quite a large overhead for a simple msgr echo
test, especially when compared to the latency tests below:
TCP client/server socket echo test with epoll wait (the same mechanism
msgr uses to wait for events), 4k block size:
Sent 1000000 messages, avg latency 7.117855 us
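For reference, an echo test of this kind can be sketched in a few lines. This is not the program that produced the number above; it is a hypothetical Python approximation that sends 4k messages over a localhost TCP socket and waits for each echo using the selectors module (which uses epoll on Linux). The function names and the message count are my own choices, and Python's interpreter overhead means it will report higher latencies than a C implementation.

```python
import selectors
import socket
import threading
import time

MSG_SIZE = 4096  # 4k messages, matching the block size of the test above

def recv_exact(sock, sel, nbytes):
    """Wait for readiness via the selector, then read until nbytes arrived."""
    chunks = []
    remaining = nbytes
    while remaining:
        sel.select()                      # block until the socket is readable
        data = sock.recv(remaining)
        if not data:
            return b""                    # peer closed the connection
        chunks.append(data)
        remaining -= len(data)
    return b"".join(chunks)

def echo_server(listener):
    """Accept one client and echo every full message back until it disconnects."""
    conn, _ = listener.accept()
    with conn, selectors.DefaultSelector() as sel:
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        sel.register(conn, selectors.EVENT_READ)
        while True:
            msg = recv_exact(conn, sel, MSG_SIZE)
            if not msg:
                break
            conn.sendall(msg)

def measure_echo_latency_us(count=10000):
    """Average round-trip time in microseconds for `count` sequential echoes."""
    listener = socket.create_server(("127.0.0.1", 0))   # port 0: any free port
    port = listener.getsockname()[1]
    server = threading.Thread(target=echo_server, args=(listener,), daemon=True)
    server.start()
    payload = b"x" * MSG_SIZE
    with socket.create_connection(("127.0.0.1", port)) as client, \
            selectors.DefaultSelector() as sel:
        client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        sel.register(client, selectors.EVENT_READ)
        start = time.perf_counter()
        for _ in range(count):
            client.sendall(payload)       # one message in flight at a time
            recv_exact(client, sel, MSG_SIZE)
        elapsed = time.perf_counter() - start
    server.join(timeout=5)
    listener.close()
    return elapsed / count * 1e6

if __name__ == "__main__":
    print(f"avg round-trip latency: {measure_echo_latency_us():.1f} us")
```

Only one message is outstanding at any time, which is the socket-level equivalent of a queue depth of 1.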
tcp latency
qperf 127.0.0.1 tcp_lat
latency = 6.04 us
ping latency/rtt
ping -c1 -q -W1 127.0.0.1
rtt min/avg/max/mdev = 0.018/0.018/0.018/0.000 ms
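To put the measurements above side by side, here is a short sketch that derives the per-message msgr latency and compares it with the three baselines. The variable names are hypothetical; all values are copied from the results in this message.

```python
total_us, count = 103154, 1000          # ceph_perf_msgr_client result
msgr_us = total_us / count              # per-message latency in microseconds

baselines_us = {
    "tcp echo (epoll, 4k)": 7.117855,
    "qperf tcp_lat": 6.04,
    "ping rtt avg": 0.018 * 1000,       # reported as 0.018 ms
}

print(f"msgr echo: {msgr_us:.1f} us per message")
for name, us in baselines_us.items():
    print(f"  vs {name}: {us:.2f} us, msgr is {msgr_us / us:.1f}x slower")
```

By these numbers the msgr echo is roughly 14x slower than the plain TCP echo test, about 17x slower than qperf's tcp_lat, and close to 6x the ping round trip.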
/Maged