Are you sure your ssd pool contains only SSDs and not some HDDs as well? In past
versions of Ceph you had to modify the CRUSH rules to separate the ssd and hdd
device classes. It could be that this is no longer necessary in Pacific.
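One way to check this is to group the OSDs by device class. The sketch below is only an illustration: it assumes that `ceph osd tree -f json` returns a "nodes" list whose OSD entries carry a "device_class" field, and the sample data and the helper name osds_by_class are hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical sample, shaped like the "nodes" list that
# `ceph osd tree -f json` is assumed to return (real output has more fields).
SAMPLE = json.loads("""
{"nodes": [
  {"id": -1, "name": "default", "type": "root"},
  {"id": 0, "name": "osd.0", "type": "osd", "device_class": "ssd"},
  {"id": 1, "name": "osd.1", "type": "osd", "device_class": "hdd"},
  {"id": 2, "name": "osd.2", "type": "osd", "device_class": "ssd"}
]}
""")

def osds_by_class(tree):
    """Group OSD names by their CRUSH device class."""
    groups = defaultdict(list)
    for node in tree.get("nodes", []):
        if node.get("type") == "osd":
            # OSDs without a class are collected under "unknown".
            groups[node.get("device_class", "unknown")].append(node["name"])
    return dict(groups)

print(osds_by_class(SAMPLE))
```

If the OSDs backing the pool show up under more than one class, the pool's CRUSH rule is worth a closer look.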
-----Original Message-----
From: Schmid, Michael <m.schmid(a)fosbos-rosenheim.de>
Sent: 29 April 2021 15:52
To: ceph-users(a)ceph.io
Subject: [ceph-users] Performance questions - 4 node (commodity) cluster
- what to expect (and what not ;-)
Hello folks,
I am new to ceph and at the moment I am doing some performance tests
with a 4 node ceph-cluster (pacific, 16.2.1).
Node hardware (4 identical nodes):
* DELL 3620 workstation
* Intel Quad-Core i7-6700 @ 3.4 GHz
* 8 GB RAM
* Debian Buster (base system, installed on a dedicated Patriot Burst
120 GB SATA SSD)
* HP 530SFP+ 10 GBit dual-port NIC (tested with iperf at 9.4 GBit/s
from node to node)
* 1 x Kingston KC2500 M.2 NVMe PCIe SSD (500 GB, NO power-loss
protection!)
* 3 x Seagate Barracuda SATA disk drives (7200 rpm, 500 GB)
After bootstrapping a containerized (docker) Ceph cluster, I ran some
performance tests on the NVMe storage by creating a storage pool called
"ssdpool", consisting of 4 OSDs on the single NVMe device in each node. A
first write-performance test yields:
=============
root@ceph1:~# rados bench -p ssdpool 10 write -b 4M -t 16 --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_ceph1_78
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      16        30        14    55.997        56    0.0209977    0.493427
    2      16        53        37   73.9903        92    0.0264305    0.692179
    3      16        76        60   79.9871        92     0.559505    0.664204
    4      16        99        83   82.9879        92     0.609332    0.721016
    5      16       116       100   79.9889        68     0.686093    0.698084
    6      16       132       116   77.3224        64      1.19715    0.731808
    7      16       153       137   78.2741        84     0.622646    0.755812
    8      16       171       155    77.486        72      0.25409    0.764022
    9      16       192       176   78.2076        84     0.968321    0.775292
   10      16       214       198   79.1856        88     0.401339    0.766764
   11       1       214       213   77.4408        60     0.969693    0.784002
Total time run: 11.0698
Total writes made: 214
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 77.3272
Stddev Bandwidth: 13.7722
Max bandwidth (MB/sec): 92
Min bandwidth (MB/sec): 56
Average IOPS: 19
Stddev IOPS: 3.44304
Max IOPS: 23
Min IOPS: 14
Average Latency(s): 0.785372
Stddev Latency(s): 0.49011
Max latency(s): 2.16532
Min latency(s): 0.0144995
=============
... and I think that roughly 80 MB/s of throughput is a very poor result for
NVMe devices and 10 GBit NICs.
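As a rough cross-check of the summary above, the reported bandwidth can be recomputed from the totals, and the operation rate can be estimated from concurrency and average latency (Little's law). This is a sketch, not part of the original test; the function names are made up for illustration, and the 9.4 GBit/s link figure is the iperf result quoted in the hardware list.

```python
MIB = 1024 * 1024

# Values copied from the rados bench summary.
writes_made = 214        # "Total writes made"
write_size = 4194304     # "Write size" in bytes
total_time = 11.0698     # "Total time run" in seconds
concurrency = 16         # -t 16
avg_latency = 0.785372   # "Average Latency(s)"

def bandwidth_mib_s(ops, op_bytes, seconds):
    """Bandwidth in MiB/s from an operation count, size and run time."""
    return ops * op_bytes / MIB / seconds

def ops_per_second(in_flight, latency_s):
    """Little's law: throughput = operations in flight / latency."""
    return in_flight / latency_s

bw = bandwidth_mib_s(writes_made, write_size, total_time)
ops = ops_per_second(concurrency, avg_latency)

# Share of the measured link speed (9.4 GBit/s per iperf) that this uses.
link_bytes_s = 9.4e9 / 8
link_share = bw * MIB / link_bytes_s

print(f"bandwidth: {bw:.2f} MiB/s")
print(f"estimated ops/s from latency: {ops:.1f}")
print(f"share of link used: {link_share:.1%}")
```

The recomputed bandwidth only matches the reported 77.3272 when dividing by 2^20, so the "MB/sec" in this output is effectively MiB/s. The latency-based estimate is also close to the reported average IOPS, which suggests that per-operation latency, rather than the network, limits throughput at this queue depth.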
A raw write test of the NVMe drives (with the fsync=0 option) yields a
write throughput of roughly 800 MB/s per device; a second test
(with fsync=1) drops to about 200 MB/s.
=============
root@ceph1:/home/mschmid# fio --rw=randwrite --name=IOPS-write \
    --bs=1024k --direct=1 --filename=/dev/nvme0n1 --numjobs=4 \
    --ioengine=libaio --iodepth=32 --refill_buffers --group_reporting \
    --runtime=30 --time_based --fsync=0
IOPS-write: (g=0): rw=randwrite, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32...
fio-3.12
Starting 4 processes
Jobs: 4 (f=4): [w(4)][100.0%][w=723MiB/s][w=722 IOPS][eta 00m:00s]
IOPS-write: (groupid=0, jobs=4): err= 0: pid=31585: Thu Apr 29 15:15:03 2021
  write: IOPS=740, BW=740MiB/s (776MB/s)(21.8GiB/30206msec); 0 zone resets
    slat (usec): min=16, max=810, avg=106.48, stdev=30.48
    clat (msec): min=7, max=1110, avg=172.09, stdev=120.18
     lat (msec): min=7, max=1110, avg=172.19, stdev=120.18
    clat percentiles (msec):
     |  1.00th=[   32],  5.00th=[   48], 10.00th=[   53], 20.00th=[   63],
     | 30.00th=[  115], 40.00th=[  161], 50.00th=[  169], 60.00th=[  178],
     | 70.00th=[  190], 80.00th=[  220], 90.00th=[  264], 95.00th=[  368],
     | 99.00th=[  667], 99.50th=[  751], 99.90th=[  894], 99.95th=[  986],
     | 99.99th=[ 1036]
   bw (  KiB/s): min=22528, max=639744, per=25.02%, avg=189649.94, stdev=113845.69, samples=240
   iops        : min=   22, max=  624, avg=185.11, stdev=111.18, samples=240
  lat (msec)   : 10=0.01%, 20=0.19%, 50=6.43%, 100=20.29%, 250=61.52%
  lat (msec)   : 500=8.21%, 750=2.85%, 1000=0.47%
  cpu          : usr=11.87%, sys=2.05%, ctx=13141, majf=0, minf=45
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.3%, 32=99.4%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,22359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
  WRITE: bw=740MiB/s (776MB/s), 740MiB/s-740MiB/s (776MB/s-776MB/s), io=21.8GiB (23.4GB), run=30206-30206msec
Disk stats (read/write):
  nvme0n1: ios=0/89150, merge=0/0, ticks=0/15065724, in_queue=15118720, util=99.75%
=============
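The fio figures above mix binary and decimal units (MiB/s and MB/s). As a sanity check, the headline bandwidth can be recomputed from the issued write count and the run time. This is a sketch with illustrative variable names; every input is copied from the output above.

```python
MIB = 1024 * 1024

# Values copied from the fio output.
issued_writes = 22359       # "issued rwts: total=0,22359,0,0"
block_bytes = 1024 * 1024   # --bs=1024k
runtime_s = 30.206          # "run=30206-30206msec"
per_job_avg_kib_s = 189649.94
num_jobs = 4                # --numjobs=4

# Aggregate bandwidth from total bytes written over the run time.
bw_mib_s = issued_writes * block_bytes / MIB / runtime_s

# The same figure in decimal megabytes per second.
bw_mb_s = bw_mib_s * MIB / 1e6

# Cross-check: sum of the per-job averages, converted from KiB/s to MiB/s.
bw_from_jobs_mib_s = per_job_avg_kib_s * num_jobs / 1024

print(f"{bw_mib_s:.1f} MiB/s = {bw_mb_s:.1f} MB/s")
print(f"sum of per-job averages: {bw_from_jobs_mib_s:.1f} MiB/s")
```

Both routes land on the reported 740 MiB/s (776 MB/s), so the roughly 800 MB/s quoted for the raw device is the decimal figure.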
Furthermore, an IOPS test on the NVMe device with a block size of 4k shows
roughly 1000 IOPS with fsync=1 and 35000 IOPS with fsync=0.
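To put these two IOPS figures into perspective, they can be converted into throughput and an approximate per-write latency. This is a sketch under a stated assumption: the queue depth of the 4k test is not given here, so the latency estimate assumes one outstanding write at a time. The function names are illustrative.

```python
MIB = 1024 * 1024
BLOCK_BYTES = 4 * 1024   # 4k block size from the test description

def throughput_mib_s(iops, block_bytes=BLOCK_BYTES):
    """Throughput in MiB/s for a given IOPS rate and block size."""
    return iops * block_bytes / MIB

def per_write_latency_ms(iops, queue_depth=1):
    """Approximate latency per write in ms.

    Assumption: queue_depth writes are outstanding at any time
    (the depth of the original 4k test is not stated).
    """
    return queue_depth / iops * 1000.0

sync_iops = 1000     # approximate figure quoted for fsync=1
async_iops = 35000   # approximate figure quoted for fsync=0

for label, iops in (("fsync=1", sync_iops), ("fsync=0", async_iops)):
    print(f"{label}: {throughput_mib_s(iops):.1f} MiB/s, "
          f"~{per_write_latency_ms(iops):.3f} ms per write at depth 1")
```

Under that assumption, each synced 4k write takes about a millisecond, a factor of 35 slower than without fsync.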
Now to my question: since CPU and network load seem to be low during my tests,
I would like to know which bottleneck could cause such a large gap
between the raw performance of the NVMe drives and the
write speeds in the rados benchmark. Could the missing power-loss
protection (see the fsync=1 results) be the problem, and what throughput
should one expect in such a setup?
Thanks for any advice!
Best regards,
Michael