[ceph-users] Re: Performance questions - 4 node (commodity) cluster - what to expect (and what not ;-)

30 Apr 2021

Can you collect the output of this command on all 4 servers while your
test is running:

iostat -mtxy 1

This should show how busy the CPUs are as well as how busy each drive is.
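One way to capture that output for later comparison is sketched below. The helper name `collect_iostat`, the log file name, and the choice of 15 samples are assumptions for illustration, not something from this thread. The 15 one-second samples are only meant to cover a 10-second benchmark run with some margin.

```shell
# Hypothetical helper (not from the thread): log iostat for a fixed number of samples.
#   $1 = number of one-second samples
#   $2 = output file
collect_iostat() {
    samples="$1"
    logfile="$2"
    # -m: report in MB/s, -t: timestamp each report,
    # -x: extended per-device statistics, -y: skip the since-boot report
    iostat -mtxy 1 "$samples" > "$logfile"
}

# Example, started on each of the 4 servers just before the benchmark:
#   collect_iostat 15 "iostat-$(hostname).log" &
```

Writing one log per host makes it easier to line up the timestamps from `-t` against the per-second rows of the benchmark output.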

On Thu, Apr 29, 2021 at 7:52 AM Schmid, Michael
<m.schmid(a)fosbos-rosenheim.de> wrote:
>
> Hello folks,
>
> I am new to Ceph and am currently running some performance tests on a 4-node Ceph cluster (Pacific, 16.2.1).
>
> Node hardware (4 identical nodes):
>
>   *   DELL 3620 workstation
>   *   Intel Quad-Core i7-6700(a)3.4 GHz
>   *   8 GB RAM
>   *   Debian Buster (base system, installed on a dedicated Patriot Burst 120 GB SATA SSD)
>   *   HP 530SFP+ 10 GBit dual-port NIC (tested with iperf at 9.4 GBit/s from node to node)
>   *   1 x Kingston KC2500 M.2 NVMe PCIe SSD (500 GB, NO power loss protection!)
>   *   3 x Seagate Barracuda SATA disk drives (7200 rpm, 500 GB)
>
> After bootstrapping a containerized (Docker) Ceph cluster, I ran some performance tests on the NVMe storage by creating a storage pool called "ssdpool", consisting of 4 OSDs per (one) NVMe device (per node). A first write-performance test yields:
>
> =============
> root@ceph1:~# rados bench -p ssdpool 10 write -b 4M -t 16 --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_ceph1_78
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>     0       0         0         0         0         0           -           0
>     1      16        30        14    55.997        56   0.0209977    0.493427
>     2      16        53        37   73.9903        92   0.0264305    0.692179
>     3      16        76        60   79.9871        92    0.559505    0.664204
>     4      16        99        83   82.9879        92    0.609332    0.721016
>     5      16       116       100   79.9889        68    0.686093    0.698084
>     6      16       132       116   77.3224        64     1.19715    0.731808
>     7      16       153       137   78.2741        84    0.622646    0.755812
>     8      16       171       155    77.486        72     0.25409    0.764022
>     9      16       192       176   78.2076        84    0.968321    0.775292
>    10      16       214       198   79.1856        88    0.401339    0.766764
>    11       1       214       213   77.4408        60    0.969693    0.784002
> Total time run:         11.0698
> Total writes made:      214
> Write size:             4194304
> Object size:            4194304
> Bandwidth (MB/sec):     77.3272
> Stddev Bandwidth:       13.7722
> Max bandwidth (MB/sec): 92
> Min bandwidth (MB/sec): 56
> Average IOPS:           19
> Stddev IOPS:            3.44304
> Max IOPS:               23
> Min IOPS:               14
> Average Latency(s):     0.785372
> Stddev Latency(s):      0.49011
> Max latency(s):         2.16532
> Min latency(s):         0.0144995
> =============
>
> ... and I think that 80 MB/s throughput is a very poor result for NVMe devices and 10 GBit NICs.
>
> A bare write test of the NVMe drives (with the fsync=0 option) yields a write throughput of roughly 800 MB/s per device; a second test (with fsync=1) drops to 200 MB/s.
>
> =============
> root@ceph1:/home/mschmid# fio --rw=randwrite --name=IOPS-write --bs=1024k --direct=1 --filename=/dev/nvme0n1 --numjobs=4 --ioengine=libaio --iodepth=32 --refill_buffers --group_reporting --runtime=30 --time_based --fsync=0
> IOPS-write: (g=0): rw=randwrite, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32...
> fio-3.12
> Starting 4 processes
> Jobs: 4 (f=4): [w(4)][100.0%][w=723MiB/s][w=722 IOPS][eta 00m:00s]
> IOPS-write: (groupid=0, jobs=4): err= 0: pid=31585: Thu Apr 29 15:15:03 2021
>   write: IOPS=740, BW=740MiB/s (776MB/s)(21.8GiB/30206msec); 0 zone resets
>     slat (usec): min=16, max=810, avg=106.48, stdev=30.48
>     clat (msec): min=7, max=1110, avg=172.09, stdev=120.18
>      lat (msec): min=7, max=1110, avg=172.19, stdev=120.18
>     clat percentiles (msec):
>      |  1.00th=[   32],  5.00th=[   48], 10.00th=[   53], 20.00th=[   63],
>      | 30.00th=[  115], 40.00th=[  161], 50.00th=[  169], 60.00th=[  178],
>      | 70.00th=[  190], 80.00th=[  220], 90.00th=[  264], 95.00th=[  368],
>      | 99.00th=[  667], 99.50th=[  751], 99.90th=[  894], 99.95th=[  986],
>      | 99.99th=[ 1036]
>    bw (  KiB/s): min=22528, max=639744, per=25.02%, avg=189649.94, stdev=113845.69, samples=240
>    iops        : min=   22, max=  624, avg=185.11, stdev=111.18, samples=240
>   lat (msec)   : 10=0.01%, 20=0.19%, 50=6.43%, 100=20.29%, 250=61.52%
>   lat (msec)   : 500=8.21%, 750=2.85%, 1000=0.47%
>   cpu          : usr=11.87%, sys=2.05%, ctx=13141, majf=0, minf=45
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.3%, 32=99.4%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>      issued rwts: total=0,22359,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
>   WRITE: bw=740MiB/s (776MB/s), 740MiB/s-740MiB/s (776MB/s-776MB/s), io=21.8GiB (23.4GB), run=30206-30206msec
>
> Disk stats (read/write):
>   nvme0n1: ios=0/89150, merge=0/0, ticks=0/15065724, in_queue=15118720, util=99.75%
> =============
>
> Furthermore, an IOPS test on the NVMe device with a 4k block size shows roughly 1000 IOPS with fsync=1 and 35000 IOPS with fsync=0.
>
> Now to my question: since CPU and network load seem to be low during my tests, I would like to know which bottleneck can cause such a large drop between the bare hardware performance of the NVMe drives and the write speeds in the rados benchmark. Could the missing power loss protection (fsync=1) be the problem, or what throughput should one expect in such a setup?
>
> Thanks for any advice!
>
> Best regards,
> Michael
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
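
As a sanity check on the quoted rados bench summary, the reported bandwidth can be recomputed from the number of writes, the object size, and the run time, and then compared with what the average latency allows at 16 writes in flight. The sketch below uses Little's law (completions per second = requests in flight / average latency) as a steady-state approximation. The function names are hypothetical and only the input numbers come from the quoted output.

```shell
# Hypothetical helpers for checking a rados bench summary; names are illustrative.

# Bandwidth = objects written * object size (MB) / run time (s)
bench_bandwidth() {
    awk -v n="$1" -v mb="$2" -v t="$3" 'BEGIN { printf "%.2f\n", n * mb / t }'
}

# Little's law: completions/s = requests in flight / average latency (s),
# multiplied by the object size (MB) to get MB/s
latency_bound_bandwidth() {
    awk -v q="$1" -v lat="$2" -v mb="$3" 'BEGIN { printf "%.2f\n", q / lat * mb }'
}

# Numbers from the quoted run: 214 writes of 4 MB in 11.0698 s,
# 16 concurrent writes, 0.785372 s average latency
bench_bandwidth 214 4 11.0698
latency_bound_bandwidth 16 0.785372 4
```

Both figures come out close to the reported 77 MB/s, which suggests that at this queue depth the result is governed by per-write latency rather than by raw device or network bandwidth. That is consistent with the question about fsync behaviour, though it does not by itself identify the cause.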
