Hello,

Here are the results on the NVMe partition directly:

- libaio randwrite /dev/nvme1n1p4 => WRITE: bw=12.1MiB/s (12.7MB/s), 12.1MiB/s-12.1MiB/s (12.7MB/s-12.7MB/s), io=728MiB (763MB), run=60001-60001msec
- libaio randread /dev/nvme1n1p4 => READ: bw=35.6MiB/s (37.3MB/s), 35.6MiB/s-35.6MiB/s (37.3MB/s-37.3MB/s), io=2134MiB (2237MB), run=60001-60001msec

Here on the rbd (fio rbd engine):

- rbd read : READ: bw=580MiB/s (608MB/s), 580MiB/s-580MiB/s (608MB/s-608MB/s), io=10.0GiB (10.7GB), run=17668-17668msec (I want this perf ! :) )
- rbd write : WRITE: bw=90.9MiB/s (95.3MB/s), 90.9MiB/s-90.9MiB/s (95.3MB/s-95.3MB/s), io=5764MiB (6044MB), run=63404-63404msec (I want this perf ! :) )

Here on the mapped rbd:

- libaio  randwrite on mapped rbd : WRITE: bw=217KiB/s (223kB/s), 217KiB/s-217KiB/s (223kB/s-223kB/s), io=12.7MiB (13.4MB), run=60006-60006msec
- libaio  randread on mapped rbd : READ: bw=589KiB/s (603kB/s), 589KiB/s-589KiB/s (603kB/s-603kB/s), io=34.5MiB (36.2MB), run=60005-60005msec

Here on the mounted fs:

rbd map bench  --pool kube --name client.admin
/sbin/mkfs.ext4  /dev/rbd/kube/bench
mount /dev/rbd/kube/bench /mnt/
cd /mnt/
dd if=/dev/zero of=test bs=8192k count=100 oflag=direct
838860800 bytes (839 MB, 800 MiB) copied, 24.5338 s, 34.2 MB/s

Raw NVMe performance does not look very good, but the raw performance of the rbd is great.
Once I map it, the performance drops, whether I use dd on the filesystem or fio on the device.

What I don't understand is why the difference between the raw rbd and the mapped rbd is so large.
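
One way to put the figures above on a common footing is to convert each bandwidth into IOPS (bandwidth divided by block size). This is only a rough sketch: the block sizes for the NVMe and rbd-engine runs come from the fio command lines in this mail, but the 4k block size for the mapped-rbd runs is an assumption (those command lines are not shown), and the helper name `iops` is made up for illustration.

```python
KIB = 1024
MIB = 1024 * KIB


def iops(bandwidth_bytes_per_s, block_size_bytes):
    """Return I/O operations per second for a given bandwidth and block size."""
    return bandwidth_bytes_per_s / block_size_bytes


# (label, bandwidth in bytes/s, block size in bytes)
results = [
    ("nvme randwrite 4k, iodepth=1", 12.1 * MIB, 4 * KIB),
    ("nvme randread 4k, iodepth=1", 35.6 * MIB, 4 * KIB),
    ("rbd engine read 4M, iodepth=32", 580 * MIB, 4 * MIB),
    ("rbd engine write 4M, iodepth=32", 90.9 * MIB, 4 * MIB),
    # Block size below is an ASSUMPTION: the mapped-rbd fio commands are not shown.
    ("mapped rbd randwrite (4k assumed)", 217 * KIB, 4 * KIB),
    ("mapped rbd randread (4k assumed)", 589 * KIB, 4 * KIB),
]

for name, bw, bs in results:
    print(f"{name:36s} {iops(bw, bs):8.1f} IOPS")
```

If the 4k assumption holds, the mapped-rbd runs and the rbd-engine runs are not measuring the same thing: one is small random I/O, the other is large sequential I/O with 32 requests in flight, so their MiB/s figures cannot be compared directly.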

The NVMe disk appears to be: Sandisk Corp WD Black 2018/PC SN720 NVMe SSD

Rand write:
fio -ioengine=libaio -name=test -bs=4k -iodepth=1 -direct=1 -fsync=1 -rw=randwrite -runtime=60 -filename=/dev/nvme1n1p4
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=12.3MiB/s][w=3158 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=177938: Fri Aug 16 17:18:18 2019
  write: IOPS=3105, BW=12.1MiB/s (12.7MB/s)(728MiB/60001msec); 0 zone resets
    slat (nsec): min=1545, max=141856, avg=6784.48, stdev=6188.10
    clat (nsec): min=688, max=3539.6k, avg=14665.22, stdev=13601.60
     lat (usec): min=9, max=3549, avg=21.66, stdev=16.07
    clat percentiles (usec):
     |  1.00th=[    8],  5.00th=[    8], 10.00th=[    9], 20.00th=[    9],
     | 30.00th=[   10], 40.00th=[   10], 50.00th=[   12], 60.00th=[   14],
     | 70.00th=[   16], 80.00th=[   19], 90.00th=[   26], 95.00th=[   36],
     | 99.00th=[   49], 99.50th=[   52], 99.90th=[   76], 99.95th=[   85],
     | 99.99th=[  135]
   bw (  KiB/s): min=10504, max=13232, per=100.00%, avg=12420.01, stdev=439.86, samples=119
   iops        : min= 2626, max= 3308, avg=3105.00, stdev=109.96, samples=119
  lat (nsec)   : 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.02%, 4=0.09%, 10=40.60%, 20=43.62%, 50=14.92%
  lat (usec)   : 100=0.73%, 250=0.02%, 500=0.01%, 750=0.01%
  lat (msec)   : 4=0.01%
  fsync/fdatasync/sync_file_range:
    sync (nsec): min=12, max=22490, avg=185.20, stdev=255.75
    sync percentiles (nsec):
     |  1.00th=[   28],  5.00th=[   35], 10.00th=[   42], 20.00th=[   59],
     | 30.00th=[   82], 40.00th=[  109], 50.00th=[  137], 60.00th=[  163],
     | 70.00th=[  197], 80.00th=[  253], 90.00th=[  390], 95.00th=[  572],
     | 99.00th=[  804], 99.50th=[  820], 99.90th=[ 1144], 99.95th=[ 1208],
     | 99.99th=[15552]
  cpu          : usr=2.42%, sys=4.91%, ctx=546845, majf=0, minf=12
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,186313,0,186313 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=12.1MiB/s (12.7MB/s), 12.1MiB/s-12.1MiB/s (12.7MB/s-12.7MB/s), io=728MiB (763MB), run=60001-60001msec

Disk stats (read/write):
  nvme1n1: ios=0/375662, merge=0/1492, ticks=0/57478, in_queue=59100, util=97.85%


RBD read

~# fio -ioengine=rbd -name=test -bs=4M -iodepth=32 -rw=read -runtime=60  -pool=kube -rbdname=bench
test: (g=0): rw=read, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=rbd, iodepth=32
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [R(1)][69.2%][r=1313MiB/s][r=328 IOPS][eta 00m:08s]
test: (groupid=0, jobs=1): err= 0: pid=184386: Fri Aug 16 17:20:45 2019
  read: IOPS=144, BW=580MiB/s (608MB/s)(10.0GiB/17668msec)
    slat (nsec): min=371, max=72009, avg=5416.87, stdev=3649.19
    clat (msec): min=6, max=1438, avg=220.83, stdev=192.27
     lat (msec): min=6, max=1438, avg=220.83, stdev=192.27
    clat percentiles (msec):
     |  1.00th=[   19],  5.00th=[   23], 10.00th=[   25], 20.00th=[   30],
     | 30.00th=[   33], 40.00th=[   47], 50.00th=[  249], 60.00th=[  288],
     | 70.00th=[  326], 80.00th=[  372], 90.00th=[  447], 95.00th=[  535],
     | 99.00th=[  802], 99.50th=[  844], 99.90th=[  936], 99.95th=[ 1011],
     | 99.99th=[ 1435]
   bw (  KiB/s): min=253952, max=4628480, per=93.74%, avg=556353.83, stdev=851591.15, samples=35
   iops        : min=   62, max= 1130, avg=135.83, stdev=207.91, samples=35
  lat (msec)   : 10=0.08%, 20=2.03%, 50=38.40%, 100=2.15%, 250=7.62%
  lat (msec)   : 500=43.48%, 750=4.73%, 1000=1.45%
  cpu          : usr=0.19%, sys=0.24%, ctx=2563, majf=0, minf=7
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.3%, 16=0.6%, 32=98.8%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=2560,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=580MiB/s (608MB/s), 580MiB/s-580MiB/s (608MB/s-608MB/s), io=10.0GiB (10.7GB), run=17668-17668msec

Disk stats (read/write):
    md2: ios=1/1405, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=1332/3191, aggrmerge=5911/5187, aggrticks=1031/1257, aggrin_queue=15480, aggrutil=91.95%
  nvme1n1: ios=0/1183, merge=0/450, ticks=0/141, in_queue=12452, util=69.94%
  nvme0n1: ios=2665/5199, merge=11822/9925, ticks=2062/2373, in_queue=18508, util=91.95%

RBD write

fio -ioengine=rbd -name=test -bs=4M -iodepth=32 -rw=write -runtime=60  -pool=kube -rbdname=bench
test: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=rbd, iodepth=32
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [W(1)][55.3%][eta 00m:51s]                        
test: (groupid=0, jobs=1): err= 0: pid=181899: Fri Aug 16 17:20:19 2019
  write: IOPS=22, BW=90.9MiB/s (95.3MB/s)(5764MiB/63404msec); 0 zone resets
    slat (usec): min=366, max=4889, avg=802.71, stdev=414.92
    clat (msec): min=189, max=7308, avg=1404.62, stdev=1278.48
     lat (msec): min=189, max=7308, avg=1405.43, stdev=1278.47
    clat percentiles (msec):
     |  1.00th=[  292],  5.00th=[  376], 10.00th=[  435], 20.00th=[  550],
     | 30.00th=[  676], 40.00th=[  776], 50.00th=[  877], 60.00th=[ 1083],
     | 70.00th=[ 1351], 80.00th=[ 1921], 90.00th=[ 3473], 95.00th=[ 4597],
     | 99.00th=[ 5470], 99.50th=[ 6007], 99.90th=[ 6745], 99.95th=[ 7282],
     | 99.99th=[ 7282]
   bw (  KiB/s): min= 8192, max=155648, per=100.00%, avg=95454.68, stdev=30757.07, samples=121
   iops        : min=    2, max=   38, avg=23.27, stdev= 7.52, samples=121
  lat (msec)   : 250=0.07%, 500=15.82%, 750=22.21%, 1000=18.74%
  cpu          : usr=1.71%, sys=0.15%, ctx=597, majf=0, minf=45137
  IO depths    : 1=0.1%, 2=0.1%, 4=0.3%, 8=0.6%, 16=1.1%, 32=97.8%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1441,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=90.9MiB/s (95.3MB/s), 90.9MiB/s-90.9MiB/s (95.3MB/s-95.3MB/s), io=5764MiB (6044MB), run=63404-63404msec

Disk stats (read/write):
    md2: ios=0/5119, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=30/16383, aggrmerge=20/28021, aggrticks=5/18949, aggrin_queue=63466, aggrutil=90.70%
  nvme1n1: ios=0/4230, merge=0/1595, ticks=0/1835, in_queue=42440, util=65.48%
  nvme0n1: ios=60/28536, merge=41/54447, ticks=10/36063, in_queue=84492, util=90.70%

Rand read:
fio -ioengine=libaio -name=test -bs=4k -iodepth=1 -direct=1 -fsync=1 -rw=randread -runtime=60 -filename=/dev/nvme1n1p4
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=36.3MiB/s][r=9286 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=208060: Fri Aug 16 17:33:01 2019
  read: IOPS=9103, BW=35.6MiB/s (37.3MB/s)(2134MiB/60001msec)
    slat (nsec): min=1384, max=244751, avg=6077.86, stdev=5529.26
    clat (usec): min=3, max=8311, avg=101.98, stdev=42.25
     lat (usec): min=34, max=8341, avg=108.28, stdev=43.81
    clat percentiles (usec):
     |  1.00th=[   73],  5.00th=[   83], 10.00th=[   85], 20.00th=[   90],
     | 30.00th=[   92], 40.00th=[   94], 50.00th=[   96], 60.00th=[  100],
     | 70.00th=[  104], 80.00th=[  112], 90.00th=[  121], 95.00th=[  141],
     | 99.00th=[  182], 99.50th=[  198], 99.90th=[  253], 99.95th=[  297],
     | 99.99th=[ 2147]
   bw (  KiB/s): min=33088, max=40928, per=99.96%, avg=36400.00, stdev=1499.29, samples=119
   iops        : min= 8272, max=10232, avg=9099.98, stdev=374.82, samples=119
  lat (usec)   : 4=0.01%, 10=0.01%, 50=0.53%, 100=59.74%, 250=39.62%
  lat (usec)   : 500=0.08%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%
  cpu          : usr=4.80%, sys=8.33%, ctx=546238, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=546228,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=35.6MiB/s (37.3MB/s), 35.6MiB/s-35.6MiB/s (37.3MB/s-37.3MB/s), io=2134MiB (2237MB), run=60001-60001msec

Disk stats (read/write):
  nvme1n1: ios=545111/3564, merge=0/1430, ticks=54388/943, in_queue=60924, util=100.00%
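
As a sanity check on the fio outputs above, the reported IOPS can be compared with what the queue depth and average latency would predict (IOPS is roughly iodepth divided by average latency). This is only a sketch using the `lat` averages copied from the outputs above; the function name `expected_iops` is made up for illustration. The randwrite run is left out because each write there is followed by an fsync, which fio reports separately from the `lat` line.

```python
def expected_iops(iodepth, avg_latency_s):
    """Estimate IOPS from requests in flight and average per-request latency."""
    return iodepth / avg_latency_s


# (label, iodepth, average total latency in seconds, IOPS reported by fio)
runs = [
    ("nvme randread 4k", 1, 108.28e-6, 9103),
    ("rbd engine read 4M", 32, 220.83e-3, 144),
    ("rbd engine write 4M", 32, 1405.43e-3, 22),
]

for name, depth, latency, reported in runs:
    estimate = expected_iops(depth, latency)
    print(f"{name:22s} estimated {estimate:8.1f} IOPS, fio reported {reported}")
```

The estimates land close to the reported numbers, which suggests these three runs are latency-bound at their configured queue depth.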

On Friday, 16 August 2019 at 18:06 +0300, vitalif@yourcmc.ru wrote:
Now, to compare "apples to apples", either run

fio -ioengine=libaio -name=test -bs=4k -iodepth=1 -direct=1 -fsync=1 -rw=randwrite -runtime=60 -filename=/dev/nvmeXXXXXXXXX

to compare with the single-threaded RBD random write result (the test is 
destructive, so use a separate partition without data)

...Or run

fio -ioengine=rbd -name=test -bs=4M -iodepth=32 -rw=write -runtime=60 -pool=kube -rbdname=bench

to compare with your dd's linear write result.

58 single-threaded random iops for NVMes is pretty sad either way. Are 
your NVMes server ones? Do they have capacitors? :) In response to the 
likely question about what those are, I'll just post my link here again :) 
https://yourcmc.ru/wiki/Ceph_performance