Parallel I/O, by the way, is also awful. I can only reach 36000 write iops in a 14-NVMe
cluster with size=2 and 1 OSD per NVMe. That's only ~2500 iops per drive... OK, even if I
take the 2x replication and ~5x write amplification into account, it's still only ~25000
iops per drive, while the drives themselves can push over 300000 iops :-(.
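A back-of-the-envelope sketch of that arithmetic (the 2x factor is the size=2 replication; the ~5x write amplification is a rough assumption for Bluestore's internal WAL/metadata overhead, not a measured figure):

```python
# Rough per-drive write iops estimate for the cluster described above.
cluster_write_iops = 36_000
num_drives = 14
replication = 2   # size=2: each client write lands on 2 OSDs
write_amp = 5     # assumed ~5x Bluestore internal write amplification

# Client-visible iops divided naively across drives:
raw_per_drive = cluster_write_iops / num_drives

# Actual physical writes per drive after replication and amplification:
effective_per_drive = cluster_write_iops * replication * write_amp / num_drives

print(f"client iops per drive:    {raw_per_drive:.0f}")        # ~2571
print(f"effective iops per drive: {effective_per_drive:.0f}")  # ~25714
```

Even the amplified figure is an order of magnitude below what a single NVMe drive can sustain.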
Thankfully, performance isn't too critical in this environment. But it is very
low :).
Regards, Vitaliy
> Hi Wido
>
> No. My results with Ceph (yes, I still use it) are the same, and I use Threadrippers,
> which have almost 4 GHz clock speed.
>
> The network isn't the main problem. The main problem is a lot of program logic
> written in a complex way, which leads to high CPU usage. See
> https://yourcmc.ru/wiki/Ceph_performance if you haven't already.
>
> I achieve ~7000 QD=1 iops with Vitastor just because it's much simpler. And I'm
> gradually progressing feature-wise... :-)
>
> Regards, Vitaliy
>
>> (Sending it to dev list as people might know it there)
>>
>> Hi,
>>
>> There are many talks and presentations out there about Ceph's
>> performance. Ceph is great when it comes to parallel I/O, large queue
>> depths and many applications sending I/O towards Ceph.
>>
>> One thing where Ceph isn't the fastest are 4k blocks written at Queue
>> Depth 1.
>>
>> Some applications benefit very much from high performance/low latency
>> I/O at qd=1, for example Single Threaded applications which are writing
>> small files inside a VM running on RBD.
>>
>> With some tuning you can get to ~700us latency for a 4k write with
>> qd=1 (replication, size=3).
>>
>> I benchmark this using fio:
>>
>> $ fio --ioengine=librbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..
>>
>> 700us latency means the result will be about ~1400 IOps (1000 / 0.7 ≈ 1430)
>>
>> Compared to, let's say, a BSD machine running ZFS, that's on the low side.
>> With ZFS+NVMe you'll be able to reach somewhere between 7,000 and 10,000
>> IOps; the latency is simply much lower.
>>
>> My benchmarking / test setup for this:
>>
>> - Ceph Nautilus/Octopus (doesn't make a big difference)
>> - 3x SuperMicro 1U with:
>> - AMD Epyc 7302P 16-core CPU
>> - 128GB DDR4
>> - 10x Samsung PM983 3.84TB
>> - 10Gbit Base-T networking
>>
>> Things to configure/tune:
>>
>> - C-State pinning to 1
>> - CPU governor set to performance
>> - Turn off all logging in Ceph (debug_osd, debug_ms, debug_bluestore=0)
>>
>> Higher clock speeds (New AMD Epyc coming in March!) help to reduce the
>> latency and going towards 25Gbit/100Gbit might help as well.
>>
>> These are, however, only very small increments and might reduce the
>> latency by another 15% or so.
>>
>> It doesn't bring us anywhere near the 10k IOps other applications can do.
>>
>> And I totally understand that replication over a TCP/IP network takes
>> time and thus increases latency.
>>
>> The Crimson project [0] is aiming to lower the latency with many things
>> like DPDK and SPDK, but this is far from finished and production ready.
>>
>> In the meantime, am I overlooking something here? Can we further reduce
>> the latency of the current OSDs?
>>
>> Reaching a ~500us latency would already be great!
>>
>> Thanks,
>>
>> Wido
>>
>> [0]: https://docs.ceph.com/en/latest/dev/crimson/crimson
>> _______________________________________________
>> Dev mailing list -- dev(a)ceph.io
>> To unsubscribe send an email to dev-leave(a)ceph.io
>