I have no idea how you get 66k write iops with one OSD )
I've just repeated a test by creating a test pool on one NVMe OSD with 8 PGs (all
pinned to the same OSD with pg-upmap). Then I ran 4x fio randwrite q128 over 4 RBD images.
I got 17k iops.
OK, in fact that's not the worst result for Ceph, but the problem is that I only get 30k
write iops when benchmarking 4 RBD images spread over all OSDs _in_the_same_cluster_. And
there are 14 of them.
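
For scale, the two results above can be put on a per-OSD footing. This is only a rough sketch in Python: it assumes the 30k figure is client-visible IOPS and ignores the replication factor of the cluster-wide pool, which isn't stated here.

```python
# Client-visible write IOPS from the two tests described above.
single_osd_iops = 17_000   # test pool pinned to one NVMe OSD
cluster_iops = 30_000      # 4 RBD images spread over all OSDs
osd_count = 14

def per_osd_iops(total_iops: float, osds: int) -> float:
    """Client IOPS divided evenly across OSDs (replication not counted)."""
    return total_iops / osds

per_osd = per_osd_iops(cluster_iops, osd_count)
scaling = cluster_iops / single_osd_iops
print(f"cluster-wide: about {per_osd:.0f} client IOPS per OSD")
print(f"scaling vs. one OSD: {scaling:.2f}x with {osd_count}x the OSDs")
```

In other words, 14 times the OSDs buys well under 2x the client IOPS in this test.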
>> I've just finished doing our own benchmarking, and I can say, you
>> want to do something very unbalanced and CPU bound.
>>
>> 1. Ceph consumes a LOT of CPU. My peak value was around 500% CPU per
>> ceph-osd at top-performance (see the recent thread on 'ceph on brd')
>> with more realistic numbers around 300-400% CPU per device.
>>> In fact in isolation on the test setup that Intel donated for
>>> community ceph R&D we've pushed a single OSD to consume around 1400%
>>> CPU at 80K write IOPS! :) I agree though, we typically see a peak of
>>> about 500-600% CPU per OSD on multi-node clusters with a
>>> correspondingly lower write throughput. I do believe that in some
>>> cases the mix of IO we are doing is causing us to at least be
>>> partially bound by disk write latency with the single writer thread
>>> in the rocksdb WAL though.
>>
>> I'd really like to see how they did this without offloading (their
>> configuration).
>
> I went back and looked over some of the old results. I didn't find the
> really high test scores (and now that I'm thinking about it they may
> have been from when I was ripping out pglog OMAP updates!), but here's
> one example I did find from earlier testing last winter that at least
> got into roughly the right ballpark with stock master from last December
> (~66K IOPS):
>
> Avg 4K FIO randwrite IOPS: 65841.7
>
> - 1 p4510 NVMe backed OSD
>
> - 8GB osd memory target
>
> - 4K min alloc size
>
> - 4 clients, 1 128GB RBD volume per client, io_depth=128, time=300s
>
> - 128 PGs (fixed)
>
> - latency-network tuned profile
>
> - bluestore_rocksdb_options =
>   "compression=kNoCompression,max_total_wal_size=1073741824,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=4,write_buffer_size=67108864,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,compaction_style=kCompactionStyleUniversal"
>
> - bluestore_default_buffered_write = true
>
> - bluestore_default_buffered_read = true
>
> - rbd cache = false
>
> Beyond that general stuff like background scrubbing and pg autoscaling
> was disabled. I should note that these results are using universal
> compaction in rocksdb which you probably don't want to do in production
> because it can require 2x the total DB space to perform a compaction.
> It might actually be feasible now that we are doing column family
> sharding thanks to Adam's PR because you will only need 2x the space of
> any individual column family for compaction rather than the whole DB,
> but it's still unsupported for now.
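
The space argument above can be made concrete with a small sketch. This is a simplified model, not how RocksDB actually accounts for space: it assumes a worst-case universal compaction rewrites everything in its scope once and that compactions of different column families don't overlap. The column family names and sizes are hypothetical, for illustration only.

```python
def universal_compaction_headroom(cf_sizes_gb: dict[str, float], sharded: bool) -> float:
    """Extra free space (GB) a worst-case universal compaction may need.

    Unsharded: the whole DB may be rewritten at once, so headroom is
    roughly the total DB size (2x total space overall).
    Sharded: only one column family is rewritten at a time, so headroom
    is roughly the size of the largest column family.
    """
    if sharded:
        return max(cf_sizes_gb.values())
    return sum(cf_sizes_gb.values())

# Hypothetical column family names and sizes, not taken from a real OSD.
example = {"cf_a": 20.0, "cf_b": 12.0, "cf_c": 4.0, "cf_d": 4.0}
print(universal_compaction_headroom(example, sharded=False))
print(universal_compaction_headroom(example, sharded=True))
```

With these made-up sizes the unsharded case needs headroom equal to the whole 40 GB DB, while the sharded case needs only the 20 GB of the largest column family.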
>
> Mark
>
>>>
>>
>> 2. Ceph is unable to deliver more than 12k IOPS per ceph-osd (maybe
>> a little more with a top-tier low-core high-frequency CPU, but not
>> much). So, a super-duper NVMe won't make a difference. (btw, I have a
>> stupid idea to try to run two ceph-osd from the same LV with a
>> single PV underneath the VG, but it's not tested).
>>> I'm curious if you've tried octopus+ yet? We refactored bluestore's
>>> caches which internally has proven to help quite a bit with latency
>>> bound workloads as it reduces lock contention in onode cache shards
>>> and the impact of cache trimming (no more single trim thread
>>> constantly grabbing the lock for long periods of time!). In a 64
>>> NVMe drive setup (P4510s), we were able to do a little north of 400K
>>> write IOPS with 3x replication, so about 19K IOPs per OSD once you
>>> factor rep in. Also, in Nautilus you can see real benefits with
>>> running multiple OSDs on a single device but with Octopus and master
>>> we've pretty much closed the gap on our test setup:
>>
>> It's octopus. I was doing single-osd benchmark, removing all movable
>> parts (brd instead of nvme, no network, size=1, etc). Moreover, I've
>> focused on rados benchmark, as RBD is just a derivative from rados
>> performance.
>>
>> Anyway, big thank you for input.
>>
>>> https://docs.google.com/spreadsheets/d/1e5eTeHdZnSizoY6AUjH0knb4jTCW7KMU4Ro…
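
The "about 19K IOPS per OSD" figure quoted above follows from simple arithmetic. A minimal sketch, assuming one OSD per NVMe drive (so 64 OSDs) and taking "a little north of 400K" as exactly 400K:

```python
def backend_iops_per_osd(client_iops: float, replicas: int, osds: int) -> float:
    """Each client write lands on `replicas` OSDs, so backend writes
    are client_iops * replicas, spread over all OSDs."""
    return client_iops * replicas / osds

per_osd = backend_iops_per_osd(400_000, replicas=3, osds=64)
print(f"{per_osd:.0f} backend write IOPS per OSD")
```

That gives 18750, which rounds to the roughly 19K mentioned once the client figure is slightly above 400K.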
>>>
>>> Generally speaking using the latency-performance or latency-network
>>> tuned profiles helps (mostly by avoiding CPU C-state transitions) as
>>> do higher clock speeds. Not using replication helps but that's
>>> obviously not a realistic solution for most people. :)
>>
>> I used size=1 and 'no ssd, no network' as an upper bound. It allows me
>> to find the limits of ceph-osd performance. Any real-life things
>> (replication, network, real block devices) will make things worse, not
>> better. Knowing the upper performance bound is really nice when you
>> start to choose a server configuration.
>>
>>>
>>
>> 3. You will find that any given client's performance is heavily limited
>> by the sum of all RTTs in the network, plus Ceph's own latencies, so
>> very fast NVMe gives diminishing returns.
>> 4. A CPU-bound ceph-osd completely wipes out any differences between
>> underlying devices (except for desktop-class crawlers).
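
The link between latency and client IOPS in point 3 can be checked against the numbers in this thread with Little's law (in-flight requests = throughput x latency). A rough sketch, assuming the fio queues stayed full for the whole run, so 4 clients at queue depth 128 means 512 requests in flight:

```python
def mean_latency_ms(in_flight: int, iops: float) -> float:
    """Little's law: in_flight = iops * latency, so latency = in_flight / iops."""
    return in_flight / iops * 1000.0

in_flight = 4 * 128  # 4 fio clients, queue depth 128 each

results = [
    ("17k IOPS single-OSD test", 17_000),
    ("65841.7 IOPS single-OSD result", 65_841.7),
]
for label, iops in results:
    print(f"{label}: about {mean_latency_ms(in_flight, iops):.1f} ms per write")
```

Under that assumption the 17k result corresponds to about 30 ms average per write and the 66k result to about 8 ms, far more than the device itself needs.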
>>
>> You can run your own tests, even without fancy 48-nvme boxes - just
>> run ceph-osd on brd (block ram disk). ceph-osd won't run any faster
>> on anything else (ramdisk is the fastest), so the numbers you get from
>> brd are the supremum (upper bound) for theoretical performance.
>>
>> Given a max of 400-500% CPU per ceph-osd I'd say you need to keep the
>> number of NVMe drives per server below 12, or 15 (but then you'll
>> sometimes get CPU saturation).
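
The sizing rule above translates into a core budget like this. A minimal sketch, assuming one ceph-osd per NVMe drive and that 100% CPU in top equals one fully busy core; it leaves nothing for the OS, networking, or other daemons:

```python
def cores_needed(osds: int, cpu_percent_per_osd: float) -> float:
    """Cores consumed by the ceph-osd processes alone at peak load."""
    return osds * cpu_percent_per_osd / 100.0

for osds in (12, 15):
    low = cores_needed(osds, 400)
    high = cores_needed(osds, 500)
    print(f"{osds} OSDs: {low:.0f} to {high:.0f} cores at peak")
```

So 12 drives already call for 48 to 60 cores at peak, and 15 drives for 60 to 75.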
>>
>> In my opinion less fancy boxes with a smaller number of drives per
>> server (but a larger number of servers) would make your (or your
>> operations team's) life much less stressful.
>>> That's pretty much the advice I've been giving people since the
>>> Inktank days. It costs more and is lower density, but the design is
>>> simpler, you are less likely to under provision CPU, less likely to
>>> run into memory bandwidth bottlenecks, and you have less recovery to
>>> do when a node fails. Especially now with how many NVMe drives you
>>> can fit in a single 1U server!
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io
>> To unsubscribe send an email to ceph-users-leave(a)ceph.io