Hi Cephers,
I have a number of questions after reading up on bluestore,
min_alloc_size, and the impact of writing small files through CephFS
with erasure coding.
In a setup using VMware ESX with its VMFS6 and its 1 MB block size on an
iSCSI LUN mapped from a (replicated) Ceph RBD image:
1. Wouldn't it be better to reduce the RBD object size to 1 MB as well?
2. When a file smaller than 4 KB is written to a filesystem inside a
virtual machine (and passes through all the layers below it), will the
consumed space be 4 KB? In other words, does the 4 MB object size of RBD
combine many small files into one big object?
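If reducing the object size does turn out to help, it can be set per image at creation time. A minimal sketch, assuming a pool named rbd and an image named vmware-lun (both placeholder names):

```shell
# Create a 1 TB image whose objects are 1 MB instead of the default 4 MB.
# --object-size must be a power of two between 4 KB and 32 MB.
rbd create rbd/vmware-lun --size 1T --object-size 1M

# Verify the object size of the new image.
rbd info rbd/vmware-lun
```

The object size is fixed at creation, so an existing LUN would have to be copied into a new image to change it.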
When using CephFS and erasure coding:
1. I assume using a 4 KB min_alloc_size_hdd would reduce wasted space
but increase fragmentation, as Igor wrote.
2. What is the official way to deal with fragmentation in bluestore? Is
there a defrag tool available or planned?
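For reference, this is the setting I mean; a minimal ceph.conf fragment (note this value is baked into an OSD when it is created, so it only takes effect on freshly deployed OSDs):

```ini
[osd]
# Only applied at OSD creation time; existing OSDs keep the value
# they were built with and must be redeployed to change it.
bluestore_min_alloc_size_hdd = 4096
```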
From a performance perspective: my cluster runs on good old filestore
with NVMe journals, and I am about to migrate to bluestore.
1. With a MaxIOSize of 512 KB in VMware, wouldn't
bluestore_prefer_deferred_size_hdd = 524288 give me filestore-like
behavior? My aim is to keep write latency at filestore levels, because
we run a lot of databases.
2. Are there any tradeoffs doing this?
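For clarity, this is the change I have in mind; a sketch of a ceph.conf fragment, assuming the bluestore WAL/DB would sit on the NVMe devices that currently hold the filestore journals:

```ini
[osd]
# Writes at or below this size are first committed to the (NVMe-backed)
# WAL and acknowledged, then flushed to the HDD later, similar to
# filestore journaling. Larger values mean more data is written twice.
bluestore_prefer_deferred_size_hdd = 524288
```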
Regards,
Dennis