Hi Maged,
Actually, I expect no difference between the various EC profiles in this
behavior.
I've just verified EC 4+2 (pool 'ec42') against the master branch:
Initial df report:
POOL  ID  STORED  (DATA)  (OMAP)  OBJECTS  USED     (DATA)   (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
ec42   3  16 KiB  16 KiB     0 B        1  384 KiB  384 KiB     0 B      0    392 GiB            N/A          N/A      1         0 B          0 B
Executed commands:
dd if=./tmp of=/dev/rbd1 count=64 bs=4096 seek=0    # 256K at offset 0
dd if=./tmp of=/dev/rbd1 count=60 bs=4096 seek=4    # 240K at offset 16K
dd if=./tmp of=/dev/rbd1 count=56 bs=4096 seek=8    # 224K at offset 32K
Final df report:
POOL  ID  STORED   (DATA)   (OMAP)  OBJECTS  USED     (DATA)   (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
ec42   3  272 KiB  272 KiB     0 B        3  1.5 MiB  1.5 MiB     0 B      0    392 GiB            N/A          N/A      3         0 B          0 B
So there is the same significant overhead ((1.5 MiB - 384 KiB) / 272 KiB
≈ 4.2x) as in my original report.
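
For clarity, here is the arithmetic behind that figure, as a quick Python
check on the df numbers above:

KiB = 1024
stored = 272 * KiB                                 # final STORED
alloc_before, alloc_after = 384 * KiB, 1536 * KiB  # USED: 384 KiB -> 1.5 MiB
print((alloc_after - alloc_before) / stored)       # -> ~4.24, i.e. ~4.2x
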
Thanks,
Igor
On 1/23/2020 10:34 AM, Maged Mokhtar wrote:
>
> On 23/01/2020 01:20, Igor Fedotov wrote:
>> Hi All!
>>
>> Preface:
>>
>> Recently we got a customer report about a discrepancy between the USED
>> and RAW USED columns in the df report for a specific pool.
>>
>> Approximately 100% more volume was reported for RAW USED. The pool in
>> question uses EC 6+3 and keeps RBD images.
>>
>> Other relevant cluster/OSD information: Luminous v12.2.12, BlueStore,
>> HDDs as main devices, EC stripe width = 24K.
>>
>> Preliminary investigation showed a significant difference between the
>> bluestore_stored and bluestore_allocated performance counters at all
>> involved OSDs, with much the same ~100% increase ratio. According to
>> the perf counters, the majority of writes are 'small' ones.
>>
>> Hence allocation overhead caused by small/fragmented writes has been
>> identified as an intermediate cause.
>>
>>
>> But why does it happen?
>>
>> Now I'd like to share a deeper analysis of what happens to objects in
>> BlueStore when RBD performs writes to the above-mentioned EC pool.
>>
>> Let me narrow the scope to a single 64K RBD data object at a single
>> BlueStore instance, which holds one EC shard of a 384K RBD data span
>> (16 * 24K image-side, i.e. 16 * 4K shard-side).
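>>
>> For reference, here is a rough sketch of the striping arithmetic (my
>> own Python illustration, not Ceph code), assuming k=6 and the 24K
>> stripe width from above:
>>
>> K = 6                            # data chunks in the EC 6+3 profile
>> STRIPE_WIDTH = 24 * 1024         # user bytes per stripe
>> STRIPE_UNIT = STRIPE_WIDTH // K  # 4K landing on each data shard
>>
>> def shard_offset(logical_off):   # stripe-aligned offsets only
>>     return (logical_off // STRIPE_WIDTH) * STRIPE_UNIT
>>
>> assert shard_offset(384 * 1024) == 64 * 1024  # 384K span -> 64K shard
>> assert shard_offset(24 * 1024) == 0x1000      # 24K offset -> 0x1000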
>>
>> Initially let's do a 384K write to RBD using 'dd if=./tmp
>> of=/dev/rbd1 count=96 bs=4096 seek=0'
>>
>> At BlueStore this dd write results in a single 64K write (append),
>> which lands as a single blob containing a single 64K pextent
>> (allocation).
>>
>> Then let's do a second 360K write to the RBD image at a 24K offset,
>> which in fact results in a 0x1000~f000 write to the same object. In an
>> ideal world this would reuse the existing blob, the data would be
>> merged (via some BlueStore magic), and a single 64K pextent would
>> suffice again. But in reality this doesn't happen: both a new blob and
>> a new 64K allocation are made.
>>
>> As a result one has 64K stored in BlueStore and 128K allocated.
>>
>> A third write at a 48K offset with 336K of data results in a third
>> blob/allocation and 192K of allocated space for the same 64K of stored
>> data.
>>
>> The same behavior persists while the target dd offset stays below
>> 384K, resulting in up to 16x space overhead.
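>>
>> A toy model of this sequence (my own sketch, under the assumption that
>> every partial overwrite of the 64K shard object lands in a brand-new
>> 64K blob, as observed above):
>>
>> MIN_ALLOC = 64 * 1024   # bluestore_min_alloc_size_hdd on this cluster
>> SHARD = 64 * 1024       # one EC shard of the 384K span
>> writes = [(i * 4096, SHARD - i * 4096) for i in range(16)]  # dd steps
>> allocated = MIN_ALLOC * len(writes)       # one new blob per write
>> print(allocated // 1024, "KiB allocated vs", SHARD // 1024, "KiB stored")
>> # -> 1024 KiB allocated vs 64 KiB stored, i.e. the 16x worst case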
>>
>> Here is a log snippet for the handling of one of the intermediate
>> write requests in this sequence:
>>
>> ==============================================
>> 2020-01-22 23:41:57.808471 7f6190c55700 1 -- 10.100.2.124:6802/38298
>> <== osd.6 10.100.2.124:6826/39718 125 ==== MOSDECSubOpRead(3.24s1
>> 32/26 ECSubRead(tid=30,
>> to_read={3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head=36864,4096,0},
>> attrs_to_read=)) v3 ==== 192+0+0 (1980749420 0 0) 0x55db8187af00 con
>> 0x55db817c7000
>> 2020-01-22 23:41:57.808669 7f6179f6a700 15
>> bluestore(/home/if/luminous/build/dev/osd0) read 3.24s1_head
>> 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
>> 0x9000~1000
>> 2020-01-22 23:41:57.808707 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_read 0x9000~1000 size
>> 0x10000 (65536)
>> 2020-01-22 23:41:57.808712 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_read defaulting to
>> buffered read
>> 2020-01-22 23:41:57.808723 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_read blob
>> Blob(0x55db82051110 blob([0x540000~10000] csum+has_unused
>> crc32c/0x1000 unused=0xff) use_tracker(0x10000 0x8000)
>> SharedBlob(0x55db82050fc0 sbid 0x0)) need 0x9000~1000 cache has
>> 0x[9000~1000]
>> 2020-01-22 23:41:57.808745 7f6179f6a700 10
>> bluestore(/home/if/luminous/build/dev/osd0) read 3.24s1_head
>> 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
>> 0x9000~1000 = 4096
>> 2020-01-22 23:41:57.808766 7f6179f6a700 1 -- 10.100.2.124:6802/38298
>> --> 10.100.2.124:6826/39718 -- MOSDECSubOpReadReply(3.24s0 32/26
>> ECSubReadReply(tid=30, attrs_read=0)) v2 -- 0x55db821e8580 con 0
>> 2020-01-22 23:41:57.810251 7f6190c55700 1 -- 10.100.2.124:6802/38298
>> <== osd.6 10.100.2.124:6826/39718 126 ==== MOSDECSubOpWrite(3.24s1
>> 32/26 ECSubWrite(tid=29, reqid=client.4215.0:113, at_version=32'11,
>> trim_to=0'0, roll_forward_to=32'10)) v2 ==== 6671+0+0 (3945069128 0
>> 0) 0x55db81d45800 con 0x55db817c7000
>> 2020-01-22 23:41:57.810524 7f6179f6a700 10
>> bluestore(/home/if/luminous/build/dev/osd0) queue_transactions
>> existing 0x55db81bc7dc0 osr(3.24s1 0x55db81b30800)
>> 2020-01-22 23:41:57.810540 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _txc_create osr
>> 0x55db81bc7dc0 = 0x55db821dd200 seq 28
>> 2020-01-22 23:41:57.810563 7f6179f6a700 15
>> bluestore(/home/if/luminous/build/dev/osd0) _setattrs 3.24s1_head
>> 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 2 keys
>> 2020-01-22 23:41:57.810583 7f6179f6a700 10
>> bluestore(/home/if/luminous/build/dev/osd0) _setattrs 3.24s1_head
>> 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 2 keys = 0
>> 2020-01-22 23:41:57.810591 7f6179f6a700 15
>> bluestore(/home/if/luminous/build/dev/osd0) _set_alloc_hint
>> 3.24s1_head
>> 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
>> object_size 700416 write_size 700416 flags -
>> 2020-01-22 23:41:57.810598 7f6179f6a700 10
>> bluestore(/home/if/luminous/build/dev/osd0) _set_alloc_hint
>> 3.24s1_head
>> 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
>> object_size 700416 write_size 700416 flags - = 0
>> 2020-01-22 23:41:57.810610 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head
>> 0x55db81cc0a00) get_onode oid
>> 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b key
>> 0x81800000000000000325900c85217262'd_data.4.10716b8b4567.0000000000000000!='0xfffffffffffffffe000000000000000b'o'
>> 2020-01-22 23:41:57.810678 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head
>> 0x55db81cc0a00) r -2 v.len 0
>> 2020-01-22 23:41:57.810698 7f6179f6a700 15
>> bluestore(/home/if/luminous/build/dev/osd0) _touch 3.24s1_head
>> 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b
>> 2020-01-22 23:41:57.810704 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _assign_nid 1193
>> 2020-01-22 23:41:57.810706 7f6179f6a700 10
>> bluestore(/home/if/luminous/build/dev/osd0) _touch 3.24s1_head
>> 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b = 0
>> 2020-01-22 23:41:57.810713 7f6179f6a700 15
>> bluestore(/home/if/luminous/build/dev/osd0) _clone_range 3.24s1_head
>> 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# ->
>> 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b from
>> 0x9000~1000 to offset 0x9000
>> 2020-01-22 23:41:57.810721 7f6179f6a700 15
>> bluestore(/home/if/luminous/build/dev/osd0) _do_zero 3.24s1_head
>> 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b
>> 0x9000~1000
>> 2020-01-22 23:41:57.810728 7f6179f6a700 20
>> bluestore.extentmap(0x55db821e8990) dirty_range mark inline shard dirty
>> 2020-01-22 23:41:57.810732 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_zero extending size
>> to 40960
>> 2020-01-22 23:41:57.810734 7f6179f6a700 10
>> bluestore(/home/if/luminous/build/dev/osd0) _do_zero 3.24s1_head
>> 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b
>> 0x9000~1000 = 0
>> 2020-01-22 23:41:57.810740 7f6179f6a700 15
>> bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range
>> 3.24s1_head
>> 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# ->
>> 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b
>> 0x9000~1000 -> 0x9000~1000
>> 2020-01-22 23:41:57.810750 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range src
>> 0x8000~8000: 0x8000~8000 Blob(0x55db82051110 blob([0x540000~10000]
>> csum+has_unused crc32c/0x1000 unused=0xff) use_tracker(0x10000
>> 0x8000) SharedBlob(0x55db82050fc0 sbid 0x0))
>> 2020-01-22 23:41:57.810761 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _assign_blobid 10250
>> 2020-01-22 23:41:57.810764 7f6179f6a700 10
>> bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head
>> 0x55db81cc0a00) make_blob_shared Blob(0x55db82051110
>> blob([0x540000~10000] csum+has_unused crc32c/0x1000 unused=0xff)
>> use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0 sbid 0x0))
>> 2020-01-22 23:41:57.810773 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head
>> 0x55db81cc0a00) make_blob_shared now Blob(0x55db82051110
>> blob([0x540000~10000] csum+has_unused+shared crc32c/0x1000
>> unused=0xff) use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0
>> loaded (sbid 0x280a ref_map(0x540000~10000=1))))
>> 2020-01-22 23:41:57.810787 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range new
>> Blob(0x55db82051030 blob([0x540000~10000] csum+has_unused+shared
>> crc32c/0x1000 unused=0xff) use_tracker(0x0 0x0)
>> SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
>> ref_map(0x540000~10000=2))))
>> 2020-01-22 23:41:57.810795 7f6179f6a700 20
>> bluestore.blob(0x55db82051030) get_ref 0x9000~1000
>> Blob(0x55db82051030 blob([0x540000~10000] csum+has_unused+shared
>> crc32c/0x1000 unused=0xff) use_tracker(0x0 0x0)
>> SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
>> ref_map(0x540000~10000=2))))
>> 2020-01-22 23:41:57.810802 7f6179f6a700 20
>> bluestore.blob(0x55db82051030) get_ref init 0x10000, 10000
>> 2020-01-22 23:41:57.810805 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range dst
>> 0x9000~1000: 0x9000~1000 Blob(0x55db82051030 blob([0x540000~10000]
>> csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x10000
>> 0x1000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
>> ref_map(0x540000~10000=2))))
>> 2020-01-22 23:41:57.810813 7f6179f6a700 20
>> bluestore.extentmap(0x55db81b2b1d0) dirty_range mark inline shard dirty
>> 2020-01-22 23:41:57.810817 7f6179f6a700 20
>> bluestore.extentmap(0x55db821e8990) dirty_range mark inline shard dirty
>> 2020-01-22 23:41:57.810819 7f6179f6a700 10
>> bluestore(/home/if/luminous/build/dev/osd0) _clone_range 3.24s1_head
>> 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# ->
>> 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b from
>> 0x9000~1000 to offset 0x9000 = 0
>> 2020-01-22 23:41:57.810829 7f6179f6a700 15
>> bluestore(/home/if/luminous/build/dev/osd0) _write 3.24s1_head
>> 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
>> 0x9000~1000
>> 2020-01-22 23:41:57.810838 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_write
>> 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
>> 0x9000~1000 - have 0x10000 (65536) bytes fadvise_flags 0x0
>> 2020-01-22 23:41:57.810844 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _choose_write_options
>> prefer csum_order 12 target_blob_size 0x80000 compress=0 buffered=0
>> 2020-01-22 23:41:57.810849 7f6179f6a700 10
>> bluestore(/home/if/luminous/build/dev/osd0) _do_write_small 0x9000~1000
>> 2020-01-22 23:41:57.810853 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_write_small
>> considering Blob(0x55db820511f0 blob([0x530000~10000]
>> csum+has_unused+shared crc32c/0x1000 unused=0x7f) use_tracker(0x10000
>> 0x1000) SharedBlob(0x55db820510a0 loaded (sbid 0x2809
>> ref_map(0x530000~10000=1)))) bstart 0x0
>> 2020-01-22 23:41:57.810861 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_write_small ignoring
>> immutable Blob(0x55db820511f0 blob([0x530000~10000]
>> csum+has_unused+shared crc32c/0x1000 unused=0x7f) use_tracker(0x10000
>> 0x1000) SharedBlob(0x55db820510a0 loaded (sbid 0x2809
>> ref_map(0x530000~10000=1))))
>> 2020-01-22 23:41:57.810868 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_write_small
>> considering Blob(0x55db820512d0 blob([0x520000~10000]
>> csum+has_unused+shared crc32c/0x1000 unused=0x3f) use_tracker(0x10000
>> 0x1000) SharedBlob(0x55db82051180 loaded (sbid 0x2808
>> ref_map(0x520000~10000=1)))) bstart 0x0
>> 2020-01-22 23:41:57.810876 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_write_small
>> considering Blob(0x55db82051110 blob([0x540000~10000]
>> csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x10000
>> 0x8000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
>> ref_map(0x540000~10000=2)))) bstart 0x0
>> 2020-01-22 23:41:57.810883 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_write_small ignoring
>> immutable Blob(0x55db82051110 blob([0x540000~10000]
>> csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x10000
>> 0x8000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
>> ref_map(0x540000~10000=2))))
>> 2020-01-22 23:41:57.810888 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_write_small
>> considering Blob(0x55db82050f50 blob([0x510000~10000]
>> csum+has_unused+shared crc32c/0x1000 unused=0x1f) use_tracker(0x10000
>> 0x1000) SharedBlob(0x55db82051260 loaded (sbid 0x2807
>> ref_map(0x510000~10000=1)))) bstart 0x0
>> 2020-01-22 23:41:57.810894 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_write_small
>> considering Blob(0x55db820513b0 blob([0x500000~10000]
>> csum+has_unused+shared crc32c/0x1000 unused=0xf) use_tracker(0x10000
>> 0x1000) SharedBlob(0x55db82051340 loaded (sbid 0x2806
>> ref_map(0x500000~10000=1)))) bstart 0x0
>> 2020-01-22 23:41:57.810900 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_write_small
>> considering Blob(0x55db81e38850 blob([0x4f0000~10000]
>> csum+has_unused+shared crc32c/0x1000 unused=0x7) use_tracker(0x10000
>> 0x1000) SharedBlob(0x55db81db2690 loaded (sbid 0x2805
>> ref_map(0x4f0000~10000=1)))) bstart 0x0
>> 2020-01-22 23:41:57.810951 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_write_small
>> considering Blob(0x55db81eadc70 blob([0x4e0000~10000]
>> csum+has_unused+shared crc32c/0x1000 unused=0x3) use_tracker(0x10000
>> 0x1000) SharedBlob(0x55db81db3500 loaded (sbid 0x2804
>> ref_map(0x4e0000~10000=1)))) bstart 0x0
>> 2020-01-22 23:41:57.810958 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_write_small
>> considering Blob(0x55db81eadb90 blob([0x4d0000~10000] csum+shared
>> crc32c/0x1000) use_tracker(0x10000 0x2000) SharedBlob(0x55db81eadc00
>> loaded (sbid 0x2803 ref_map(0x4d0000~10000=1)))) bstart 0x0
>> 2020-01-22 23:41:57.810967 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _pad_zeros pad 0x0 + 0x0
>> on front/back, now 0x9000~1000
>> 2020-01-22 23:41:57.810974 7f6179f6a700 20
>> bluestore.blob(0x55db82051110) put_ref 0x9000~1000
>> Blob(0x55db82051110 blob([0x540000~10000] csum+has_unused+shared
>> crc32c/0x1000 unused=0xff) use_tracker(0x10000 0x8000)
>> SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
>> ref_map(0x540000~10000=2))))
>> 2020-01-22 23:41:57.810987 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write txc
>> 0x55db821dd200 1 blobs
>> 2020-01-22 23:41:57.810992 7f6179f6a700 10 stupidalloc
>> 0x0x55db81107c80 allocate_int want_size 0x10000 alloc_unit 0x10000
>> hint 0x0
>> 2020-01-22 23:41:57.811003 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write prealloc
>> [0x550000~10000]
>> 2020-01-22 23:41:57.811006 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write forcing
>> csum_order to block_size_order 12
>> 2020-01-22 23:41:57.811009 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write
>> initialize csum setting for new blob Blob(0x55db81eadb20 blob([])
>> use_tracker(0x0 0x0) SharedBlob(0x55db81db3650 sbid 0x0)) csum_type
>> crc32c csum_order 12 csum_length 0x10000
>> 2020-01-22 23:41:57.811019 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write blob
>> Blob(0x55db81eadb20 blob([0x550000~10000] csum crc32c/0x1000)
>> use_tracker(0x0 0x0) SharedBlob(0x55db81db3650 sbid 0x0))
>> 2020-01-22 23:41:57.811029 7f6179f6a700 20
>> bluestore.blob(0x55db81eadb20) get_ref 0x9000~1000
>> Blob(0x55db81eadb20 blob([0x550000~10000] csum+has_unused
>> crc32c/0x1000 unused=0xfdff) use_tracker(0x0 0x0)
>> SharedBlob(0x55db81db3650 sbid 0x0))
>> 2020-01-22 23:41:57.811037 7f6179f6a700 20
>> bluestore.blob(0x55db81eadb20) get_ref init 0x10000, 10000
>> 2020-01-22 23:41:57.811042 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write lex
>> 0x9000~1000: 0x9000~1000 Blob(0x55db81eadb20 blob([0x550000~10000]
>> csum+has_unused crc32c/0x1000 unused=0xfdff) use_tracker(0x10000
>> 0x1000) SharedBlob(0x55db81db3650 sbid 0x0))
>> 2020-01-22 23:41:57.811050 7f6179f6a700 20
>> bluestore.BufferSpace(0x55db81db3668 in 0x55db812fa2a0) _discard
>> 0x9000~1000
>> 2020-01-22 23:41:57.811056 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write deferring
>> small 0x1000 write via deferred
>> 2020-01-22 23:41:57.811063 7f6179f6a700 20
>> bluestore(/home/if/luminous/build/dev/osd0) _wctx_finish lex_old
>> 0x9000~1000: 0x9000~1000 Blob(0x55db82051110 blob([0x540000~10000]
>> csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x10000
>> 0x7000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
>> ref_map(0x540000~10000=2))))
>> 2020-01-22 23:41:57.811074 7f6179f6a700 20
>> bluestore.extentmap(0x55db81b2b1d0) dirty_range mark inline shard dirty
>> 2020-01-22 23:41:57.811077 7f6179f6a700 10
>> bluestore(/home/if/luminous/build/dev/osd0) _write 3.24s1_head
>> 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
>> 0x9000~1000 = 0
>> ==============================================
>>
>> One can see that the _do_write_small() function considers some blobs
>> for reuse but is unable to reuse them (presumably because they are
>> shared and/or the relevant unused bits are cleared) and ends up making
>> a new allocation.
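>>
>> In other words, the effective decision reduces to something like the
>> following (a simplified model of the observed behavior, not the actual
>> _do_write_small() source):
>>
>> def can_reuse_blob(blob, offset, length):
>>     if blob.is_shared():          # EC overwrite cloned it -> shared,
>>         return False              # hence "ignoring immutable" above
>>     if not blob.is_unused(offset, length):
>>         return False              # relevant unused bits already cleared
>>     return True                   # otherwise the blob could be reused
>> # Every candidate in the log fails one of these checks, so a new blob
>> # and a fresh 64K allocation are made instead.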
>>
>> Actually, the above write pattern is the simplest (but quite
>> artificial) scenario that exposes the issue. Generally speaking, the
>> same behavior is observed when we overwrite some data at an RBD image
>> and the write maps to a previously used object. If the written data
>> partially overlaps an existing blob (including its unused part) and
>> this blob is prohibited from reuse (it's shared, which seems to be the
>> case for EC overwrites, or the relevant unused bits are cleared, i.e.
>> it has already been written at certain positions), BlueStore allocates
>> a new blob and preserves the previous ones (remember, we do a partial
>> overwrite). To some degree this resembles the behavior with compressed
>> blobs, where a stack of partially overlapping blobs might pile up until
>> garbage collection cleans them up.
>>
>> So, e.g., a full RBD image prefill followed by random small overwrites
>> will most probably result in some space overhead - up to 16x in the
>> worst (certainly very rare) case.
>>
>>
>> Additional notes:
>>
>> - This issue isn't present in master with the new
>> bluestore_min_alloc_size defaults (=4K).
>>
>> - In Nautilus (and in Octopus with bluestore_min_alloc_size_hdd set
>> back to 64K) this behavior is less visible due to the blob garbage
>> collection we introduced - see https://github.com/ceph/ceph/pull/30144.
>>
>> An up-to-3x increase ratio is still observable though.
>>
>> - The issue isn't observed for replicated pools.
>>
>> - Shared blobs created during EC overwrites seem to lack a rollback to
>> the non-shared state after op completion (and snapshot removal). Hence
>> they most probably pollute onodes and the DB (remember their
>> persistence mechanics) and negatively impact performance. This needs
>> more investigation/verification though.
>>
>>
>> The above analysis has two goals:
>>
>> 1) Show potential origin of space overhead for pre-Nautilus clusters.
>>
>> 2) Show the hidden danger of using allocation sizes greater than 4K
>> (i.e. the device block size?) for EC pools. Note, however, that our
>> research shows a 4K alloc size to be less efficient for spinner-backed
>> pools.
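>>
>> The worst case is simply the ratio of the allocation unit to the
>> per-shard write granularity (a trivial check, assuming each small
>> overwrite burns a whole allocation unit, as shown above):
>>
>> min_alloc_hdd = 64 * 1024  # bluestore_min_alloc_size_hdd in this setup
>> stripe_unit = 4 * 1024     # 24K stripe width / 6 data shards
>> print(min_alloc_hdd // stripe_unit)  # -> 16, the 16x worst case;
>> # with a 4K min_alloc_size the ratio is 1 and this overhead vanishes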
>>
>> https://github.com/ceph/ceph/pull/31867 suggests a 'partial' rollback
>> in this respect, at least for the default setup.
>>
>>
>> Thanks,
>>
>> Igor
>>
>> _______________________________________________
>> Dev mailing list -- dev(a)ceph.io
>> To unsubscribe send an email to dev-leave(a)ceph.io
>
> A while back, with Luminous, we tested different EC profiles for RBD
> (2+1, 2+2, 3+2, 4+2, 5+2, 5+3, 6+2); we found 5+3 stood out with
> significantly higher overhead. Stripe width and min alloc size were left
> at default. Tests were 4k/4m rand/seq as well as file-copy IO, with both
> HDD and SSD.
>
> As per your tests, it seems overwrites at different overlapping offsets
> would cause this overhead, so maybe the IO tests we had done just
> happened to cause fewer offset overlaps with all profiles but 5+3 -
> maybe, but I am not sure. It would be interesting if you could run the
> same test on, say, 4+2 and see if you still get allocation overhead as
> high as with 6+3.
>
> /Maged
>