Hi All!
Preface:
Recently we received a customer report about a discrepancy between the USED
and RAW USED columns in the 'ceph df' report for a specific pool:
RAW USED was approximately 100% higher than expected. The pool in question
uses EC 6+3 and keeps RBD images.
Other relevant cluster/OSD information: Luminous v12.2.12, BlueStore,
HDDs as main devices, EC stripe width = 24K.
Preliminary investigation showed a significant difference between the
bluestore_stored and bluestore_allocated performance counters on all
involved OSDs, with roughly the same ~100% increase ratio. According to
the perf counters, the majority of writes are 'small' ones.
Hence allocation overhead caused by small/fragmented writes was named
as an intermediate cause.
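For reference, these counters can be dumped per OSD via the admin socket,
something along these lines (osd.0 is just an example id):
  ceph daemon osd.0 perf dump | grep -E 'bluestore_(stored|allocated)'
  ceph daemon osd.0 perf dump | grep -E 'bluestore_write_(small|big)'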
But why does it happen?
Now I'd like to share a deeper analysis of what happens to objects in
BlueStore when RBD performs writes to the above-mentioned EC pool.
Let me narrow the scope to a single 64K RBD data object at a single
BlueStore instance, which holds one EC shard of a 384K (16 stripes * 24K,
i.e. 16 * 4K per shard) RBD data span.
Initially let's do a 384K write to RBD using 'dd if=./tmp of=/dev/rbd1
count=96 bs=4096 seek=0'.
At BlueStore this dd write results in a single 64K write (append) which
lands as a single blob containing a single 64K pextent (allocation).
Then do a second 360K write to the RBD image at a 24K offset, which in
fact results in a 0x1000~f000 write to the same object. In an ideal world
this would reuse the existing blob, the data would be merged (via some
BlueStore magic) and it would still take a single 64K pextent. But in
reality this doesn't happen: both a new blob and a new 64K allocation
are made.
As a result one has 64K stored in BlueStore and 128K allocated.
A third write at a 48K offset with 336K of data results in a third
blob/allocation and 192K of allocated data for the same 64K of stored data.
The same behavior continues while the target dd offset stays below 384K,
resulting in up to 16x space overhead (up to 16 blobs * 64K = 1M allocated
for 64K of stored data).
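For completeness, the whole sequence boils down to dd invocations like the
following (same image and input file as above; each write starts 24K
further in and extends to the 384K boundary):
  dd if=./tmp of=/dev/rbd1 bs=4096 count=96 seek=0
  dd if=./tmp of=/dev/rbd1 bs=4096 count=90 seek=6     # 360K at 24K offset
  dd if=./tmp of=/dev/rbd1 bs=4096 count=84 seek=12    # 336K at 48K offset
  ...
  dd if=./tmp of=/dev/rbd1 bs=4096 count=6 seek=90     # last 24K below 384K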
Here is a log snippet for the handling of one of the intermediate write
requests in this sequence:
==============================================
2020-01-22 23:41:57.808471 7f6190c55700 1 -- 10.100.2.124:6802/38298
<== osd.6 10.100.2.124:6826/39718 125 ==== MOSDECSubOpRead(3.24s1 32/26
ECSubRead(tid=30,
to_read={3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head=36864,4096,0},
attrs_to_read=)) v3 ==== 192+0+0 (1980749420 0 0) 0x55db8187af00 con
0x55db817c7000
2020-01-22 23:41:57.808669 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) read 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 0x9000~1000
2020-01-22 23:41:57.808707 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_read 0x9000~1000 size
0x10000 (65536)
2020-01-22 23:41:57.808712 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_read defaulting to
buffered read
2020-01-22 23:41:57.808723 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_read blob
Blob(0x55db82051110 blob([0x540000~10000] csum+has_unused crc32c/0x1000
unused=0xff) use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0 sbid
0x0)) need 0x9000~1000 cache has 0x[9000~1000]
2020-01-22 23:41:57.808745 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) read 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
0x9000~1000 = 4096
2020-01-22 23:41:57.808766 7f6179f6a700 1 -- 10.100.2.124:6802/38298
--> 10.100.2.124:6826/39718 -- MOSDECSubOpReadReply(3.24s0 32/26
ECSubReadReply(tid=30, attrs_read=0)) v2 -- 0x55db821e8580 con 0
2020-01-22 23:41:57.810251 7f6190c55700 1 -- 10.100.2.124:6802/38298
<== osd.6 10.100.2.124:6826/39718 126 ==== MOSDECSubOpWrite(3.24s1 32/26
ECSubWrite(tid=29, reqid=client.4215.0:113, at_version=32'11,
trim_to=0'0, roll_forward_to=32'10)) v2 ==== 6671+0+0 (3945069128 0 0)
0x55db81d45800 con 0x55db817c7000
2020-01-22 23:41:57.810524 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) queue_transactions existing
0x55db81bc7dc0 osr(3.24s1 0x55db81b30800)
2020-01-22 23:41:57.810540 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _txc_create osr
0x55db81bc7dc0 = 0x55db821dd200 seq 28
2020-01-22 23:41:57.810563 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _setattrs 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 2 keys
2020-01-22 23:41:57.810583 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _setattrs 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 2 keys = 0
2020-01-22 23:41:57.810591 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _set_alloc_hint 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
object_size 700416 write_size 700416 flags -
2020-01-22 23:41:57.810598 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _set_alloc_hint 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
object_size 700416 write_size 700416 flags - = 0
2020-01-22 23:41:57.810610 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head
0x55db81cc0a00) get_onode oid
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b key
0x81800000000000000325900c85217262'd_data.4.10716b8b4567.0000000000000000!='0xfffffffffffffffe000000000000000b'o'
2020-01-22 23:41:57.810678 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head
0x55db81cc0a00) r -2 v.len 0
2020-01-22 23:41:57.810698 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _touch 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b
2020-01-22 23:41:57.810704 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _assign_nid 1193
2020-01-22 23:41:57.810706 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _touch 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b = 0
2020-01-22 23:41:57.810713 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _clone_range 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# ->
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b from
0x9000~1000 to offset 0x9000
2020-01-22 23:41:57.810721 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _do_zero 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b 0x9000~1000
2020-01-22 23:41:57.810728 7f6179f6a700 20
bluestore.extentmap(0x55db821e8990) dirty_range mark inline shard dirty
2020-01-22 23:41:57.810732 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_zero extending size to 40960
2020-01-22 23:41:57.810734 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _do_zero 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b
0x9000~1000 = 0
2020-01-22 23:41:57.810740 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# ->
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b
0x9000~1000 -> 0x9000~1000
2020-01-22 23:41:57.810750 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range src
0x8000~8000: 0x8000~8000 Blob(0x55db82051110 blob([0x540000~10000]
csum+has_unused crc32c/0x1000 unused=0xff) use_tracker(0x10000 0x8000)
SharedBlob(0x55db82050fc0 sbid 0x0))
2020-01-22 23:41:57.810761 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _assign_blobid 10250
2020-01-22 23:41:57.810764 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head
0x55db81cc0a00) make_blob_shared Blob(0x55db82051110
blob([0x540000~10000] csum+has_unused crc32c/0x1000 unused=0xff)
use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0 sbid 0x0))
2020-01-22 23:41:57.810773 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head
0x55db81cc0a00) make_blob_shared now Blob(0x55db82051110
blob([0x540000~10000] csum+has_unused+shared crc32c/0x1000 unused=0xff)
use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0 loaded (sbid
0x280a ref_map(0x540000~10000=1))))
2020-01-22 23:41:57.810787 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range new
Blob(0x55db82051030 blob([0x540000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0xff) use_tracker(0x0 0x0)
SharedBlob(0x55db82050fc0 loaded (sbid 0x280a ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.810795 7f6179f6a700 20
bluestore.blob(0x55db82051030) get_ref 0x9000~1000 Blob(0x55db82051030
blob([0x540000~10000] csum+has_unused+shared crc32c/0x1000 unused=0xff)
use_tracker(0x0 0x0) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.810802 7f6179f6a700 20
bluestore.blob(0x55db82051030) get_ref init 0x10000, 10000
2020-01-22 23:41:57.810805 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range dst
0x9000~1000: 0x9000~1000 Blob(0x55db82051030 blob([0x540000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x10000
0x1000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.810813 7f6179f6a700 20
bluestore.extentmap(0x55db81b2b1d0) dirty_range mark inline shard dirty
2020-01-22 23:41:57.810817 7f6179f6a700 20
bluestore.extentmap(0x55db821e8990) dirty_range mark inline shard dirty
2020-01-22 23:41:57.810819 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _clone_range 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# ->
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b from
0x9000~1000 to offset 0x9000 = 0
2020-01-22 23:41:57.810829 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _write 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 0x9000~1000
2020-01-22 23:41:57.810838 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
0x9000~1000 - have 0x10000 (65536) bytes fadvise_flags 0x0
2020-01-22 23:41:57.810844 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _choose_write_options prefer
csum_order 12 target_blob_size 0x80000 compress=0 buffered=0
2020-01-22 23:41:57.810849 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small 0x9000~1000
2020-01-22 23:41:57.810853 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db820511f0 blob([0x530000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0x7f) use_tracker(0x10000 0x1000)
SharedBlob(0x55db820510a0 loaded (sbid 0x2809
ref_map(0x530000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810861 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small ignoring
immutable Blob(0x55db820511f0 blob([0x530000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0x7f) use_tracker(0x10000
0x1000) SharedBlob(0x55db820510a0 loaded (sbid 0x2809
ref_map(0x530000~10000=1))))
2020-01-22 23:41:57.810868 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db820512d0 blob([0x520000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0x3f) use_tracker(0x10000 0x1000)
SharedBlob(0x55db82051180 loaded (sbid 0x2808
ref_map(0x520000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810876 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db82051110 blob([0x540000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0xff) use_tracker(0x10000 0x8000)
SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
ref_map(0x540000~10000=2)))) bstart 0x0
2020-01-22 23:41:57.810883 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small ignoring
immutable Blob(0x55db82051110 blob([0x540000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x10000
0x8000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.810888 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db82050f50 blob([0x510000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0x1f) use_tracker(0x10000 0x1000)
SharedBlob(0x55db82051260 loaded (sbid 0x2807
ref_map(0x510000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810894 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db820513b0 blob([0x500000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0xf) use_tracker(0x10000 0x1000)
SharedBlob(0x55db82051340 loaded (sbid 0x2806
ref_map(0x500000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810900 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db81e38850 blob([0x4f0000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0x7) use_tracker(0x10000 0x1000)
SharedBlob(0x55db81db2690 loaded (sbid 0x2805
ref_map(0x4f0000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810951 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db81eadc70 blob([0x4e0000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0x3) use_tracker(0x10000 0x1000)
SharedBlob(0x55db81db3500 loaded (sbid 0x2804
ref_map(0x4e0000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810958 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db81eadb90 blob([0x4d0000~10000] csum+shared crc32c/0x1000)
use_tracker(0x10000 0x2000) SharedBlob(0x55db81eadc00 loaded (sbid
0x2803 ref_map(0x4d0000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810967 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _pad_zeros pad 0x0 + 0x0 on
front/back, now 0x9000~1000
2020-01-22 23:41:57.810974 7f6179f6a700 20
bluestore.blob(0x55db82051110) put_ref 0x9000~1000 Blob(0x55db82051110
blob([0x540000~10000] csum+has_unused+shared crc32c/0x1000 unused=0xff)
use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0 loaded (sbid
0x280a ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.810987 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write txc
0x55db821dd200 1 blobs
2020-01-22 23:41:57.810992 7f6179f6a700 10 stupidalloc 0x0x55db81107c80
allocate_int want_size 0x10000 alloc_unit 0x10000 hint 0x0
2020-01-22 23:41:57.811003 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write prealloc
[0x550000~10000]
2020-01-22 23:41:57.811006 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write forcing
csum_order to block_size_order 12
2020-01-22 23:41:57.811009 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write initialize
csum setting for new blob Blob(0x55db81eadb20 blob([]) use_tracker(0x0
0x0) SharedBlob(0x55db81db3650 sbid 0x0)) csum_type crc32c csum_order 12
csum_length 0x10000
2020-01-22 23:41:57.811019 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write blob
Blob(0x55db81eadb20 blob([0x550000~10000] csum crc32c/0x1000)
use_tracker(0x0 0x0) SharedBlob(0x55db81db3650 sbid 0x0))
2020-01-22 23:41:57.811029 7f6179f6a700 20
bluestore.blob(0x55db81eadb20) get_ref 0x9000~1000 Blob(0x55db81eadb20
blob([0x550000~10000] csum+has_unused crc32c/0x1000 unused=0xfdff)
use_tracker(0x0 0x0) SharedBlob(0x55db81db3650 sbid 0x0))
2020-01-22 23:41:57.811037 7f6179f6a700 20
bluestore.blob(0x55db81eadb20) get_ref init 0x10000, 10000
2020-01-22 23:41:57.811042 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write lex
0x9000~1000: 0x9000~1000 Blob(0x55db81eadb20 blob([0x550000~10000]
csum+has_unused crc32c/0x1000 unused=0xfdff) use_tracker(0x10000 0x1000)
SharedBlob(0x55db81db3650 sbid 0x0))
2020-01-22 23:41:57.811050 7f6179f6a700 20
bluestore.BufferSpace(0x55db81db3668 in 0x55db812fa2a0) _discard 0x9000~1000
2020-01-22 23:41:57.811056 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write deferring
small 0x1000 write via deferred
2020-01-22 23:41:57.811063 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _wctx_finish lex_old
0x9000~1000: 0x9000~1000 Blob(0x55db82051110 blob([0x540000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x10000
0x7000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.811074 7f6179f6a700 20
bluestore.extentmap(0x55db81b2b1d0) dirty_range mark inline shard dirty
2020-01-22 23:41:57.811077 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _write 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
0x9000~1000 = 0
==============================================
One can see that the _do_write_small() function considers some blobs for
reuse but is unable to use them (presumably because they are shared and/or
the relevant 'unused' bits are cleared) and ends up with a new allocation.
Actually the above write pattern is the simplest (though quite artificial)
scenario that exposes the issue. Generally speaking, the same behavior is
observed when we overwrite some data in an RBD image and the overwrite maps
to a previously used object. If the written data partially overlaps an
existing blob (including its unused part) and that blob is prohibited from
reuse (it is shared, which seems to be the case for EC overwrites, or the
relevant 'unused' bits are cleared, i.e. it has already been written at
certain positions), BlueStore allocates a new blob and preserves the
previous ones (remember, we are doing a partial overwrite). To some degree
this resembles the behavior with compressed blobs, where a stack of
partially overlapping blobs can accumulate until garbage collection cleans
it up.
So, e.g., a full RBD image prefill followed by random small overwrites will
most probably result in some space overhead - up to 16x in the worst (and
certainly very rare) case.
Additional notes:
- This issue isn't present in master with the new bluestore_min_alloc_size
default (=4K).
- In Nautilus (and Octopus with bluestore_min_alloc_size_hdd set back to
64K) this behavior is less visible thanks to the blob garbage collection we
introduced - see https://github.com/ceph/ceph/pull/30144
But an increase ratio of up to 3x is still observable.
- The issue isn't observed for replicated pools.
- Shared blobs created during EC overwrites seem to lack a rollback to the
non-shared state after op completion (and snapshot removal). Hence they
most probably pollute onodes and the DB (remember their persistence
mechanics) and negatively impact performance. This needs more
investigation/verification though.
The above analysis has two goals:
1) Show a potential origin of the space overhead for pre-Nautilus clusters.
2) Show the hidden danger of using allocation sizes higher than 4K
(i.e. the device block size?) for EC pools. However, our research shows
that a 4K alloc size is less efficient for spinner-backed pools.
https://github.com/ceph/ceph/pull/31867 suggests a 'partial' rollback in
this respect, at least for the default setup.
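As a side note, the min alloc size an OSD is configured with can be checked
via its admin socket, e.g. (osd.0 is just an example id; note that the value
actually in effect was captured at mkfs time, so for existing OSDs it may
differ from the current config):
  ceph daemon osd.0 config get bluestore_min_alloc_size_hdd
  ceph daemon osd.0 config get bluestore_min_alloc_size_ssd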
Thanks,
Igor
Hi All,
I have a query regarding objecter behaviour for homeless sessions. In
situations when all OSDs containing copies (*let's say replication 3*) of an
object are down, the objecter assigns a homeless session (OSD=-1) to the
client request. Such a request makes a radosgw thread hang indefinitely, as
the data can never be served while all the required OSDs are down. With
multiple similar requests, all the radosgw threads get exhausted and hang
indefinitely waiting for the OSDs to come up. This creates complete service
unavailability, as no rgw threads are left to process valid requests which
could have been directed towards active PGs/OSDs.
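As a possible stopgap (not a fix for the objecter behaviour itself), it
might be worth checking whether the librados op timeouts bound such hangs
when set for radosgw's RADOS handles, e.g. in ceph.conf on the rgw node
(the section name below is illustrative):
  [client.rgw.gateway1]
      rados_osd_op_timeout = 30
      rados_mon_op_timeout = 30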
That said, I think we should have behaviour in the objecter or radosgw to
terminate the request and return early in case of a homeless session. Let
me know your thoughts on this.
Regards,
Biswajeet
Hi everyone.
The next DocuBetter meeting is scheduled for tomorrow. This is at
the following time:
1800 PST 22 Jan 2020
0100 UTC 22 Jan 2020
1200 AEST 23 Jan 2020
Etherpad: https://pad.ceph.com/p/Ceph_Documentation
Meeting: https://bluejeans.com/908675367
Agenda: This week we will be discussing the new Getting Started
Guide, and improving docs bug reporting from the Ceph community
of users.
Dear Community,
I would like to share some design ideas around the above topic. Feedback is
welcome!
Current State
- in "pull mode" [1] we have the same guarantees as the multisite syncing
mechanism (guarantee against HW/SW failures). On top of that, if writing
the event to RADOS fails, this trickle back as sync failure, which means
that the master zone will try to sync the pubsub zone
- in "push mode" [2] we send the notification from the ops context that
triggered the notification. The original operation is blocked until we get
a reply from the endpoint. As part of the configuration for the endpoint,
we also configure the "ack level", indicating whether we block until we get
a reply from the endpoint or not.
Since the operation response is not sent back to the client until the
endpoint acks, this method guarantees against any failure in the radosgw
(at the cost of adding latency to the operation).
This, however, does not guarantee delivery if the endpoint is down or
disconnected. The endpoints we interact with (rabbitmq, kafka) usually have
built-in redundancy mechanisms, but that does not cover the case where
there is a network disconnect between our gateways and these systems.
In some cases we can get a nack from the endpoint, indicating that our
message will never reach it. But we can only log these cases:
- we cannot fail the operation that triggered us, because we send the
notification only after the actual operation (e.g. "put object") was done
(= no atomicity)
- there is no retry mechanism (in theory, we could add one)
Next Phase Requirements
We would like to add delivery guarantee to "push mode" for endpoint
failures. For that we would use a message queue with the following features:
- rados backed, so it would survive HW/SW failures
- blocking only on local read/writes (so it introduces smaller latency than
over-the-wire endpoint acks)
- has reserve/commit semantics, so we can "reserve" before the operation
(e.g. "put object") was done, and fail it if we cannot reserve a slot on
the queue, and commit the notification to the queue only after the
operation was successful (and unreserve if the operation failed)
- we would have a retry mechanism based on the queue, which means that if a
notification was successfully pushed into the queue, we can assume it would
(eventually) be successfully delivered to the endpoint
Proposed Solution
- use cls_queue [3] (cls_queue is not omap based, hence no built-in
iops limitations)
- add reserve/commit functionality (probably store that info in the queue
head)
- one or more dedicated threads should read requests from the queue, send
the notifications to the endpoints, and wait for the replies (if needed);
this should be done via coroutines
- acked requests are removed from the queue, nacked or timed-out requests
should be retried (at least for a while)
- both mechanisms would coexist; this would be configurable per topic
- as a stretch goal, we may add a "best effort queue". This would be
similar to the cls_queue solution, but won't address radosgw failures (as
the queue would be in-memory), only endpoint failures/disconnects
- for now, this mechanism won't be supported for pushing events from the
pubsub zone (="pull+push mode"), but might be added if users would find it
useful
Yuval
[1] https://docs.ceph.com/docs/master/radosgw/pubsub-module/
[2] https://docs.ceph.com/docs/master/radosgw/notifications/
[3] https://github.com/ceph/ceph/tree/master/src/cls/queue
Hi everyone,
Quick reminder that the early-bird registration for Cephalocon Seoul (Mar
3-5) ends tonight! We also have the hotel booking link and code up on the
site (finally--sorry for the delay).
https://ceph.io/cephalocon/seoul-2020/
Hope to see you there!
sage
This is probably a question for David Galloway (?)
Did Redmine get updated recently? (Like, very recently?) Up until, well,
very recently (yesterday?) curl commands like the following were producing
meaningful output even on issues that have no relations (43725 has one):
curl --silent 'https://tracker.ceph.com/issues/43725.json?include=relations' | jq '.issue.relations[]'
Now it produces an error:
jq: error (at <stdin>:0): Cannot iterate over null (null)
Maybe that is related to fact that the following curl command is not producing
any output, even though issue 43725 has relations:
curl --silent 'https://tracker.ceph.com/issues/43725/relations.json'
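One way to compare what the API returns with and without the parameter (to
see whether anything else in the response shape changed) is:
curl --silent 'https://tracker.ceph.com/issues/43725.json' | jq '.issue | keys'
curl --silent 'https://tracker.ceph.com/issues/43725.json?include=relations' | jq '.issue | keys'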
Any ideas appreciated.
Nathan
--
Nathan Cutler
Software Engineer Distributed Storage
SUSE LINUX, s.r.o.
Tel.: +420 284 084 037
Hi Cephers,
We will have a booth at DevConf this year!
https://www.devconf.info/cz/
If you're interested in helping at the booth at DevConf, please reply to me
directly, and I can provide more details.
I'm still working on getting a shared booth for FOSDEM with the CentOS
table. Also, I will try to get us signed up for a BoF session. If you're
interested, please also let me know directly. Thanks!
https://fosdem.org/2020/
--
Mike Perez
he/him
Ceph Community Manager
M: +1-951-572-2633
494C 5D25 2968 D361 65FB 3829 94BC D781 ADA8 8AEA
@Thingee <https://twitter.com/thingee>
<https://www.linkedin.com/thingee>
[Adding Sebastian, dev(a)ceph.io]
Some things to improve with the OSD create path!
On Mon, 20 Jan 2020, Yaarit Hatuka wrote:
> Here are a few Insights from this debugging process - I hope I got it right:
>
> 1. Adding the device with "/dev/disk/by-id/...." did not work for me, it
> failed in pybind/mgr/cephadm/module.py at:
> https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/module.py#L…
> "if len(list(set(devices) & set(osd['devices']))) == 0"
> because osd['devices'] has the devices listed as "/dev/sdX", but
> set(devices) has them by their dev-id.... (which is the syntax specified as
> the example in the docs, which I followed).
> It took me a couple of days to debug this :-)
>
> 2. I think that cephadm should be more verbose by default. When creating
> OSD it only writes "Created osd(s) on host 'mira027.front.sepia.ceph.com'"
> (even in case creation failed...). It will help if it outputs the different
> stages so that the user can see where it stopped in case of error.
>
> 3. ceph status shows that the OSD was added even if the orchestrator failed
> to add it (but it's marked down and out).
IIUC this is ceph-volume's failure path not cleaning up? Is this the
failure you saw when you passed the /dev/disk/by-id device path?
> 4. I couldn't find the logs that cephadm produces.
> I searched for them on both the source (mira010) and the target (mira027)
> machines in /var/log/ceph/<fsid>/* and couldn't find any print from either
> the cephadm mgr module nor the cephadm script. I also looked at /var/log/*.
> Where are they hiding?
The ceph-volume.log is the one to look at.
> 5. After ceph-volume creates its LVs, the host's
> lvdisplay/vgdisplay/pvdisplay showed nothing. I had to run "pvscan --cache"
> on the host in order for those commands to output the current state. This
> may confuse the user.
>
> 6. I think it's also a good idea to have another cephadm feature "cephadm
> shell --host=<host>" to open a cephadm shell on a remote host. I wanted to
> run "ceph-volume lvm zap" on one of the remote hosts and to do that I sshed
> over, copied the cephadm script and ran "cephadm shell". it would be cool
> if we could do that from the original machine.
The cephadm script doesn't know how to ssh. We could probably teach it,
though, for something like this... but it might be simpler for the
user to just 'scp cephadm $host:', as that's basically what cephadm would
do to "install" itself remotely?
sage