Hi All!
Preface:
Recently we received a customer report about a discrepancy between the USED
and RAW USED columns in the 'ceph df' report for a specific pool:
RAW USED was approximately 100% higher than expected. The pool in question
uses EC 6+3 and keeps RBD images.
Other relevant cluster/OSD information: Luminous v12.2.12, BlueStore,
HDDs as main devices, EC stripe width = 24K.
Preliminary investigation showed a significant difference between the
bluestore_stored and bluestore_allocated performance counters on all
involved OSDs, with roughly the same ~100% increase ratio. According to
the perf counters, the majority of writes are 'small' ones.
Hence allocation overhead caused by small/fragmented writes was named
as an intermediate cause.
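For reference, these counters can be dumped per OSD via the admin socket,
something along these lines (osd.0 is just an example id):
  ceph daemon osd.0 perf dump | grep -E 'bluestore_(stored|allocated)'
  ceph daemon osd.0 perf dump | grep -E 'bluestore_write_(small|big)'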
But why does it happen?
Now I'd like to share a deeper analysis of what happens to objects in
BlueStore when RBD performs writes to the above-mentioned EC pool.
Let me narrow the scope to a single 64K RBD data object at a single
BlueStore instance, which holds one EC shard of a 384K (16 stripes * 24K,
i.e. 16 * 4K per shard) RBD data span.
Initially let's do a 384K write to RBD using 'dd if=./tmp of=/dev/rbd1
count=96 bs=4096 seek=0'.
At BlueStore this dd write results in a single 64K write (append) which
lands as a single blob containing a single 64K pextent (allocation).
Then do a second 360K write to the RBD image at a 24K offset, which in
fact results in a 0x1000~f000 write to the same object. In an ideal world
this would reuse the existing blob, the data would be merged (via some
BlueStore magic) and it would still take a single 64K pextent. But in
reality this doesn't happen: both a new blob and a new 64K allocation
are made.
As a result one has 64K stored in BlueStore and 128K allocated.
A third write at a 48K offset with 336K of data results in a third
blob/allocation and 192K of allocated data for the same 64K of stored data.
The same behavior continues while the target dd offset stays below 384K,
resulting in up to 16x space overhead (up to 16 blobs * 64K = 1M allocated
for 64K of stored data).
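For completeness, the whole sequence boils down to dd invocations like the
following (same image and input file as above; each write starts 24K
further in and extends to the 384K boundary):
  dd if=./tmp of=/dev/rbd1 bs=4096 count=96 seek=0
  dd if=./tmp of=/dev/rbd1 bs=4096 count=90 seek=6     # 360K at 24K offset
  dd if=./tmp of=/dev/rbd1 bs=4096 count=84 seek=12    # 336K at 48K offset
  ...
  dd if=./tmp of=/dev/rbd1 bs=4096 count=6 seek=90     # last 24K below 384K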
Here is a log snippet for the handling of one of the intermediate write
requests in this sequence:
==============================================
2020-01-22 23:41:57.808471 7f6190c55700 1 -- 10.100.2.124:6802/38298
<== osd.6 10.100.2.124:6826/39718 125 ==== MOSDECSubOpRead(3.24s1 32/26
ECSubRead(tid=30,
to_read={3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head=36864,4096,0},
attrs_to_read=)) v3 ==== 192+0+0 (1980749420 0 0) 0x55db8187af00 con
0x55db817c7000
2020-01-22 23:41:57.808669 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) read 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 0x9000~1000
2020-01-22 23:41:57.808707 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_read 0x9000~1000 size
0x10000 (65536)
2020-01-22 23:41:57.808712 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_read defaulting to
buffered read
2020-01-22 23:41:57.808723 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_read blob
Blob(0x55db82051110 blob([0x540000~10000] csum+has_unused crc32c/0x1000
unused=0xff) use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0 sbid
0x0)) need 0x9000~1000 cache has 0x[9000~1000]
2020-01-22 23:41:57.808745 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) read 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
0x9000~1000 = 4096
2020-01-22 23:41:57.808766 7f6179f6a700 1 -- 10.100.2.124:6802/38298
--> 10.100.2.124:6826/39718 -- MOSDECSubOpReadReply(3.24s0 32/26
ECSubReadReply(tid=30, attrs_read=0)) v2 -- 0x55db821e8580 con 0
2020-01-22 23:41:57.810251 7f6190c55700 1 -- 10.100.2.124:6802/38298
<== osd.6 10.100.2.124:6826/39718 126 ==== MOSDECSubOpWrite(3.24s1 32/26
ECSubWrite(tid=29, reqid=client.4215.0:113, at_version=32'11,
trim_to=0'0, roll_forward_to=32'10)) v2 ==== 6671+0+0 (3945069128 0 0)
0x55db81d45800 con 0x55db817c7000
2020-01-22 23:41:57.810524 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) queue_transactions existing
0x55db81bc7dc0 osr(3.24s1 0x55db81b30800)
2020-01-22 23:41:57.810540 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _txc_create osr
0x55db81bc7dc0 = 0x55db821dd200 seq 28
2020-01-22 23:41:57.810563 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _setattrs 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 2 keys
2020-01-22 23:41:57.810583 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _setattrs 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 2 keys = 0
2020-01-22 23:41:57.810591 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _set_alloc_hint 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
object_size 700416 write_size 700416 flags -
2020-01-22 23:41:57.810598 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _set_alloc_hint 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
object_size 700416 write_size 700416 flags - = 0
2020-01-22 23:41:57.810610 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head
0x55db81cc0a00) get_onode oid
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b key
0x81800000000000000325900c85217262'd_data.4.10716b8b4567.0000000000000000!='0xfffffffffffffffe000000000000000b'o'
2020-01-22 23:41:57.810678 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head
0x55db81cc0a00) r -2 v.len 0
2020-01-22 23:41:57.810698 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _touch 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b
2020-01-22 23:41:57.810704 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _assign_nid 1193
2020-01-22 23:41:57.810706 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _touch 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b = 0
2020-01-22 23:41:57.810713 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _clone_range 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# ->
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b from
0x9000~1000 to offset 0x9000
2020-01-22 23:41:57.810721 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _do_zero 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b 0x9000~1000
2020-01-22 23:41:57.810728 7f6179f6a700 20
bluestore.extentmap(0x55db821e8990) dirty_range mark inline shard dirty
2020-01-22 23:41:57.810732 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_zero extending size to 40960
2020-01-22 23:41:57.810734 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _do_zero 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b
0x9000~1000 = 0
2020-01-22 23:41:57.810740 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# ->
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b
0x9000~1000 -> 0x9000~1000
2020-01-22 23:41:57.810750 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range src
0x8000~8000: 0x8000~8000 Blob(0x55db82051110 blob([0x540000~10000]
csum+has_unused crc32c/0x1000 unused=0xff) use_tracker(0x10000 0x8000)
SharedBlob(0x55db82050fc0 sbid 0x0))
2020-01-22 23:41:57.810761 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _assign_blobid 10250
2020-01-22 23:41:57.810764 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head
0x55db81cc0a00) make_blob_shared Blob(0x55db82051110
blob([0x540000~10000] csum+has_unused crc32c/0x1000 unused=0xff)
use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0 sbid 0x0))
2020-01-22 23:41:57.810773 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head
0x55db81cc0a00) make_blob_shared now Blob(0x55db82051110
blob([0x540000~10000] csum+has_unused+shared crc32c/0x1000 unused=0xff)
use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0 loaded (sbid
0x280a ref_map(0x540000~10000=1))))
2020-01-22 23:41:57.810787 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range new
Blob(0x55db82051030 blob([0x540000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0xff) use_tracker(0x0 0x0)
SharedBlob(0x55db82050fc0 loaded (sbid 0x280a ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.810795 7f6179f6a700 20
bluestore.blob(0x55db82051030) get_ref 0x9000~1000 Blob(0x55db82051030
blob([0x540000~10000] csum+has_unused+shared crc32c/0x1000 unused=0xff)
use_tracker(0x0 0x0) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.810802 7f6179f6a700 20
bluestore.blob(0x55db82051030) get_ref init 0x10000, 10000
2020-01-22 23:41:57.810805 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range dst
0x9000~1000: 0x9000~1000 Blob(0x55db82051030 blob([0x540000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x10000
0x1000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.810813 7f6179f6a700 20
bluestore.extentmap(0x55db81b2b1d0) dirty_range mark inline shard dirty
2020-01-22 23:41:57.810817 7f6179f6a700 20
bluestore.extentmap(0x55db821e8990) dirty_range mark inline shard dirty
2020-01-22 23:41:57.810819 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _clone_range 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# ->
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b from
0x9000~1000 to offset 0x9000 = 0
2020-01-22 23:41:57.810829 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _write 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 0x9000~1000
2020-01-22 23:41:57.810838 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
0x9000~1000 - have 0x10000 (65536) bytes fadvise_flags 0x0
2020-01-22 23:41:57.810844 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _choose_write_options prefer
csum_order 12 target_blob_size 0x80000 compress=0 buffered=0
2020-01-22 23:41:57.810849 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small 0x9000~1000
2020-01-22 23:41:57.810853 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db820511f0 blob([0x530000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0x7f) use_tracker(0x10000 0x1000)
SharedBlob(0x55db820510a0 loaded (sbid 0x2809
ref_map(0x530000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810861 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small ignoring
immutable Blob(0x55db820511f0 blob([0x530000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0x7f) use_tracker(0x10000
0x1000) SharedBlob(0x55db820510a0 loaded (sbid 0x2809
ref_map(0x530000~10000=1))))
2020-01-22 23:41:57.810868 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db820512d0 blob([0x520000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0x3f) use_tracker(0x10000 0x1000)
SharedBlob(0x55db82051180 loaded (sbid 0x2808
ref_map(0x520000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810876 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db82051110 blob([0x540000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0xff) use_tracker(0x10000 0x8000)
SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
ref_map(0x540000~10000=2)))) bstart 0x0
2020-01-22 23:41:57.810883 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small ignoring
immutable Blob(0x55db82051110 blob([0x540000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x10000
0x8000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.810888 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db82050f50 blob([0x510000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0x1f) use_tracker(0x10000 0x1000)
SharedBlob(0x55db82051260 loaded (sbid 0x2807
ref_map(0x510000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810894 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db820513b0 blob([0x500000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0xf) use_tracker(0x10000 0x1000)
SharedBlob(0x55db82051340 loaded (sbid 0x2806
ref_map(0x500000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810900 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db81e38850 blob([0x4f0000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0x7) use_tracker(0x10000 0x1000)
SharedBlob(0x55db81db2690 loaded (sbid 0x2805
ref_map(0x4f0000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810951 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db81eadc70 blob([0x4e0000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0x3) use_tracker(0x10000 0x1000)
SharedBlob(0x55db81db3500 loaded (sbid 0x2804
ref_map(0x4e0000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810958 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db81eadb90 blob([0x4d0000~10000] csum+shared crc32c/0x1000)
use_tracker(0x10000 0x2000) SharedBlob(0x55db81eadc00 loaded (sbid
0x2803 ref_map(0x4d0000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810967 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _pad_zeros pad 0x0 + 0x0 on
front/back, now 0x9000~1000
2020-01-22 23:41:57.810974 7f6179f6a700 20
bluestore.blob(0x55db82051110) put_ref 0x9000~1000 Blob(0x55db82051110
blob([0x540000~10000] csum+has_unused+shared crc32c/0x1000 unused=0xff)
use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0 loaded (sbid
0x280a ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.810987 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write txc
0x55db821dd200 1 blobs
2020-01-22 23:41:57.810992 7f6179f6a700 10 stupidalloc 0x0x55db81107c80
allocate_int want_size 0x10000 alloc_unit 0x10000 hint 0x0
2020-01-22 23:41:57.811003 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write prealloc
[0x550000~10000]
2020-01-22 23:41:57.811006 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write forcing
csum_order to block_size_order 12
2020-01-22 23:41:57.811009 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write initialize
csum setting for new blob Blob(0x55db81eadb20 blob([]) use_tracker(0x0
0x0) SharedBlob(0x55db81db3650 sbid 0x0)) csum_type crc32c csum_order 12
csum_length 0x10000
2020-01-22 23:41:57.811019 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write blob
Blob(0x55db81eadb20 blob([0x550000~10000] csum crc32c/0x1000)
use_tracker(0x0 0x0) SharedBlob(0x55db81db3650 sbid 0x0))
2020-01-22 23:41:57.811029 7f6179f6a700 20
bluestore.blob(0x55db81eadb20) get_ref 0x9000~1000 Blob(0x55db81eadb20
blob([0x550000~10000] csum+has_unused crc32c/0x1000 unused=0xfdff)
use_tracker(0x0 0x0) SharedBlob(0x55db81db3650 sbid 0x0))
2020-01-22 23:41:57.811037 7f6179f6a700 20
bluestore.blob(0x55db81eadb20) get_ref init 0x10000, 10000
2020-01-22 23:41:57.811042 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write lex
0x9000~1000: 0x9000~1000 Blob(0x55db81eadb20 blob([0x550000~10000]
csum+has_unused crc32c/0x1000 unused=0xfdff) use_tracker(0x10000 0x1000)
SharedBlob(0x55db81db3650 sbid 0x0))
2020-01-22 23:41:57.811050 7f6179f6a700 20
bluestore.BufferSpace(0x55db81db3668 in 0x55db812fa2a0) _discard 0x9000~1000
2020-01-22 23:41:57.811056 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write deferring
small 0x1000 write via deferred
2020-01-22 23:41:57.811063 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _wctx_finish lex_old
0x9000~1000: 0x9000~1000 Blob(0x55db82051110 blob([0x540000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x10000
0x7000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.811074 7f6179f6a700 20
bluestore.extentmap(0x55db81b2b1d0) dirty_range mark inline shard dirty
2020-01-22 23:41:57.811077 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _write 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
0x9000~1000 = 0
==============================================
One can see that the _do_write_small() function considers some blobs for
reuse but is unable to use them (presumably because they are shared and/or
the relevant 'unused' bits are cleared) and ends up with a new allocation.
Actually the above write pattern is the simplest (though quite artificial)
scenario that exposes the issue. Generally speaking, the same behavior is
observed when we overwrite some data in an RBD image and the overwrite maps
to a previously used object. If the written data partially overlaps an
existing blob (including its unused part) and that blob is prohibited from
reuse (it is shared, which seems to be the case for EC overwrites, or the
relevant 'unused' bits are cleared, i.e. it has already been written at
certain positions), BlueStore allocates a new blob and preserves the
previous ones (remember, we are doing a partial overwrite). To some degree
this resembles the behavior with compressed blobs, where a stack of
partially overlapping blobs can accumulate until garbage collection cleans
it up.
So, e.g., a full RBD image prefill followed by random small overwrites will
most probably result in some space overhead - up to 16x in the worst (and
certainly very rare) case.
Additional notes:
- This issue isn't present in master with the new bluestore_min_alloc_size
default (=4K).
- In Nautilus (and Octopus with bluestore_min_alloc_size_hdd set back to
64K) this behavior is less visible thanks to the blob garbage collection we
introduced - see https://github.com/ceph/ceph/pull/30144
But an increase ratio of up to 3x is still observable.
- The issue isn't observed for replicated pools.
- Shared blobs created during EC overwrites seem to lack a rollback to the
non-shared state after op completion (and snapshot removal). Hence they
most probably pollute onodes and the DB (remember their persistence
mechanics) and negatively impact performance. This needs more
investigation/verification though.
The above analysis has two goals:
1) Show a potential origin of the space overhead for pre-Nautilus clusters.
2) Show the hidden danger of using allocation sizes higher than 4K
(i.e. the device block size?) for EC pools. However, our research shows
that a 4K alloc size is less efficient for spinner-backed pools.
https://github.com/ceph/ceph/pull/31867 suggests a 'partial' rollback in
this respect, at least for the default setup.
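As a side note, the min alloc size an OSD is configured with can be checked
via its admin socket, e.g. (osd.0 is just an example id; note that the value
actually in effect was captured at mkfs time, so for existing OSDs it may
differ from the current config):
  ceph daemon osd.0 config get bluestore_min_alloc_size_hdd
  ceph daemon osd.0 config get bluestore_min_alloc_size_ssd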
Thanks,
Igor
Hi All,
I have a query regarding objecter behaviour for homeless sessions. In
situations when all OSDs containing copies (*let's say replication 3*) of an
object are down, the objecter assigns a homeless session (OSD=-1) to the
client request. Such a request makes a radosgw thread hang indefinitely, as
the data can never be served while all the required OSDs are down. With
multiple similar requests, all the radosgw threads get exhausted and hang
indefinitely waiting for the OSDs to come up. This creates complete service
unavailability, as no rgw threads are left to process valid requests which
could have been directed towards active PGs/OSDs.
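As a possible stopgap (not a fix for the objecter behaviour itself), it
might be worth checking whether the librados op timeouts bound such hangs
when set for radosgw's RADOS handles, e.g. in ceph.conf on the rgw node
(the section name below is illustrative):
  [client.rgw.gateway1]
      rados_osd_op_timeout = 30
      rados_mon_op_timeout = 30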
That said, I think we should have behaviour in the objecter or radosgw to
terminate the request and return early in case of a homeless session. Let
me know your thoughts on this.
Regards,
Biswajeet
Hi everyone.
The next DocuBetter meeting is scheduled for tomorrow. This is at
the following time:
1800 PST 22 Jan 2020
0100 UTC 22 Jan 2020
1200 AEST 23 Jan 2020
Etherpad: https://pad.ceph.com/p/Ceph_Documentation
Meeting: https://bluejeans.com/908675367
Agenda: This week we will be discussing the new Getting Started
Guide, and improving docs bug reporting from the Ceph community
of users.
Dear Community,
I would like to share some design ideas around the above topic. Feedback is
welcome!
Current State
- in "pull mode" [1] we have the same guarantees as the multisite syncing
mechanism (guarantee against HW/SW failures). On top of that, if writing
the event to RADOS fails, this trickle back as sync failure, which means
that the master zone will try to sync the pubsub zone
- in "push mode" [2] we send the notification from the ops context that
triggered the notification. The original operation is blocked until we get
a reply from the endpoint. As part of the configuration for the endpoint,
we also configure the "ack level", indicating whether we block until we get
a reply from the endpoint or not.
Since the operation response is not sent back to the client until the
endpoint acks, this method guarantees against any failure in the radosgw
(at the cost of adding latency to the operation).
This, however, does not guarantee delivery if the endpoint is down or
disconnected. The endpoints we interact with (rabbitmq, kafka) usually have
built-in redundancy mechanisms, but that does not cover the case where
there is a network disconnect between our gateways and these systems.
In some cases we can get a nack from the endpoint, indicating that our
message will never reach it. But we can only log these cases:
- we cannot fail the operation that triggered us, because we send the
notification only after the actual operation (e.g. "put object") was done
(= no atomicity)
- there is no retry mechanism (in theory, we could add one)
Next Phase Requirements
We would like to add delivery guarantee to "push mode" for endpoint
failures. For that we would use a message queue with the following features:
- rados backed, so it would survive HW/SW failures
- blocking only on local read/writes (so it introduces smaller latency than
over-the-wire endpoint acks)
- has reserve/commit semantics, so we can "reserve" before the operation
(e.g. "put object") was done, and fail it if we cannot reserve a slot on
the queue, and commit the notification to the queue only after the
operation was successful (and unreserve if the operation failed)
- we would have a retry mechanism based on the queue, which means that if a
notification was successfully pushed into the queue, we can assume it would
(eventually) be successfully delivered to the endpoint
Proposed Solution
- use cls_queue [3] (cls_queue is not omap based, hence no built-in
iops limitations)
- add reserve/commit functionality (probably store that info in the queue
head)
- one or more dedicated threads should read requests from the queue, send
the notifications to the endpoints, and wait for the replies (if needed);
this should be done via coroutines
- acked requests are removed from the queue, nacked or timed-out requests
should be retried (at least for a while)
- both mechanisms would coexist; this would be configurable per topic
- as a stretch goal, we may add a "best effort queue". This would be
similar to the cls_queue solution, but won't address radosgw failures (as
the queue would be in-memory), only endpoint failures/disconnects
- for now, this mechanism won't be supported for pushing events from the
pubsub zone (="pull+push mode"), but might be added if users would find it
useful
Yuval
[1] https://docs.ceph.com/docs/master/radosgw/pubsub-module/
[2] https://docs.ceph.com/docs/master/radosgw/notifications/
[3] https://github.com/ceph/ceph/tree/master/src/cls/queue
Hi everyone,
Quick reminder that the early-bird registration for Cephalocon Seoul (Mar
3-5) ends tonight! We also have the hotel booking link and code up on the
site (finally--sorry for the delay).
https://ceph.io/cephalocon/seoul-2020/
Hope to see you there!
sage
This is probably a question for David Galloway (?)
Did Redmine get updated recently? (Like, very recently?) Up until, well,
very recently (yesterday?) curl commands like the following were producing
meaningful output even on issues that have no relations (43725 has one):
curl --silent 'https://tracker.ceph.com/issues/43725.json?include=relations' | jq '.issue.relations[]'
Now it produces an error:
jq: error (at <stdin>:0): Cannot iterate over null (null)
Maybe that is related to fact that the following curl command is not producing
any output, even though issue 43725 has relations:
curl --silent 'https://tracker.ceph.com/issues/43725/relations.json'
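One way to compare what the API returns with and without the parameter (to
see whether anything else in the response shape changed) is:
curl --silent 'https://tracker.ceph.com/issues/43725.json' | jq '.issue | keys'
curl --silent 'https://tracker.ceph.com/issues/43725.json?include=relations' | jq '.issue | keys'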
Any ideas appreciated.
Nathan
--
Nathan Cutler
Software Engineer Distributed Storage
SUSE LINUX, s.r.o.
Tel.: +420 284 084 037
Hi Cephers,
We will have a booth at DevConf this year!
https://www.devconf.info/cz/
If you're interested in helping at the booth at DevConf, please reply to me
directly, and I can provide more details.
I'm still working on getting a shared booth for FOSDEM with the CentOS
table. Also, I will try to get us signed up for a BoF session. If you're
interested, please also let me know directly. Thanks!
https://fosdem.org/2020/
--
Mike Perez
he/him
Ceph Community Manager
M: +1-951-572-2633
494C 5D25 2968 D361 65FB 3829 94BC D781 ADA8 8AEA
@Thingee <https://twitter.com/thingee>
<https://www.linkedin.com/thingee>
[Adding Sebastian, dev(a)ceph.io]
Some things to improve with the OSD create path!
On Mon, 20 Jan 2020, Yaarit Hatuka wrote:
> Here are a few Insights from this debugging process - I hope I got it right:
>
> 1. Adding the device with "/dev/disk/by-id/...." did not work for me, it
> failed in pybind/mgr/cephadm/module.py at:
> https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/module.py#L…
> "if len(list(set(devices) & set(osd['devices']))) == 0"
> because osd['devices'] has the devices listed as "/dev/sdX", but
> set(devices) has them by their dev-id.... (which is the syntax specified as
> the example in the docs, which I followed).
> It took me a couple of days to debug this :-)
>
> 2. I think that cephadm should be more verbose by default. When creating
> OSD it only writes "Created osd(s) on host 'mira027.front.sepia.ceph.com'"
> (even in case creation failed...). It will help if it outputs the different
> stages so that the user can see where it stopped in case of error.
>
> 3. ceph status shows that the OSD was added even if the orchestrator failed
> to add it (but it's marked down and out).
IIUC this is ceph-volume's failure path not cleaning up? Is this the
failure you saw when you passed the /dev/disk/by-id device path?
> 4. I couldn't find the logs that cephadm produces.
> I searched for them on both the source (mira010) and the target (mira027)
> machines in /var/log/ceph/<fsid>/* and couldn't find any print from either
> the cephadm mgr module nor the cephadm script. I also looked at /var/log/*.
> Where are they hiding?
The ceph-volume.log is the one to look at.
> 5. After ceph-volume creates its LVs, the host's
> lvdisplay/vgdisplay/pvdisplay showed nothing. I had to run "pvscan --cache"
> on the host in order for those commands to output the current state. This
> may confuse the user.
>
> 6. I think it's also a good idea to have another cephadm feature "cephadm
> shell --host=<host>" to open a cephadm shell on a remote host. I wanted to
> run "ceph-volume lvm zap" on one of the remote hosts and to do that I sshed
> over, copied the cephadm script and ran "cephadm shell". it would be cool
> if we could do that from the original machine.
The cephadm script doesn't know how to ssh. We could probably teach it,
though, for something like this... but it might be simpler for the
user to just 'scp cephadm $host:', as that's basically what cephadm would
do to "install" itself remotely?
sage