On Wed, Apr 26, 2023 at 11:50 PM Sam Just <sjust@redhat.com> wrote:
This came up again at the dev summit at Cephalocon, so I figure it's
worth reviving this thread.
First, I'll try to recap the situation (Ilya, feel free to correct me
here). My understanding of the issue is that rbd has features (most
notably encryption) which depend on the librados SPARSE_READ operation
accurately reflecting, at 4k granularity, which ranges have been
written or trimmed. This appears to work correctly on replicated pools
on bluestore, but erasure-coded pools always return the full object
contents up to the object size, including regions the client has never
written to.
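A minimal reproducer sketch of the behavior described above (pool and
object names are placeholders, error handling elided). On a replicated
bluestore pool the extent map comes back with two 4k extents; on an EC
pool it covers the whole 68k range, hole included:

    #include <rados/librados.hpp>
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>

    int main() {
      librados::Rados cluster;
      cluster.init2("client.admin", "ceph", 0);
      cluster.conf_read_file(nullptr);          // default ceph.conf search
      cluster.connect();

      librados::IoCtx io;
      cluster.ioctx_create("testpool", io);     // compare replicated vs. EC

      librados::bufferlist data;
      data.append(std::string(4096, 'x'));
      io.write("obj", data, 4096, 0);           // 4k at offset 0
      io.write("obj", data, 4096, 64 * 1024);   // 4k at 64k, hole between

      std::map<uint64_t, uint64_t> extents;     // offset -> length
      librados::bufferlist out;
      io.sparse_read("obj", extents, out, 128 * 1024, 0);
      for (const auto& [off, len] : extents)
        std::cout << off << "~" << len << std::endl;

      cluster.shutdown();
      return 0;
    }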
Hi Sam,
As Jeff said in another email, fscrypt support in kcephfs has a hard
dependency on accurate allocation information. librbd wants to grow
a similar dependency to enhance its built-in LUKS encryption support
(currently, reads from unallocated areas on encrypted images are handled
inconsistently: if the underlying object doesn't exist, zeroes are
returned; if it does exist, we are at the mercy of sparse-read behavior
and can return random garbage obtained by decrypting zeroes).
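To make "random garbage obtained by decrypting zeroes" concrete, here
is a standalone illustration (plain OpenSSL, not librbd code; key and
tweak are toy values): pushing an all-zero sector through AES-256-XTS
decryption yields pseudorandom bytes rather than zeroes, which is what
the reader ends up with when sparse-read papers over an unwritten
range with zeroes:

    #include <openssl/evp.h>
    #include <cstdio>

    int main() {
      unsigned char key[64];
      for (int i = 0; i < 64; i++)
        key[i] = i;                       // toy key; XTS halves must differ
      unsigned char iv[16] = {0};         // sector tweak
      unsigned char zeroes[512] = {0};
      unsigned char plain[512 + 16];
      int len = 0;

      EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();
      EVP_DecryptInit_ex(ctx, EVP_aes_256_xts(), nullptr, key, iv);
      EVP_DecryptUpdate(ctx, plain, &len, zeroes, sizeof(zeroes));
      for (int i = 0; i < 16; i++)        // first 16 of 512 garbage bytes
        printf("%02x", plain[i]);
      printf("...\n");
      EVP_CIPHER_CTX_free(ctx);
      return 0;
    }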
I don't think this was originally a guarantee of the interface. I
think the original guarantee was simply that SPARSE_READ would return
all non-zero regions, not that it would never return unwritten or
trimmed regions. The OSD does not track this state above the
ObjectStore layer -- SPARSE_READ and MAPEXT both rely directly on
ObjectStore::fiemap. MAPEXT actually returns -ENOTSUPP on
erasure-coded pools.
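For reference, the hook in question looks roughly like this
(paraphrased from os/ObjectStore.h; exact signatures vary a bit across
releases):

    // Both SPARSE_READ and MAPEXT funnel into this; it returns a map
    // of offset -> length for the extents the store considers
    // allocated.
    virtual int fiemap(CollectionHandle& c, const ghobject_t& oid,
                       uint64_t offset, size_t len,
                       std::map<uint64_t, uint64_t>& destmap) = 0;

Codifying the guarantee would mean: destmap never includes a 4k block
the client has not written or has trimmed.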
Adam: the observed behavior is that fiemap on bluestore does
accurately reflect the client's written extents at a 4k granularity.
Is that reliable, or is it a property of only some bluestore
configurations?
As it appears desirable that we actually guarantee this, we probably
want to do two things:
1) codify this guarantee in the ObjectStore interface (4k in all
cases?), and ensure that all configurations satisfy it going forward
(including seastore)
2) update the EC implementation to track allocation at the granularity
of an EC stripe. HashInfo is probably the natural place to put the
information? We'll also need to implement ZERO. A rough sketch of
what this could look like follows below. Radek: I know you're looking
into EC for crimson, perhaps you can evaluate how much work would be
required here?
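A rough sketch of what (2) could look like (my names, not actual Ceph
code; assumes interval_set's usual insert/subtract semantics), hanging
a stripe-granularity allocation record off HashInfo or a sibling
structure:

    #include "include/interval_set.h"   // Ceph's interval_set<T>

    struct stripe_allocation_t {
      uint64_t stripe_width;            // from the EC profile
      interval_set<uint64_t> written;   // client-logical, stripe-aligned

      void record_write(uint64_t off, uint64_t len) {
        uint64_t start = off - off % stripe_width;
        uint64_t end = (off + len + stripe_width - 1)
                       / stripe_width * stripe_width;
        written.union_insert(start, end - start);  // round out, merge
      }

      // ZERO: conservatively drop only stripes that are fully zeroed.
      void record_zero(uint64_t off, uint64_t len) {
        uint64_t start = (off + stripe_width - 1)
                         / stripe_width * stripe_width;
        uint64_t end = (off + len) / stripe_width * stripe_width;
        if (end <= start)
          return;
        interval_set<uint64_t> z;
        z.insert(start, end - start);
        z.intersection_of(written);     // only subtract what we track
        written.subtract(z);
      }
    };

SPARSE_READ on EC would then answer from the written set instead of
consulting fiemap on the shards.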
The EC stripe referred to here is configurable on a per-pool basis,
with the default taken from osd_pool_erasure_code_stripe_unit, right?
If the user configures it to e.g. 16k for a particular pool (EC
profile), how would that interact with the 4k guarantee at the
ObjectStore layer?
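To make the question concrete (numbers and helper are mine): with a
16k stripe unit, the finest granularity stripe-level tracking could
report back is one stripe unit, regardless of any 4k guarantee at the
ObjectStore layer on each shard:

    #include <cstdint>
    #include <utility>

    constexpr uint64_t stripe_unit = 16 * 1024;   // from the EC profile

    // Round a client extent out to stripe-unit boundaries -- the best
    // answer stripe-granularity allocation tracking could give.
    std::pair<uint64_t, uint64_t> report_granularity(uint64_t off,
                                                     uint64_t len) {
      uint64_t start = off / stripe_unit * stripe_unit;
      uint64_t end = (off + len + stripe_unit - 1)
                     / stripe_unit * stripe_unit;
      return {start, end - start};
    }

    // e.g. a 4k client write at offset 0 would be reported as
    // allocated over [0, 16k), not [0, 4k).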
Thanks,
Ilya