On Wed, Apr 26, 2023 at 11:50 PM Sam Just <sjust@redhat.com> wrote:
This came up again at the dev summit at Cephalocon, so I figure it's
worth reviving this thread.
First, I'll try to recap the situation (Ilya, feel free to correct me
here). My understanding of the issue is that rbd has features (most
notably encryption) which depend on the librados SPARSE_READ operation
accurately reflecting, at 4k granularity, which ranges have been
written or trimmed. This appears to work correctly on replicated pools
on bluestore, but erasure-coded pools always return the full object
contents up to the object size, including regions the client has never
written to.
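A minimal reproducer sketch of the behavior described above (pool and
object names are placeholders, error handling elided). On a replicated
bluestore pool the extent map comes back with two 4k extents; on an EC
pool it covers the whole 68k range, hole included:

    #include <rados/librados.hpp>
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>

    int main() {
      librados::Rados cluster;
      cluster.init2("client.admin", "ceph", 0);
      cluster.conf_read_file(nullptr);          // default ceph.conf search
      cluster.connect();

      librados::IoCtx io;
      cluster.ioctx_create("testpool", io);     // compare replicated vs. EC

      librados::bufferlist data;
      data.append(std::string(4096, 'x'));
      io.write("obj", data, 4096, 0);           // 4k at offset 0
      io.write("obj", data, 4096, 64 * 1024);   // 4k at 64k, hole between

      std::map<uint64_t, uint64_t> extents;     // offset -> length
      librados::bufferlist out;
      io.sparse_read("obj", extents, out, 128 * 1024, 0);
      for (const auto& [off, len] : extents)
        std::cout << off << "~" << len << std::endl;

      cluster.shutdown();
      return 0;
    }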
Hi Sam,
As Jeff said in another email, fscrypt support in kcephfs has a hard
dependency on accurate allocation information. librbd wants to grow
a similar dependency to enhance its built-in LUKS encryption support
(currently, reads from unallocated areas on encrypted images are handled
inconsistently: if the underlying object doesn't exist, zeroes are
returned; if it does exist, we are at the mercy of sparse-read behavior
and can return random garbage obtained by decrypting zeroes).
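To make "random garbage obtained by decrypting zeroes" concrete, here
is a standalone illustration (plain OpenSSL, not librbd code; key and
tweak are toy values): pushing an all-zero sector through AES-256-XTS
decryption yields pseudorandom bytes rather than zeroes, which is what
the reader ends up with when sparse-read papers over an unwritten
range with zeroes:

    #include <openssl/evp.h>
    #include <cstdio>

    int main() {
      unsigned char key[64];
      for (int i = 0; i < 64; i++)
        key[i] = i;                       // toy key; XTS halves must differ
      unsigned char iv[16] = {0};         // sector tweak
      unsigned char zeroes[512] = {0};
      unsigned char plain[512 + 16];
      int len = 0;

      EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();
      EVP_DecryptInit_ex(ctx, EVP_aes_256_xts(), nullptr, key, iv);
      EVP_DecryptUpdate(ctx, plain, &len, zeroes, sizeof(zeroes));
      for (int i = 0; i < 16; i++)        // first 16 of 512 garbage bytes
        printf("%02x", plain[i]);
      printf("...\n");
      EVP_CIPHER_CTX_free(ctx);
      return 0;
    }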
I don't think this was originally a guarantee of the interface. I
think the original guarantee was simply that SPARSE_READ would return
all non-zero regions, not that it would never return unwritten or
trimmed regions. The OSD does not track this state above the
ObjectStore layer -- SPARSE_READ and MAPEXT both rely directly on
ObjectStore::fiemap. MAPEXT actually returns -ENOTSUPP on
erasure-coded pools.
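For reference, the hook in question looks roughly like this
(paraphrased from os/ObjectStore.h; exact signatures vary a bit across
releases):

    // Both SPARSE_READ and MAPEXT funnel into this; it returns a map
    // of offset -> length for the extents the store considers
    // allocated.
    virtual int fiemap(CollectionHandle& c, const ghobject_t& oid,
                       uint64_t offset, size_t len,
                       std::map<uint64_t, uint64_t>& destmap) = 0;

Codifying the guarantee would mean: destmap never includes a 4k block
the client has not written or has trimmed.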
Adam: the observed behavior is that fiemap on bluestore does
accurately reflect the client's written extents at a 4k granularity.
Is that reliable, or is it a property of only some bluestore
configurations?
As it appears desirable that we actually guarantee this, we probably
want to do two things:
1) codify this guarantee in the ObjectStore interface (4k in all
cases?), and ensure that all configurations satisfy it going forward
(including seastore)
2) update the EC implementation to track allocation at the granularity
of an EC stripe. HashInfo is probably the natural place to put the
information? We'll also need to implement ZERO. A rough sketch of
what this could look like follows below. Radek: I know you're looking
into EC for crimson, perhaps you can evaluate how much work would be
required here?
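A rough sketch of what (2) could look like (my names, not actual Ceph
code; assumes interval_set's usual insert/subtract semantics), hanging
a stripe-granularity allocation record off HashInfo or a sibling
structure:

    #include "include/interval_set.h"   // Ceph's interval_set<T>

    struct stripe_allocation_t {
      uint64_t stripe_width;            // from the EC profile
      interval_set<uint64_t> written;   // client-logical, stripe-aligned

      void record_write(uint64_t off, uint64_t len) {
        uint64_t start = off - off % stripe_width;
        uint64_t end = (off + len + stripe_width - 1)
                       / stripe_width * stripe_width;
        written.union_insert(start, end - start);  // round out, merge
      }

      // ZERO: conservatively drop only stripes that are fully zeroed.
      void record_zero(uint64_t off, uint64_t len) {
        uint64_t start = (off + stripe_width - 1)
                         / stripe_width * stripe_width;
        uint64_t end = (off + len) / stripe_width * stripe_width;
        if (end <= start)
          return;
        interval_set<uint64_t> z;
        z.insert(start, end - start);
        z.intersection_of(written);     // only subtract what we track
        written.subtract(z);
      }
    };

SPARSE_READ on EC would then answer from the written set instead of
consulting fiemap on the shards.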
The EC stripe referred to here is configurable on a per-pool basis,
with the default taken from osd_pool_erasure_code_stripe_unit, right?
If the user configures it to e.g. 16k for a particular pool (EC
profile), how would that interact with the 4k guarantee at the
ObjectStore layer?
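To make the question concrete (numbers and helper are mine): with a
16k stripe unit, the finest granularity stripe-level tracking could
report back is one stripe unit, regardless of any 4k guarantee at the
ObjectStore layer on each shard:

    #include <cstdint>
    #include <utility>

    constexpr uint64_t stripe_unit = 16 * 1024;   // from the EC profile

    // Round a client extent out to stripe-unit boundaries -- the best
    // answer stripe-granularity allocation tracking could give.
    std::pair<uint64_t, uint64_t> report_granularity(uint64_t off,
                                                     uint64_t len) {
      uint64_t start = off / stripe_unit * stripe_unit;
      uint64_t end = (off + len + stripe_unit - 1)
                     / stripe_unit * stripe_unit;
      return {start, end - start};
    }

    // e.g. a 4k client write at offset 0 would be reported as
    // allocated over [0, 16k), not [0, 4k).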
Thanks,
Ilya