On Thu, Mar 11, 2021 at 8:18 PM Patrick Donnelly
<pdonnell(a)redhat.com> wrote:
On Thu, Mar 11, 2021 at 8:15 AM Jeff Layton <jlayton(a)redhat.com> wrote:
tl;dr version: in cephfs, the MDS handles truncating object data when
inodes are truncated. This is problematic with fscrypt.
Longer version:
I've been working on a patchset to add fscrypt support to kcephfs, and
have hit a problem with the way that truncation is handled. The main
issue is that fscrypt uses block-based ciphers, so we must ensure that
we read and write complete crypto blocks on the OSDs.
I'm currently using 4k crypto blocks, but we may want to allow this to
be tunable eventually (though it will need to be smaller than and align
with the OSD object size). For simplicity's sake, I'm planning to
disallow custom layouts on encrypted inodes. We could consider adding
that later (but it doesn't sound likely to be worthwhile).
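
To make the alignment constraint concrete, here is a minimal sketch of widening an I/O range so that it covers only complete crypto blocks. The function names and the `CRYPTO_BLOCK_SIZE` constant are hypothetical and not taken from the patchset; only the 4k block size and the "smaller than and aligned with the object size" rule come from the text above.

```python
CRYPTO_BLOCK_SIZE = 4096  # 4k crypto blocks, as described above (name is hypothetical)


def align_to_crypto_blocks(offset, length, block_size=CRYPTO_BLOCK_SIZE):
    """Expand [offset, offset + length) so it covers whole crypto blocks.

    Returns (aligned_offset, aligned_length).
    """
    start = (offset // block_size) * block_size          # round start down
    end = -(-(offset + length) // block_size) * block_size  # round end up
    return start, end - start


def layout_allows_block_size(object_size, block_size=CRYPTO_BLOCK_SIZE):
    """Check the stated constraint: the crypto block must be smaller than
    the OSD object size and divide it evenly."""
    return block_size < object_size and object_size % block_size == 0
```

For example, a 100-byte access at offset 5000 would have to be widened to the whole block starting at offset 4096.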
Normally, when a file is truncated (usually via a SETATTR MDS call), the
MDS handles truncating or deleting objects on the OSDs. This is done
somewhat lazily in that the MDS replies to the client before this
process is complete (AFAICT).
So I've done some more research on this and it's not that simple.
Broadly, a truncate causes the following to happen:
- Revoke all write caps (but not Fcb) from clients.
- Journal the truncate operation.
- Respond with unsafe reply.
- After the setattr is journalled, regrant Fs with the new file size,
truncate_seq, and truncate_size.
- Issue a trunc cap update with the new file size, truncate_seq, and
truncate_size (looks redundant with the prior step).
- Actually start truncating objects above the file size; concurrently
grant all Fwb... caps wanted by the client.
- Reply safe.
From what I can tell, the clients use the truncate_seq/truncate_size
to avoid writing to data that the MDS plans to truncate. I haven't
really dug into how that works. Maybe someone more familiar with that
code can chime in.
So the MDS seems to truncate/delete objects lazily in the background
but it does so safely and consistently.
Right; it's lazy in that it's not done immediately in a blocking
manner, but it's absolutely safe. Truncate seq and size are also
fields you can send to the OSD on read or write operations, and the
client includes them on every op. It just has to do a (reasonably)
simple conversion from the total truncate size the MDS gives it to
what that means for the object being accessed (based on the striping
pattern and object number).
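
As an illustration of that conversion, here is a rough sketch that maps a file-level truncate size to the truncate size for one object. It assumes the usual striping parameters (stripe unit, stripe count, object size); the function and variable names are mine and it is not copied from the client code, so treat it as a model of the idea rather than the actual implementation.

```python
def object_truncate_size(stripe_unit, stripe_count, object_size,
                         objectno, trunc_size):
    """Return how many bytes of object `objectno` lie below `trunc_size`.

    Assumes object_size is a multiple of stripe_unit.
    """
    if trunc_size == 0:
        return 0

    stripes_per_object = object_size // stripe_unit

    # An object set is a group of stripe_count objects filled together.
    objectsetno = objectno // stripe_count
    trunc_objectsetno = trunc_size // object_size // stripe_count

    if objectsetno > trunc_objectsetno:
        return 0                # object lies entirely past the truncate point
    if objectsetno < trunc_objectsetno:
        return object_size      # object lies entirely before it

    # The truncate point falls inside this object set.
    trunc_blockno = trunc_size // stripe_unit       # stripe unit holding the cut
    trunc_stripeno = trunc_blockno // stripe_count  # row of stripe units
    trunc_stripepos = trunc_blockno % stripe_count  # column (object within set)
    trunc_objectno = trunc_objectsetno * stripe_count + trunc_stripepos

    full_rows = trunc_stripeno % stripes_per_object
    if objectno < trunc_objectno:
        # Objects earlier in the row already hold their unit for this row.
        return (full_rows + 1) * stripe_unit
    if objectno > trunc_objectno:
        # Objects later in the row hold only the complete rows before it.
        return full_rows * stripe_unit
    # The object containing the cut: complete rows plus the partial unit.
    return full_rows * stripe_unit + trunc_size % stripe_unit
```

With a simple layout (stripe unit equal to object size, stripe count of 1) this reduces to clamping `trunc_size - objectno * object_size` into the range `[0, object_size]`.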
I'll try and think a bit more on how to handle the special extra size
for encryption.
...although in my current sleep-addled state, I'm actually not sure we
need to add any permanent storage to the MDS to handle this case! We
can probably just extend the front-end truncate op so that it can take
a separate "real-truncate-size" and the logical file size, can't we?
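
A very rough sketch of what such an extended truncate request might carry. Everything here is hypothetical: the `TruncateRequest` type, its field names, and in particular the assumption that the "real" truncate size is the logical size rounded up to the next crypto block boundary so that the last block stays complete. None of this is an existing MDS or client interface.

```python
from dataclasses import dataclass


@dataclass
class TruncateRequest:
    """Hypothetical request carrying both sizes discussed above."""
    logical_size: int        # file size as seen by applications
    real_truncate_size: int  # size the objects are actually truncated to


def make_truncate_request(new_size, crypto_block_size=4096):
    # Assumption: keep the final crypto block whole by rounding up.
    real = -(-new_size // crypto_block_size) * crypto_block_size
    return TruncateRequest(logical_size=new_size, real_truncate_size=real)
```

Under that assumption, truncating to 5000 bytes with 4k blocks would report a logical size of 5000 while leaving 8192 bytes of object data in place.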