On Fri, 2019-08-16 at 14:12 +0200, Jonas Jelten wrote:
Hi!
I've missed your previous post, but we do have inline_data enabled on
our cluster. We've not yet benchmarked, but the filesystem has a wide
variety of file sizes, and it sounded like a good idea to speed up
performance. We mount it with the kernel client only, and I've had the
subjective impression that latency was better once we enabled the
feature. Now that you say the kernel client has no write support for
it, my impression is probably wrong.
I think inline_data is a nice and easy way to improve performance when
the CephFS metadata are on SSDs but the bulk data is on HDDs. So I'd
vote against removal and would instead vouch for improvements of this
feature :)
If storage on the MDS is a problem, files could be stored on a
different (e.g. SSD) pool instead, and the file size limit and pool
selection could be configured via xattrs. And there was some idea to
store small objects not in the OSD block, but only in the OSD's DB
(which is more complicated to use than separate SSD-pool and HDD-pool,
but when block.db is on an SSD the speed would be better). Maybe this
could all be combined to have better small-file performance in CephFS!
The main problem is developer time and the maintenance burden this
feature represents. This is very much a non-trivial thing to implement.
Consider that the read() and write() codepaths in the kernel already
have 3 main branches each:
- buffered I/O (when Fcb caps are held)
- synchronous I/O (when Fcb caps are not held)
- O_DIRECT I/O
We could probably consolidate the O_DIRECT and sync I/O code somewhat,
but buffered is handled entirely differently. Once we mix in inline_data
support, we have to add a completely new branch for each of those cases,
effectively doubling the complexity.
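
To make the branch count concrete, here is a minimal sketch in Python.
It is not the kernel code: the names IOMode, select_mode and code_paths
are made up for illustration, and the assumption that O_DIRECT is
checked before the caps is a simplification.

```python
from enum import Enum
from itertools import product


class IOMode(Enum):
    BUFFERED = "buffered I/O (Fcb caps held)"
    SYNC = "synchronous I/O (Fcb caps not held)"
    DIRECT = "O_DIRECT I/O"


def select_mode(o_direct: bool, has_fcb_caps: bool) -> IOMode:
    """Pick one of the three main branches (simplified, hypothetical)."""
    if o_direct:
        return IOMode.DIRECT
    return IOMode.BUFFERED if has_fcb_caps else IOMode.SYNC


def code_paths(inline_supported: bool):
    """List every (I/O mode, data location) pair that needs handling."""
    locations = ["object-backed"]
    if inline_supported:
        # Inline data adds a second variant of every existing branch.
        locations.append("inline")
    return list(product(IOMode, locations))


if __name__ == "__main__":
    print(len(code_paths(inline_supported=False)), "paths without inline_data")
    print(len(code_paths(inline_supported=True)), "paths with inline_data")
```

This only counts read() or write() on its own; mmap'ed I/O and
copy_file_range would each add further cases.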
We'd also need to add similar handling for mmap'ed I/O and for things
like copy_file_range.
But, even before that...I have some real concerns about the existing
handling, even with a single client.
While I haven't attempted to roll a testcase for it, I think we can
probably hit races where multiple tasks handling write page faults can
compete to uninline the data, potentially clobbering the others' writes.
Again, this is non-trivial to fix.
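
The kind of interleaving I have in mind can be sketched with a toy
model. This is an assumption about how such a race could play out, not
a reproduction of the actual kernel codepath, and Inode, write_fault
and run are hypothetical names. Each task does an unlocked "read inline
data, copy it to the backing object, then apply my write" sequence, and
generators are used so the steps can be interleaved deterministically.

```python
class Inode:
    def __init__(self, inline: bytes):
        self.inline = inline  # data stored inline; None once uninlined
        self.obj = None       # backing object; None until uninlined


def write_fault(inode: Inode, offset: int, payload: bytes):
    """Toy write-fault handler with no locking around the uninline."""
    snapshot = inode.inline                  # step 1: read inline data
    yield
    if snapshot is not None:                 # step 2: uninline from snapshot
        inode.obj = bytearray(snapshot)
        inode.inline = None
    yield
    inode.obj[offset:offset + len(payload)] = payload  # step 3: do the write
    yield


def run(schedule: str) -> bytes:
    """Run tasks A and B one step at a time in the given order."""
    inode = Inode(b"....")
    tasks = {
        "A": write_fault(inode, 0, b"AA"),
        "B": write_fault(inode, 2, b"BB"),
    }
    for name in schedule:
        next(tasks[name])
    return bytes(inode.obj)


if __name__ == "__main__":
    print(run("AAABBB"))  # tasks run one after the other
    print(run("ABAABB"))  # both snapshot the inline data before either uninlines
```

In the second schedule, B uninlines from a stale snapshot after A has
already written, so A's write is lost. In this model, avoiding that
means making steps 1 and 2 atomic with respect to other writers.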
In summary I don't see a real future for this feature unless someone
wants to step up to own it and commit to fixing up these problems.
On 16/08/2019 13.15, Jeff Layton wrote:
> A couple of weeks ago, I sent a request to the mailing list asking
> whether anyone was using the inline_data support in cephfs:
>
> https://docs.ceph.com/docs/mimic/cephfs/experimental-features/#inline-data
>
> I got exactly zero responses, so I'm going to formally propose that we
> move to start deprecating this feature for Octopus.
>
> Why deprecate this feature?
> ===========================
> While the userland clients have support for both reading and writing,
> the kernel only has support for reading, and aggressively uninlines
> everything as soon as it needs to do any writing. That uninlining has
> some rather nasty potential race conditions too that could cause data
> corruption.
>
> We could work to fix this, and maybe add write support for the kernel,
> but it adds a lot of complexity to the read and write codepaths in the
> clients, which are already pretty complex. Given that there isn't a lot
> of interest in this feature, I think we ought to just pull the plug on
> it.
>
> How should we do this?
> ======================
> We should start by disabling this feature in master for Octopus.
>
> In particular, we should stop allowing users to call "fs set inline_data
> true" on filesystems where it's disabled, and maybe throw a loud warning
> about the feature being deprecated if the mds is started on a filesystem
> that has it enabled.
>
> We could also consider creating a utility to crawl an existing
> filesystem and uninline anything there, if there was need for it.
>
> Then, in a few release cycles, once we're past the point where someone
> can upgrade directly from Nautilus (release Q or R?) we'd rip out
> support for this feature entirely.
>
> Thoughts, comments, questions welcome.
>
--
Jeff Layton <jlayton(a)redhat.com>