Hi Stefan,
On Wed, Sep 20, 2023 at 11:00:12AM +0200, Stefan Kooman wrote:
> On 19-09-2023 13:35, Tim Bishop wrote:
> > The Ceph cluster is running Pacific 16.2.13 on Ubuntu 20.04. Almost
> > all clients are working fine, with the exception of our backup
> > server. This is using the kernel CephFS client on Ubuntu 22.04 with
> > kernel 6.2.0 [1] (so a newer Ceph client version, I suspect).
> >
> > The backup server has multiple (12) CephFS mount points. One of
> > them, the busiest, regularly causes this error on the cluster:
> >
> > HEALTH_WARN 1 clients failing to respond to capability release
> > [WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
> >     mds.mds-server(mds.0): Client backupserver:cephfs-backupserver failing to respond to capability release client_id: 521306112
> >
> > And occasionally, possibly unrelated but occurring at the same time:
> >
> > [WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
> >     mds.mds-server(mds.0): 1 slow requests are blocked > 30 secs
> >
> > The second one clears itself, but the first sticks until I can
> > unmount the filesystem on the client after the backup completes.
> You are not alone. We also have a backup server running 22.04 and 6.2
> and occasionally hit this issue. We hit this with mainly 5.12.19
> clients and a 6.2 backup server. We're on 16.2.11.
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Sidenote:
> For those of you who are wondering why you would want to use the
> latest (greatest?) linux kernel for CephFS ... this is why. To try to
> get rid of 1) slow requests because of some deadlock / locking issue
> and 2) clients failing to release capabilities, and 3) to pick up bug
> fixes / improvements (thx devs!).
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Questions:
>
> Do you have the filesystem mounted read-only, and have you given the
> backup server's CephFS client read-only caps on the MDS?
Yes, mounted read-only and the caps for the client are read-only for the
MDS.
I do have multiple mounts from the same CephFS filesystem though, and
I've been wondering if that could be causing more parallel requests from
the backup server. I'd been thinking about doing it through a single
mount, but then all the paths change which doesn't make the backups
overly happy.
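For reference, this is roughly how such a read-only setup can look. A
minimal sketch, assuming a filesystem named "cephfs" and a hypothetical
client.backup; the paths, mon addresses, and names here are
illustrative, not our real ones:

```shell
# Grant read-only MDS/OSD caps on the whole filesystem "cephfs" to a
# hypothetical client.backup (run somewhere with admin credentials):
ceph fs authorize cephfs client.backup / r

# Mount one backup path read-only with the kernel client; mon
# addresses and the secretfile path are placeholders:
mount -t ceph mon1,mon2,mon3:/some/path /mnt/backup1 \
    -o name=backup,secretfile=/etc/ceph/backup.secret,ro
```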
> Are you running a multiple active MDS setup?
No. We tried it for a while but after seeing some issues like this we
backtracked to a single active MDS to rule out multiple active being the
issue.
> > It appears that whilst it's in this stuck state there may be one or
> > more directory trees that are inaccessible to all clients. The
> > backup server is walking the whole tree but never gets stuck itself,
> > so either the inaccessible directory entry appears after it has gone
> > past, or it's not affected. Maybe the backup server is holding caps
> > on a directory when it shouldn't?
> We have seen both cases, yet most of the time the backup server would
> not be able to make progress and would be stuck on a file.
Interesting. Backups have never got stuck for us, whilst we regularly,
pretty much daily, see the above-mentioned error. But because nothing
we're directly running gets stuck, I only find out that a directory
somewhere is inaccessible if a user reports it to us from one of our
other client machines, usually an HPC node.
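In case it helps anyone else hitting this: while the warning is
active, the offending session can usually be matched up via the MDS
admin socket, rather than waiting for a user report. The daemon name
below is a placeholder:

```shell
# List client sessions with their cap counts; the client id from the
# health warning (e.g. 521306112) should show up here:
ceph daemon mds.mds-server session ls

# The cluster-wide view of the same warning:
ceph health detail
```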
> > It may be that an upgrade to Quincy resolves this, since it's more
> > likely to be in line, version-wise, with the kernel client, but I
> > don't want to knee-jerk upgrade just to try and fix this problem.
> We are testing with 6.5 kernel clients (see other recent threads
> about this). We have not seen this issue there (but time will tell;
> it does not happen *that* often, and we have hit other issues).
>
> The MDS server itself is indeed older than the newer kernel clients.
> It might certainly be a factor. And that raises the question of what
> kind of interoperability / compatibility tests (if any) are done
> between CephFS (kernel) clients and MDS server versions. This might
> be a good "focus topic" for a Ceph User + Dev meeting ...
> > Thanks for any advice.
> You might want to try a 6.5.x kernel on the clients. But you might
> run into other issues. Not sure about that; they might only be
> relevant for one of our workloads. Only one way to find out ...
I've been sticking with what's available in Ubuntu - the 6.2 kernel is
part of their HWE stack, which is handy. It won't be long until 23.10
is out with the 6.5 kernel though. I'll definitely give it a try then.
> Enable debug logging on the MDS to gather logs that might shine some
> light on what is happening with that request.
>
> "ceph daemon mds.name dump_ops_in_flight" might help here to get the
> client id and request.
I've done both of these in the past, but I should look again (of
course, it's not broken right now!). From what I recall there was
nothing unusual-looking about the request, and nothing that led me
anywhere useful when Googling and searching list archives and bug
reports.
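For anyone following along, what I ran was along these lines. The
daemon name is a placeholder and the debug level is just an example:

```shell
# Show requests currently stuck in the MDS, with their client ids:
ceph daemon mds.mds-server dump_ops_in_flight

# Temporarily raise MDS debug logging (20 is very verbose):
ceph daemon mds.mds-server config set debug_mds 20

# ... reproduce the problem, then drop back to the default:
ceph daemon mds.mds-server config set debug_mds 1/5
```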
> Another thing that you might do is to dump the cache on the MDS to
> gather more info. This however is highly dependent on the amount of
> RAM the MDS is using. In the past, dumping the cache would kill the
> MDS for us (it became unresponsive and was replaced by the
> standby-replay). Improvements to prevent that have been made ... but
> we have not tried it since. See this thread [1]. What
> mds_cache_memory_limit have you set? Make sure you have enough disk
> space to store the dump file. To actually make sense of that dump
> file / debug logging you should understand _exactly_ how the caps
> mechanism works, and see if it is violated somewhere ... and then
> look in the code to see why. Short of that knowledge, the CephFS
> developers might help out.
I did try that before too, but ran out of disk space. Current MDS
memory usage is around 16GB, with a cache limit of 8GB. This is likely
straying beyond my understanding of Ceph though.
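For completeness, the cache dump attempt looked roughly like this (the
daemon name and output path are examples; the dump can approach the
size of the in-memory cache, hence the disk space problem):

```shell
# Check the configured cache limit (ours is 8GiB):
ceph daemon mds.mds-server config get mds_cache_memory_limit

# Dump the MDS cache to a file on the MDS host; needs plenty of free
# disk space:
ceph daemon mds.mds-server dump cache /var/log/ceph/mds-cache.dump
```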
Thanks for your advice. It looks like giving a newer kernel a try is
something to consider. We'll also need to be looking at Quincy soon
anyway, so that might mix things up a bit too. It's just about
manageable at the moment, but needs more hand-holding than I'd really
like to be giving.
Tim.
--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55