Hi Stefan,
On Wed, Sep 20, 2023 at 11:00:12AM +0200, Stefan Kooman wrote:
> On 19-09-2023 13:35, Tim Bishop wrote:
> > The Ceph cluster is running Pacific 16.2.13 on Ubuntu 20.04. Almost
> > all clients are working fine, with the exception of our backup
> > server. This is using the kernel CephFS client on Ubuntu 22.04 with
> > kernel 6.2.0 [1] (so a newer Ceph client version, I suspect).
> >
> > The backup server has multiple (12) CephFS mount points. One of
> > them, the busiest, regularly causes this error on the cluster:
> >
> > HEALTH_WARN 1 clients failing to respond to capability release
> > [WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
> >     mds.mds-server(mds.0): Client backupserver:cephfs-backupserver failing to respond to capability release client_id: 521306112
> >
> > And occasionally, possibly unrelated but occurring at the same time:
> >
> > [WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
> >     mds.mds-server(mds.0): 1 slow requests are blocked > 30 secs
> >
> > The second one clears itself, but the first sticks until I can
> > unmount the filesystem on the client after the backup completes.
> You are not alone. We also have a backup server running 22.04 and 6.2
> and occasionally hit this issue. We hit this with mainly 5.12.19
> clients and a 6.2 backup server. We're on 16.2.11.
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Sidenote:
> For those of you who are wondering why you would want to use the
> latest (greatest?) linux kernel for CephFS ... this is why. To try to
> get rid of 1) slow requests because of some deadlock / locking issue
> and 2) clients failing to release capabilities, and 3) to pick up bug
> fixes / improvements (thx devs!).
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Questions:
>
> Do you have the filesystem mounted read-only, and have you given the
> backup server's CephFS client read-only caps on the MDS?
Yes, mounted read-only and the caps for the client are read-only for the
MDS.
I do have multiple mounts from the same CephFS filesystem though, and
I've been wondering if that could be causing more parallel requests from
the backup server. I'd been thinking about doing it through a single
mount, but then all the paths change which doesn't make the backups
overly happy.
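For reference, this is roughly how such a read-only setup can look. A
minimal sketch, assuming a filesystem named "cephfs" and a hypothetical
client.backup; the paths, mon addresses, and names here are
illustrative, not our real ones:

```shell
# Grant read-only MDS/OSD caps on the whole filesystem "cephfs" to a
# hypothetical client.backup (run somewhere with admin credentials):
ceph fs authorize cephfs client.backup / r

# Mount one backup path read-only with the kernel client; mon
# addresses and the secretfile path are placeholders:
mount -t ceph mon1,mon2,mon3:/some/path /mnt/backup1 \
    -o name=backup,secretfile=/etc/ceph/backup.secret,ro
```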
> Are you running a multiple active MDS setup?
No. We tried it for a while but after seeing some issues like this we
backtracked to a single active MDS to rule out multiple active being the
issue.
> > It appears that whilst it's in this stuck state there may be one or
> > more directory trees that are inaccessible to all clients. The
> > backup server is walking the whole tree but never gets stuck itself,
> > so either the inaccessible directory entry appears after it has gone
> > past, or it's not affected. Maybe the backup server is holding caps
> > on a directory when it shouldn't?
> We have seen both cases, yet most of the time the backup server would
> not be able to make progress and would be stuck on a file.
Interesting. Backups have never got stuck for us, whilst we regularly,
pretty much daily, see the above-mentioned error. But because nothing
we're directly running gets stuck, I only find out that a directory
somewhere is inaccessible if a user reports it to us from one of our
other client machines, usually an HPC node.
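In case it helps anyone else hitting this: while the warning is
active, the offending session can usually be matched up via the MDS
admin socket, rather than waiting for a user report. The daemon name
below is a placeholder:

```shell
# List client sessions with their cap counts; the client id from the
# health warning (e.g. 521306112) should show up here:
ceph daemon mds.mds-server session ls

# The cluster-wide view of the same warning:
ceph health detail
```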
> > It may be that an upgrade to Quincy resolves this, since it's more
> > likely to be in line, version-wise, with the kernel client, but I
> > don't want to knee-jerk upgrade just to try and fix this problem.
> We are testing with 6.5 kernel clients (see other recent threads
> about this). We have not seen this issue there (but time will tell;
> it does not happen *that* often, and we have hit other issues).
>
> The MDS server itself is indeed older than the newer kernel clients.
> It might certainly be a factor. And that raises the question of what
> kind of interoperability / compatibility tests (if any) are done
> between CephFS (kernel) clients and MDS server versions. This might
> be a good "focus topic" for a Ceph User + Dev meeting ...
> > Thanks for any advice.
> You might want to try a 6.5.x kernel on the clients. But you might
> run into other issues. Not sure about that; they might only be
> relevant for one of our workloads. Only one way to find out ...
I've been sticking with what's available in Ubuntu - the 6.2 kernel is
part of their HWE stack, which is handy. It won't be long until 23.10
is out with the 6.5 kernel though. I'll definitely give it a try then.
> Enable debug logging on the MDS to gather logs that might shine some
> light on what is happening with that request.
>
> "ceph daemon mds.name dump_ops_in_flight" might help here to get the
> client id and request.
I've done both of these in the past, but I should look again (of
course, it's not broken right now!). From what I recall there was
nothing unusual-looking about the request, and nothing that led me
anywhere useful when Googling and searching list archives and bug
reports.
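For anyone following along, what I ran was along these lines. The
daemon name is a placeholder and the debug level is just an example:

```shell
# Show requests currently stuck in the MDS, with their client ids:
ceph daemon mds.mds-server dump_ops_in_flight

# Temporarily raise MDS debug logging (20 is very verbose):
ceph daemon mds.mds-server config set debug_mds 20

# ... reproduce the problem, then drop back to the default:
ceph daemon mds.mds-server config set debug_mds 1/5
```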
> Another thing that you might do is to dump the cache on the MDS to
> gather more info. This however is highly dependent on the amount of
> RAM the MDS is using. In the past, dumping the cache would kill the
> MDS for us (it became unresponsive and was replaced by the
> standby-replay). Improvements to prevent that have been made ... but
> we have not tried it since. See this thread [1]. What
> mds_cache_memory_limit have you set? Make sure you have enough disk
> space to store the dump file. To actually make sense of that dump
> file / debug logging you should understand _exactly_ how the caps
> mechanism works, and see if it is violated somewhere ... and then
> look in the code to see why. Short of that knowledge, the CephFS
> developers might help out.
I did try that before too, but ran out of disk space. Current MDS
memory usage is around 16GB, with a cache limit of 8GB. This is likely
straying beyond my understanding of Ceph though.
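For completeness, the cache dump attempt looked roughly like this (the
daemon name and output path are examples; the dump can approach the
size of the in-memory cache, hence the disk space problem):

```shell
# Check the configured cache limit (ours is 8GiB):
ceph daemon mds.mds-server config get mds_cache_memory_limit

# Dump the MDS cache to a file on the MDS host; needs plenty of free
# disk space:
ceph daemon mds.mds-server dump cache /var/log/ceph/mds-cache.dump
```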
Thanks for your advice. It looks like giving a newer kernel a try is
something to consider. We'll also need to be looking at Quincy soon
anyway, so that might mix things up a bit too. It's just about
manageable at the moment, but needs more hand-holding than I'd really
like to be giving.
Tim.
--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55