client isn't responding to mclientcaps(revoke), pending pAsLsXsFsc issued pAsLsXsFsc - ceph-users

Loic Tortay

1 May 1 May

2:05 p.m.

On 01/05/2023 11:35, Frank Schilder wrote:

Hi all, I think we might be hitting a known problem (https://tracker.ceph.com/issues/57244). I don't want to fail the mds yet, because we have troubles with older kclients that miss the mds restart and hold on to cache entries referring to the killed instance, leading to hanging jobs on our HPC cluster. I have seen this issue before and there was a process in D-state that dead-locked itself. Usually, killing this process succeeded and resolved the issue. However, this time I can't find such a process. The tracker mentions that one can delete the file/folder. I have the inode number, but really don't want to start a find on a 1.5PB file system. Is there a better way to find what path is causing the issue (ask the MDS directly, look at a cache dump, or similar)? Is there an alternative to deletion or MDS fail?

Hello, If you have the inode number, you can retrieve the name with something like: rados getxattr -p $POOL ${ino}.00000000 parent | \ ceph-dencoder type inode_backtrace_t import - decode dump_json | \ jq -M '[.ancestors[].dname]' | tr -d '[[",\]]' | \ awk 't!=""{t=$1 "/" t;}t==""{t=$1;}END{print t}' Where $POOL is the "default pool" name (for files) or the metadata pool name (for directories) and $ino is the inode number (in hexadecimal). Loïc. -- | Loīc Tortay <tortay(a)cc.in2p3.fr> - IN2P3 Computing Centre |

Reply

Frank Schilder

10:34 a.m.

Hi Arnaud, thanks, that's a good one. The inode in question should be in cache at this time. It actually accepts the hex-code given in the log message and is really fast. I hope I remember that for next time. Best regards, ================= Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________________________________________ From: MARTEL Arnaud <arnaud.martel(a)cea.fr> Sent: Tuesday, May 2, 2023 11:20 AM To: Frank Schilder; ceph-users(a)ceph.io Subject: Re: [ceph-users] Re: client isn't responding to mclientcaps(revoke), pending pAsLsXsFsc issued pAsLsXsFsc Hi, Or you can query the MDS(s) with: ceph tell mds.* dump inode <inode> 2>/dev/null | grep path for example: user@server:~$ ceph tell mds.* dump inode 1099836155033 2>/dev/null | grep path "path": "/ec42/default/joliot/gipsi/gpu_burn.sif", "stray_prior_path": "", Arnaud Le 01/05/2023 15:07, « Loic Tortay » <tortay(a)cc.in2p3.fr <mailto:tortay@cc.in2p3.fr>> a écrit : On 01/05/2023 11:35, Frank Schilder wrote:

...

Hi all, I think we might be hitting a known problem (https://tracker.ceph.com/issues/57244 <https://tracker.ceph.com/issues/57244>). I don't want to fail the mds yet, because we have troubles with older kclients that miss the mds restart and hold on to cache entries referring to the killed instance, leading to hanging jobs on our HPC cluster. I have seen this issue before and there was a process in D-state that dead-locked itself. Usually, killing this process succeeded and resolved the issue. However, this time I can't find such a process. The tracker mentions that one can delete the file/folder. I have the inode number, but really don't want to start a find on a 1.5PB file system. Is there a better way to find what path is causing the issue (ask the MDS directly, look at a cache dump, or similar)? Is there an alternative to deletion or MDS fail?

Hello, If you have the inode number, you can retrieve the name with something like: rados getxattr -p $POOL ${ino}.00000000 parent | \ ceph-dencoder type inode_backtrace_t import - decode dump_json | \ jq -M '[.ancestors[].dname]' | tr -d '[[",\]]' | \ awk 't!=""{t=$1 "/" t;}t==""{t=$1;}END{print t}' Where $POOL is the "default pool" name (for files) or the metadata pool name (for directories) and $ino is the inode number (in hexadecimal). Loïc. -- | Loīc Tortay <tortay(a)cc.in2p3.fr <mailto:tortay@cc.in2p3.fr>> - IN2P3 Computing Centre | _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io <mailto:ceph-users@ceph.io> To unsubscribe send an email to ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io>

Reply

Frank Schilder

9 May 9 May

9:23 a.m.

Dear Xiubo, both issues will cause problems, the one reported in the subject (https://tracker.ceph.com/issues/57244) and the potential follow-up on MDS restart (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/LYY7TBK63XP…). Either one will cause compute jobs on our HPC cluster to hang and users will need to run the jobs again. Our queues are full, so not very popular to loose your spot. The process in D-state is a user process. Interestingly it is often possible to kill it despite the D-state (if one can find the process) and the stuck recall gets resolved. If I restart the MDS, the stuck process might continue working, but we run a significant risk of other processed getting stuck due to the libceph/MDS wrong peer issue. We actually have these kind of messages [Mon Mar 6 12:56:46 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address [Mon Mar 6 13:05:18 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386 all over the HPC cluster and each of them means that some files/dirs are inaccessible on the compute node and jobs either died or are/got stuck there. Every MDS restart bears the risk of such events happening and with many nodes this probability approaches 1 - every time we restart an MDS jobs get stuck. I have a reproducer for an instance of https://tracker.ceph.com/issues/57244. Unfortunately, this is a big one that I would need to pack into a container. I was not able to reduce it to something small, it seems to depend on a very specific combination of codes with certain internal latencies between threads that trigger a race. It sounds like you have a patch for https://tracker.ceph.com/issues/57244 although its not linked from the tracker item. Best regards, ================= Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________________________________________ From: Xiubo Li <xiubli(a)redhat.com> Sent: Friday, May 5, 2023 2:40 AM To: Frank Schilder; ceph-users(a)ceph.io Subject: Re: [ceph-users] client isn't responding to mclientcaps(revoke), pending pAsLsXsFsc issued pAsLsXsFsc On 5/1/23 17:35, Frank Schilder wrote:

...

Hi all, I think we might be hitting a known problem (https://tracker.ceph.com/issues/57244). I don't want to fail the mds yet, because we have troubles with older kclients that miss the mds restart and hold on to cache entries referring to the killed instance, leading to hanging jobs on our HPC cluster.

Will this cause any issue in your case ?

...

I have seen this issue before and there was a process in D-state that dead-locked itself. Usually, killing this process succeeded and resolved the issue. However, this time I can't find such a process.

BTW, what's the D-state process ? A ceph one ? Thanks

...

The tracker mentions that one can delete the file/folder. I have the inode number, but really don't want to start a find on a 1.5PB file system. Is there a better way to find what path is causing the issue (ask the MDS directly, look at a cache dump, or similar)? Is there an alternative to deletion or MDS fail? Thanks and best regards, ================= Frank Schilder AIT Risø Campus Bygning 109, rum S14 _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Reply

Frank Schilder

12:35 p.m.

Hi Xiubo.

...

IMO evicting the corresponding client could also resolve this issue instead of restarting the MDS.

Yes, it can get rid of the stuck caps release request, but it will also make any process accessing the file system crash. After a client eviction we usually have to reboot the server to get everything back clean. An MDS restart would achieve this in a transparent way and when replaying the journal execute the pending caps recall successfully without making processes crash - if there wasn't the wrong peer issue. As far as I can tell, the operation is stuck in the MDS because its never re-scheduled/re-tried or checked if the condition still exists (the client still holds the caps requested). An MDS restart re-schedules all pending operations and then it succeeds. In every ceph version so far there were examples where hand-shaking between a client and an MDS had small flaws. For situations like that I would really like to have a light-weight MDS daemon command to force a re-schedule/re-play without having to restart the entire MDS and reconnect all its clients from scratch. It would be great to have light-weight tools available to rectify such simple conditions in an as non-disruptive as possible way. Best regards, ================= Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________________________________________ From: Xiubo Li <xiubli(a)redhat.com> Sent: Wednesday, May 10, 2023 4:01 AM To: Frank Schilder; ceph-users(a)ceph.io Subject: Re: [ceph-users] client isn't responding to mclientcaps(revoke), pending pAsLsXsFsc issued pAsLsXsFsc On 5/9/23 16:23, Frank Schilder wrote:

...

Dear Xiubo, both issues will cause problems, the one reported in the subject (https://tracker.ceph.com/issues/57244) and the potential follow-up on MDS restart (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/LYY7TBK63XP…). Either one will cause compute jobs on our HPC cluster to hang and users will need to run the jobs again. Our queues are full, so not very popular to loose your spot. The process in D-state is a user process. Interestingly it is often possible to kill it despite the D-state (if one can find the process) and the stuck recall gets resolved. If I restart the MDS, the stuck process might continue working, but we run a significant risk of other processed getting stuck due to the libceph/MDS wrong peer issue. We actually have these kind of messages [Mon Mar 6 12:56:46 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address [Mon Mar 6 13:05:18 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386 all over the HPC cluster and each of them means that some files/dirs are inaccessible on the compute node and jobs either died or are/got stuck there. Every MDS restart bears the risk of such events happening and with many nodes this probability approaches 1 - every time we restart an MDS jobs get stuck. I have a reproducer for an instance of https://tracker.ceph.com/issues/57244. Unfortunately, this is a big one that I would need to pack into a container. I was not able to reduce it to something small, it seems to depend on a very specific combination of codes with certain internal latencies between threads that trigger a race. It sounds like you have a patch for https://tracker.ceph.com/issues/57244 although its not linked from the tracker item.

IMO evicting the corresponding client could also resolve this issue instead of restarting the MDS. Have you tried this ? Thanks - Xiubo

...

Best regards, ================= Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________________________________________ From: Xiubo Li <xiubli(a)redhat.com> Sent: Friday, May 5, 2023 2:40 AM To: Frank Schilder; ceph-users(a)ceph.io Subject: Re: [ceph-users] client isn't responding to mclientcaps(revoke), pending pAsLsXsFsc issued pAsLsXsFsc On 5/1/23 17:35, Frank Schilder wrote:

Hi all, I think we might be hitting a known problem (https://tracker.ceph.com/issues/57244). I don't want to fail the mds yet, because we have troubles with older kclients that miss the mds restart and hold on to cache entries referring to the killed instance, leading to hanging jobs on our HPC cluster.

Will this cause any issue in your case ?

I have seen this issue before and there was a process in D-state that dead-locked itself. Usually, killing this process succeeded and resolved the issue. However, this time I can't find such a process.

BTW, what's the D-state process ? A ceph one ? Thanks

The tracker mentions that one can delete the file/folder. I have the inode number, but really don't want to start a find on a 1.5PB file system. Is there a better way to find what path is causing the issue (ask the MDS directly, look at a cache dump, or similar)? Is there an alternative to deletion or MDS fail? Thanks and best regards, ================= Frank Schilder AIT Risø Campus Bygning 109, rum S14 _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Reply