Dear Xiubo,
Both issues cause problems for us: the one reported in the subject
(https://tracker.ceph.com/issues/57244) and the potential follow-up on MDS restart
(https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/LYY7TBK63XP…).
Either one makes compute jobs on our HPC cluster hang, and users then need to run
the jobs again. Our queues are full, so losing your spot is not popular.
The process in D-state is a user process. Interestingly, it is often possible to kill it
despite the D-state (if one can find the process), and the stuck recall then gets resolved. If I
restart the MDS, the stuck process might continue working, but we run a significant risk
of other processes getting stuck due to the libceph/MDS wrong peer issue. We actually have
these kinds of messages
[Mon Mar 6 12:56:46 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:05:18 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
all over the HPC cluster, and each of them means that some files/dirs are inaccessible on
that compute node and jobs there either died or got stuck. Every MDS restart bears the
risk of such events happening, and with many nodes this probability approaches 1: every
time we restart an MDS, jobs get stuck.
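In case it is useful for others hitting this, below is a rough sketch of how one can hunt for the stuck process on a client node (an illustration only, assuming a Linux /proc layout). It lists processes in D-state together with their kernel wait channel; a ceph-related symbol in the wait channel is a good hint, but not a guarantee, that this is the process holding up the cap recall.

#!/usr/bin/env python3
# Illustration only: list processes in uninterruptible sleep (D-state)
# together with their kernel wait channel, to help spot the process that
# holds up a CephFS cap recall. Assumes a Linux /proc layout; run as root
# so the wait channel is readable for all processes.
import os

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
        # /proc/<pid>/stat is "pid (comm) state ..."; comm may contain spaces.
        comm = stat[stat.index("(") + 1 : stat.rindex(")")]
        state = stat[stat.rindex(")") + 2]
        if state != "D":
            continue
        with open(f"/proc/{pid}/wchan") as f:
            wchan = f.read().strip()
        print(f"pid={pid} comm={comm} wchan={wchan}")
    except OSError:
        # the process exited while we were looking at it
        continue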
I have a reproducer for an instance of https://tracker.ceph.com/issues/57244.
Unfortunately, it is a big one that I would need to pack into a container. I was not
able to reduce it to something small; it seems to depend on a very specific combination of
codes with certain internal latencies between threads that trigger a race.
It sounds like you have a patch for https://tracker.ceph.com/issues/57244, although it's not
linked from the tracker item.
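Coming back to my question further down about mapping the inode number to a path without
running a find over the whole file system: asking the MDS directly might work, e.g. via its
"dump inode" command. A rough sketch is below; the MDS name (mds.0) and the inode number are
placeholders, and I have not checked the exact output format against our release. Dumping the
MDS cache to a file and grepping for the inode would be an alternative, but that can be heavy
on a busy MDS.

#!/usr/bin/env python3
# Rough sketch: ask the MDS about an inode instead of scanning the whole
# file system. Recent MDS versions offer a "dump inode" command that prints
# the inode's metadata, including the path it currently knows, as JSON.
# The MDS spec and the inode number below are placeholders.
import subprocess

MDS = "mds.0"            # placeholder: name/rank of the MDS to ask
INODE = "1099511627776"  # placeholder: inode number from the revoke warning

result = subprocess.run(
    ["ceph", "tell", MDS, "dump", "inode", INODE],
    check=True, capture_output=True, text=True,
)
print(result.stdout)     # inspect the JSON output for the inode's path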
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Xiubo Li <xiubli(a)redhat.com>
Sent: Friday, May 5, 2023 2:40 AM
To: Frank Schilder; ceph-users(a)ceph.io
Subject: Re: [ceph-users] client isn't responding to mclientcaps(revoke), pending
pAsLsXsFsc issued pAsLsXsFsc
On 5/1/23 17:35, Frank Schilder wrote:
> Hi all,
> I think we might be hitting a known problem (https://tracker.ceph.com/issues/57244). I
> don't want to fail the mds yet, because we have troubles with older kclients that miss
> the mds restart and hold on to cache entries referring to the killed instance, leading to
> hanging jobs on our HPC cluster.
Will this cause any issue in your case?
> I have seen this issue before and there was a process
> in D-state that dead-locked itself. Usually, killing this process succeeded and resolved
> the issue. However, this time I can't find such a process.
BTW, what's the D-state process? A ceph one?
Thanks
> The tracker mentions that one can delete the
> file/folder. I have the inode number, but really don't want to start a find on a 1.5PB
> file system. Is there a better way to find what path is causing the issue (ask the MDS
> directly, look at a cache dump, or similar)? Is there an alternative to deletion or MDS
> fail?
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io