Hello,
I am seeing some commands running on CephFS mounts getting stuck in an uninterruptible
sleep, at which point I can only terminate them by rebooting the client. Has anyone
experienced anything similar and found a way to safeguard against this?
My mount is using the ceph kernel driver, with the following config in fstab:
10.225.44.236,10.225.44.237,10.225.44.238:6789:/albacore/system/deploy on /opt/dcl/deploy type ceph (rw,noatime,name=albacore,secret=<hidden>,acl,wsize=32768,rsize=32768,_netdev)
The vast majority of commands complete successfully on the mounted filesystem, but on one
occasion a `chmod -R +r *` command hung indefinitely (despite having run
successfully numerous times before). Attempts to terminate the process using `kill` fail,
and repeated attempts to run the same command block in the same state. `ps`
shows the processes stuck in uninterruptible sleep (state D):
[root@svr01 albacore] ~> ps -Al | grep chmod
4 D 0 18657 18656 0 80 0 - 26998 rwsem_ pts/2 00:00:00 chmod
4 D 0 21835 1 0 80 0 - 26998 rwsem_ ? 00:00:00 chmod
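For anyone wanting to collect the same information in one pass, here is a rough sketch in Python that walks /proc, picks out tasks in state D, and records the wait channel for each. The function names (`parse_stat`, `find_blocked`) are hypothetical helpers, not part of any Ceph tooling. It assumes a Linux /proc layout; /proc/<pid>/stack is usually readable only by root and only on kernels built with stack tracing, so that field may come back empty.

```python
import os


def parse_stat(text):
    """Split the start of a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    lparen = text.index("(")
    rparen = text.rindex(")")
    pid = int(text[:lparen].strip())
    comm = text[lparen + 1:rparen]
    state = text[rparen + 1:].split()[0]
    return pid, comm, state


def read_optional(path):
    """Return the stripped file contents, or '' if the file cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return ""


def find_blocked(proc_root="/proc"):
    """List tasks in uninterruptible sleep (state D) under proc_root."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'sys'
        base = os.path.join(proc_root, entry)
        stat = read_optional(os.path.join(base, "stat"))
        if not stat:
            continue  # the process exited while we were scanning
        pid, comm, state = parse_stat(stat)
        if state == "D":
            blocked.append({
                "pid": pid,
                "comm": comm,
                "wchan": read_optional(os.path.join(base, "wchan")),
                "stack": read_optional(os.path.join(base, "stack")),
            })
    return sorted(blocked, key=lambda task: task["pid"])
```

Calling `find_blocked()` as root on the affected client and printing each entry should show which kernel function the hung chmod processes are waiting in, which is more detail than the truncated `rwsem_` column in the `ps` output.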
Ceph seems to be unaware of the hung processes. There are no slow ops or ops in flight in
either the dump_ops_in_flight output on the server or under /sys/kernel/debug/ceph/ on
the client. Similarly, dmesg shows nothing for the command or process. Ceph health
reports no MDS issues, and there is nothing in the MDS logs from the time the
processes hung.
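The client-side part of that check can be sketched roughly as follows. The sketch assumes debugfs is mounted at /sys/kernel/debug, that the script runs as root, and that each kernel client instance has its own subdirectory containing `mdsc` and `osdc` files (the MDS and OSD request lists); those file names are an assumption based on the mainline kernel client and may differ on other builds. The output format varies between kernel versions, so the hypothetical helper `collect_client_requests` only gathers the raw text and does not try to interpret it.

```python
import glob
import os


def collect_client_requests(debug_root="/sys/kernel/debug/ceph"):
    """Return {path: contents} for each client's mdsc and osdc debugfs file."""
    collected = {}
    for name in ("mdsc", "osdc"):
        # One subdirectory per kernel client instance (assumed layout).
        pattern = os.path.join(debug_root, "*", name)
        for path in sorted(glob.glob(pattern)):
            try:
                with open(path) as f:
                    collected[path] = f.read().strip()
            except OSError:
                # Unreadable (not root) or the client instance went away.
                continue
    return collected
```

Printing the returned dictionary at the moment a command hangs gives a snapshot of what the kernel client believes is outstanding, which can then be compared with the dump_ops_in_flight output from the MDS.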
The only method I've found of clearing the processes is to reboot my client.
Has anyone got experience with this? Are there ceph mount options that would guard against
this?
Some details of the current setup:
• ceph version 14.2.5 (ad5bd132e1492173c85fda2cc863152730b16a92) nautilus (stable)
• We're using the ceph kernel driver, kernel: 5.5.7-1.el7.elrepo.x86_64
• The client server has 38 separate directories mounted, all from the same CephFS
filesystem.
• All 38 directories are mounted with the same config by three separate clients.
• Mount config (in fstab):
10.225.44.236,10.225.44.237,10.225.44.238:6789:/albacore/system/deploy on /opt/dcl/deploy type ceph (rw,noatime,name=albacore,secret=<hidden>,acl,wsize=32768,rsize=32768,_netdev)
Kind regards,
Dave