On Feb 18, 2020, at 4:11 AM, Jeff Layton <jlayton(a)redhat.com> wrote:
On Fri, 2020-02-14 at 07:13 -0800, Yiming Zhang wrote:
On Feb 13, 2020, at 3:52 AM, Jeff Layton <jlayton(a)redhat.com> wrote:
If the OSD daemon dies, then it will have closed all of its fd's and
there should be no lock left. Therefore you almost certainly have some
other process running that is holding the lock.
You may have to do a bit of digging in /proc/locks. Determine the
dev+inode number of the file on which the lock is being set and find it
in /proc/locks. Then you can track down the PID that's holding that
lock.
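The lookup described above can be sketched in code. This is only a rough illustration, assuming the usual Linux /proc/locks layout (id, lock type, ADVISORY/MANDATORY, READ/WRITE, PID, then major:minor:inode with major and minor in hex). The names LockEntry, parse_lock_line and find_lock_holders are made up for this example and are not part of Ceph or any library.

```cpp
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// One parsed row of /proc/locks (hypothetical helper type).
struct LockEntry {
    std::string type;          // POSIX, FLOCK or OFDLCK
    std::string mode;          // READ or WRITE
    long pid = 0;              // PID as reported by the kernel
    unsigned int dev_major = 0;
    unsigned int dev_minor = 0;
    unsigned long inode = 0;
    bool blocked = false;      // true for "->" rows (waiters, not holders)
};

// Parse a line such as "1: FLOCK  ADVISORY  WRITE 1061 08:20:1234 0 EOF".
// Returns false if the line does not have the expected shape.
bool parse_lock_line(const std::string& line, LockEntry& out) {
    std::istringstream in(line);
    std::string id, tok, advisory, devino;
    if (!(in >> id >> tok)) return false;
    if (tok == "->") {         // "->" marks a request blocked on the lock above it
        out.blocked = true;
        if (!(in >> tok)) return false;
    }
    out.type = tok;
    if (!(in >> advisory >> out.mode >> out.pid >> devino)) return false;
    // Major and minor are printed in hex, the inode number in decimal.
    return std::sscanf(devino.c_str(), "%x:%x:%lu",
                       &out.dev_major, &out.dev_minor, &out.inode) == 3;
}

// Return every held lock in /proc/locks whose dev+inode matches `path`.
std::vector<LockEntry> find_lock_holders(const std::string& path) {
    std::vector<LockEntry> hits;
    struct stat st;
    if (::stat(path.c_str(), &st) != 0) return hits;  // stat() follows symlinks
    std::ifstream locks("/proc/locks");
    std::string line;
    while (std::getline(locks, line)) {
        LockEntry e;
        if (parse_lock_line(line, e) && !e.blocked &&
            e.dev_major == major(st.st_dev) &&
            e.dev_minor == minor(st.st_dev) &&
            e.inode == static_cast<unsigned long>(st.st_ino)) {
            hits.push_back(e);
        }
    }
    return hits;
}
```

Calling find_lock_holders() on the file in question and printing the pid of each hit gives the processes to look at. An empty result means either that nothing holds a lock on that dev+inode or that the path could not be stat()ed.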
I have checked the locks with lslocks. Here are the locks after I started Ceph with vstart
(bluestore block = /dev/sdc, where sdc is a raw device):
COMMAND    PID TYPE  SIZE MODE  M START END PATH
ceph-mgr 19852 POSIX      WRITE 0     0   0 /...
iscsid    1061 POSIX      WRITE 0     0   0 /run...
ceph-mgr 14889 POSIX      WRITE 0     0   0 /...
rpcbind    990 FLOCK      WRITE 0     0   0 /run...
ceph-mon 16430 POSIX      WRITE 0     0   0 /...
ceph-mon 16430 POSIX      WRITE 0     0   0 /...
ceph-mon 18107 POSIX      WRITE 0     0   0 /...
ceph-mon 18107 POSIX      WRITE 0     0   0 /...
ceph-mon 19711 POSIX      WRITE 0     0   0 /...
ceph-mon 19711 POSIX      WRITE 0     0   0 /...
ceph-mon 10495 POSIX      WRITE 0     0   0 /...
ceph-mon 10495 POSIX      WRITE 0     0   0 /...
ceph-mon 14748 POSIX      WRITE 0     0   0 /...
ceph-mon 14748 POSIX      WRITE 0     0   0 /...
cron      1085 FLOCK      WRITE 0     0   0 /run...
ceph-mgr 18247 POSIX      WRITE 0     0   0 /...
atd       1111 POSIX      WRITE 0     0   0 /run...
lvmetad    807 POSIX      WRITE 0     0   0 /run...
ceph-mgr 10635 POSIX      WRITE 0     0   0 /...
ceph-mgr 16571 POSIX      WRITE 0     0   0 /…
Then I killed all related processes and restarted the cluster, but the error “_lock flock failed on
/users/xxx/ceph/build/dev/osd0/block” persists.
After the kill, locks are:
COMMAND    PID TYPE  SIZE MODE  M START END PATH
rpcbind  20267 FLOCK      WRITE 0     0   0 /run...
lvmetad  20266 POSIX      WRITE 0     0   0 /run…
The error happens in KernelDevice.cc:
int r = ::flock(fd_directs[WRITE_LIFE_NOT_SET], LOCK_EX | LOCK_NB);
Here r is -1, fd_directs[WRITE_LIFE_NOT_SET] is 11, and WRITE_LIFE_NOT_SET
is 0.
Any suggestions on how to proceed with this issue?
Sorry, no. Any lock set on a block device should show up in /proc/locks
(as it uses the kernel's generic flock lock mechanism for local
filesystems).
You may want to play with strace and verify that the error is coming
from the kernel and that the program is attempting to set the lock on
the file you think it is.
What kernel is this running on?
The kernel is 4.15.0-70-generic (I also have the same issue on another kernel,
4.15.18-041518-generic). I used strace to track the issue, and it led to this
particular function, _lock in KernelDevice (`r = _lock();` in the KernelDevice::open function).
If I comment it out, the error goes away, but that’s not a fix.
Maybe there is a bug here. I’ll keep digging into this.
Thanks,
-ym