Re: flock is held after ceph-osd daemon being stopped

18 Feb 2020

...
  On Feb 18, 2020, at 4:11 AM, Jeff Layton <jlayton(a)redhat.com> wrote:

 On Fri, 2020-02-14 at 07:13 -0800, Yiming Zhang wrote:
   On Feb 13, 2020, at 3:52 AM, Jeff Layton <jlayton(a)redhat.com> wrote:

 If the OSD daemon dies, then it will have closed all of its fd's and
 there should be no more lock. Therefore you almost certainly have some
 other process running that is holding the lock.

 You may have to do a bit of digging in /proc/locks. Determine the
 dev+inode number of the file on which the lock is being set and find it
 in /proc/locks. Then you can track down the PID that's holding that
 lock.
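
 The lookup described above can be sketched in Python. This is only an illustration, not something posted in the thread: the function names `find_lock_holders` and `holders_for_path` are hypothetical, and it assumes the usual /proc/locks layout, where the sixth field is `MAJOR:MINOR:INODE` with the device numbers in hex and the inode in decimal.

```python
import os


def find_lock_holders(locks_text, major, minor, inode):
    """Return (pid, lock_type, access) for each held lock on the given dev+inode.

    locks_text is the content of /proc/locks. Hypothetical helper for illustration.
    """
    holders = []
    for line in locks_text.splitlines():
        fields = line.split()
        # Lines containing "->" describe blocked waiters, not lock holders.
        if "->" in fields or len(fields) < 8:
            continue
        _lock_id, lock_type, _advisory, access, pid, dev_inode = fields[:6]
        maj, mnr, ino = dev_inode.split(":")
        # Device numbers are printed in hex, the inode number in decimal.
        if (int(maj, 16), int(mnr, 16), int(ino)) == (major, minor, inode):
            holders.append((int(pid), lock_type, access))
    return holders


def holders_for_path(path):
    """Stat the path (following symlinks) and search /proc/locks for it."""
    st = os.stat(path)
    with open("/proc/locks") as f:
        text = f.read()
    return find_lock_holders(text, os.major(st.st_dev), os.minor(st.st_dev), st.st_ino)
```

 Calling `holders_for_path()` on the OSD's block path would list the PIDs recorded for any lock on that file. An empty result means /proc/locks shows no held lock on that dev+inode.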
   I checked the locks with lslocks. Here are the locks after I started Ceph with vstart
(bluestore block = /dev/sdc, where sdc is a raw device):
 COMMAND           PID  TYPE SIZE MODE  M START END PATH
 ceph-mgr        19852 POSIX      WRITE 0     0   0 /...
 iscsid           1061 POSIX      WRITE 0     0   0 /run...
 ceph-mgr        14889 POSIX      WRITE 0     0   0 /...
 rpcbind           990 FLOCK      WRITE 0     0   0 /run...
 ceph-mon        16430 POSIX      WRITE 0     0   0 /...
 ceph-mon        16430 POSIX      WRITE 0     0   0 /...
 ceph-mon        18107 POSIX      WRITE 0     0   0 /...
 ceph-mon        18107 POSIX      WRITE 0     0   0 /...
 ceph-mon        19711 POSIX      WRITE 0     0   0 /...
 ceph-mon        19711 POSIX      WRITE 0     0   0 /...
 ceph-mon        10495 POSIX      WRITE 0     0   0 /...
 ceph-mon        10495 POSIX      WRITE 0     0   0 /...
 ceph-mon        14748 POSIX      WRITE 0     0   0 /...
 ceph-mon        14748 POSIX      WRITE 0     0   0 /...
 cron             1085 FLOCK      WRITE 0     0   0 /run...
 ceph-mgr        18247 POSIX      WRITE 0     0   0 /...
 atd              1111 POSIX      WRITE 0     0   0 /run...
 lvmetad           807 POSIX      WRITE 0     0   0 /run...
 ceph-mgr        10635 POSIX      WRITE 0     0   0 /...
 ceph-mgr        16571 POSIX      WRITE 0     0   0 /…

 Then I killed all related processes and restarted the cluster, but the error “_lock flock failed on
/users/xxx/ceph/build/dev/osd0/block” persists.

 After the kill, the locks are:
 COMMAND           PID  TYPE SIZE MODE  M START END PATH
 rpcbind         20267 FLOCK      WRITE 0     0   0 /run...
 lvmetad         20266 POSIX      WRITE 0     0   0 /run…

 The error happens in KernelDevice.cc:
 int r = ::flock(fd_directs[WRITE_LIFE_NOT_SET], LOCK_EX | LOCK_NB);
 Here r is -1, fd_directs[WRITE_LIFE_NOT_SET] is 11, and WRITE_LIFE_NOT_SET is 0.
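
 To check the same non-blocking exclusive flock outside Ceph, a small standalone probe along these lines may help. It is a sketch under assumptions, not code from Ceph: `try_exclusive_flock` is a hypothetical name, and it opens the path read-only, whereas the OSD opens the device with its own flags.

```python
import errno
import fcntl
import os


def try_exclusive_flock(path):
    """Attempt flock(LOCK_EX | LOCK_NB) on path.

    Returns (fd, 0) on success; the caller must close fd to release the lock.
    Returns (None, errno) on failure. Hypothetical helper for illustration.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # Same lock flags as the call in KernelDevice.cc quoted above.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        os.close(fd)
        return None, exc.errno
    return fd, 0
```

 If the returned errno is EWOULDBLOCK (the same value as EAGAIN on Linux), the kernel is reporting that another open file description already holds a conflicting lock on that file. Any other errno points to a different failure.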

 Any suggestions on how to proceed with the issue?

 Sorry, no. Any lock set on a block device should show up in /proc/locks
 (as it uses the kernel's generic flock lock mechanism for local
 filesystems).

 You may want to play with strace and verify that the error is coming
 from the kernel and that the program is attempting to set the lock on
 the file you think it is.

 What kernel is this running on? 
The kernel is 4.15.0-70-generic (I also have the same issue on another kernel,
4.15.18-041518-generic). I used strace to track the issue, and it led to this
particular function, _lock in KernelDevice (`r = _lock();` in the KernelDevice::open function).
If I comment it out, the error goes away, but that is not a fix.
Maybe there is a bug here. I’ll keep digging into this.

Thanks,
-ym

...
  -- 
 Jeff Layton <jlayton(a)redhat.com>
