On Sat, Apr 15, 2023 at 4:58 PM Max Boone <max(a)netpunt.nl> wrote:
> After a critical node failure on my lab cluster, which won't come
> back up and is still down, the RBD objects are still being watched
> / mounted according to Ceph. I can't shell into the node to "rbd unmap"
> them as the node is down. I am absolutely certain that nothing is
> using these images, and they don't have snapshots either (and this IP
> is not even remotely close to those of the monitors in the cluster).
> I blocked the IP using "ceph osd blocklist add", but after 30 minutes
> they are still being watched. Them being watched (they are RWO
> ceph-csi volumes) prevents me from re-using them in the cluster.
> As far as I'm aware, Ceph should remove the watchers after 30 minutes,
> and they've been blocklisted for hours now.
Hi Max,
A couple of general points:
- watch timeout is 30 seconds, not 30 minutes
- watcher IP doesn't have to match that of any of the monitors
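As an aside, the watcher lines printed by "rbd status" and "rados listwatchers" can be parsed mechanically to get the exact address to hand to "ceph osd blocklist add". A minimal sketch (the helper name and regex are mine, not part of any Ceph tooling), assuming the "watcher=ADDR client.ID cookie=N" format shown in your output:

```python
import re

# Matches lines like:
#   watcher=10.0.0.103:0/992994811 client.1634081 cookie=139772597209280
WATCHER_RE = re.compile(
    r"watcher=(?P<addr>\S+)\s+client\.(?P<client_id>\d+)\s+cookie=(?P<cookie>\d+)"
)

def parse_watchers(output):
    """Return (addr, client_id, cookie) tuples from rbd/rados watcher output."""
    return [
        (m.group("addr"), int(m.group("client_id")), int(m.group("cookie")))
        for m in WATCHER_RE.finditer(output)
    ]

sample = """Watchers:
        watcher=10.0.0.103:0/992994811 client.1634081 cookie=139772597209280
"""
print(parse_watchers(sample))
# The addr field ("10.0.0.103:0/992994811") is the full entity address,
# including the nonce after the slash.
```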
> root@node0:~# rbd status
> kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
> Watchers:
>     watcher=10.0.0.103:0/992994811 client.1634081 cookie=139772597209280
> root@node0:~# rbd snap list kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
> root@node0:~# rbd info kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
> rbd image 'csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff':
>     size 10 GiB in 2560 objects
>     order 22 (4 MiB objects)
>     snapshot_count: 0
>     id: 4ff5353b865e1
>     block_name_prefix: rbd_data.4ff5353b865e1
>     format: 2
>     features: layering
>     op_features:
>     flags:
>     create_timestamp: Fri Mar 31 14:46:51 2023
>     access_timestamp: Fri Mar 31 14:46:51 2023
>     modify_timestamp: Fri Mar 31 14:46:51 2023
> root@node0:~# rados -p kubernetes listwatchers rbd_header.4ff5353b865e1
> watcher=10.0.0.103:0/992994811 client.1634081 cookie=139772597209280
> root@node0:~# ceph osd blocklist ls
> 10.0.0.103:0/0 2023-04-16T13:58:34.854232+0200
> listed 1 entries
> root@node0:~# ceph daemon osd.0 config get osd_client_watch_timeout
> {
>     "osd_client_watch_timeout": "30"
> }
> Is it possible to kick a watcher out manually, or is there not much
> I can do here besides shutting down the entire cluster (or OSDs) and
> bringing them back up? If it is a bug, I'm happy to help figure out
> its root cause and see if I can help write a fix.
>
> Cheers, Max.
You may have hit https://tracker.ceph.com/issues/58120.
Try restarting the OSD that is holding the header object. To determine
the OSD, run "ceph osd map kubernetes rbd_header.4ff5353b865e1". The
output should end with something like "acting ([X, Y, Z], pX)", where X,
Y and Z are numbers. X is the OSD you want to restart.
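Extracting X from that output can also be done programmatically. A minimal sketch (the helper name is mine, and the sample line below is an approximation of "ceph osd map" output, not copied from your cluster), assuming the "acting ([X, Y, Z], pX)" format described above:

```python
import re

def primary_osd(osd_map_output):
    """Extract (acting_set, primary_osd_id) from "ceph osd map" output
    ending in something like "... acting ([3, 1, 4], p3)"."""
    m = re.search(r"acting \(\[(?P<set>[\d,\s]*)\], p(?P<primary>\d+)\)",
                  osd_map_output)
    if not m:
        raise ValueError("no acting set found in output")
    acting = [int(x) for x in m.group("set").split(",") if x.strip()]
    return acting, int(m.group("primary"))

# Approximate sample of what "ceph osd map" prints (ids are made up):
sample = ("osdmap e1234 pool 'kubernetes' (5) object 'rbd_header.4ff5353b865e1' "
          "-> pg 5.1a2b3c4d (5.d) -> up ([3, 1, 4], p3) acting ([3, 1, 4], p3)")
print(primary_osd(sample))  # -> ([3, 1, 4], 3), i.e. restart osd.3
```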
Thanks,
Ilya