On Sat, Apr 15, 2023 at 4:58 PM Max Boone <max(a)netpunt.nl> wrote:
> After a critical node failure on my lab cluster, which won't come
> back up and is still down, the RBD objects are still being watched
> / mounted according to Ceph. I can't shell into the node to "rbd unmap"
> them as the node is down. I am absolutely certain that nothing is
> using these images, and they don't have snapshots either (and this IP
> is not even remotely close to those of the monitors in the cluster).
> I blocked the IP using "ceph osd blocklist add", but after 30 minutes
> they are still being watched. Them being watched (they are RWO
> ceph-csi volumes) prevents me from re-using them in the cluster.
> As far as I'm aware, Ceph should remove the watchers after 30 minutes,
> and they've been blocklisted for hours now.
Hi Max,
A couple of general points:
- watch timeout is 30 seconds, not 30 minutes
- watcher IP doesn't have to match that of any of the monitors
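As an aside, the watcher lines printed by "rbd status" and "rados listwatchers" can be parsed mechanically to get the exact address to hand to "ceph osd blocklist add". A minimal sketch (the helper name and regex are mine, not part of any Ceph tooling), assuming the "watcher=ADDR client.ID cookie=N" format shown in your output:

```python
import re

# Matches lines like:
#   watcher=10.0.0.103:0/992994811 client.1634081 cookie=139772597209280
WATCHER_RE = re.compile(
    r"watcher=(?P<addr>\S+)\s+client\.(?P<client_id>\d+)\s+cookie=(?P<cookie>\d+)"
)

def parse_watchers(output):
    """Return (addr, client_id, cookie) tuples from rbd/rados watcher output."""
    return [
        (m.group("addr"), int(m.group("client_id")), int(m.group("cookie")))
        for m in WATCHER_RE.finditer(output)
    ]

sample = """Watchers:
        watcher=10.0.0.103:0/992994811 client.1634081 cookie=139772597209280
"""
print(parse_watchers(sample))
# The addr field ("10.0.0.103:0/992994811") is the full entity address,
# including the nonce after the slash.
```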
> root@node0:~# rbd status
> kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
> Watchers:
>     watcher=10.0.0.103:0/992994811 client.1634081 cookie=139772597209280
> root@node0:~# rbd snap list kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
> root@node0:~# rbd info kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
> rbd image 'csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff':
>     size 10 GiB in 2560 objects
>     order 22 (4 MiB objects)
>     snapshot_count: 0
>     id: 4ff5353b865e1
>     block_name_prefix: rbd_data.4ff5353b865e1
>     format: 2
>     features: layering
>     op_features:
>     flags:
>     create_timestamp: Fri Mar 31 14:46:51 2023
>     access_timestamp: Fri Mar 31 14:46:51 2023
>     modify_timestamp: Fri Mar 31 14:46:51 2023
> root@node0:~# rados -p kubernetes listwatchers rbd_header.4ff5353b865e1
> watcher=10.0.0.103:0/992994811 client.1634081 cookie=139772597209280
> root@node0:~# ceph osd blocklist ls
> 10.0.0.103:0/0 2023-04-16T13:58:34.854232+0200
> listed 1 entries
> root@node0:~# ceph daemon osd.0 config get osd_client_watch_timeout
> {
>     "osd_client_watch_timeout": "30"
> }
> Is it possible to kick a watcher out manually, or is there not much
> I can do here besides shutting down the entire cluster (or OSDs) and
> bringing them back up? If it is a bug, I'm happy to help figure out
> its root cause and see if I can help write a fix.
>
> Cheers, Max.
You may have hit https://tracker.ceph.com/issues/58120.
Try restarting the OSD that is holding the header object. To determine
the OSD, run "ceph osd map kubernetes rbd_header.4ff5353b865e1". The
output should end with something like "acting ([X, Y, Z], pX)", where X,
Y and Z are numbers. X is the OSD you want to restart.
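Extracting X from that output can also be done programmatically. A minimal sketch (the helper name is mine, and the sample line below is an approximation of "ceph osd map" output, not copied from your cluster), assuming the "acting ([X, Y, Z], pX)" format described above:

```python
import re

def primary_osd(osd_map_output):
    """Extract (acting_set, primary_osd_id) from "ceph osd map" output
    ending in something like "... acting ([3, 1, 4], p3)"."""
    m = re.search(r"acting \(\[(?P<set>[\d,\s]*)\], p(?P<primary>\d+)\)",
                  osd_map_output)
    if not m:
        raise ValueError("no acting set found in output")
    acting = [int(x) for x in m.group("set").split(",") if x.strip()]
    return acting, int(m.group("primary"))

# Approximate sample of what "ceph osd map" prints (ids are made up):
sample = ("osdmap e1234 pool 'kubernetes' (5) object 'rbd_header.4ff5353b865e1' "
          "-> pg 5.1a2b3c4d (5.d) -> up ([3, 1, 4], p3) acting ([3, 1, 4], p3)")
print(primary_osd(sample))  # -> ([3, 1, 4], 3), i.e. restart osd.3
```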
Thanks,
Ilya