Hi,
I have set up a 3-host cluster with 30 OSDs total. The cluster has health OK and no warnings
whatsoever. I set up an RBD pool and 14 images, which were all rbd-mirrored to a second
cluster (which has been disconnected since the problems began), and also an iSCSI interface.
Then I connected a Windows 2019 Server through iSCSI, mounted all 14 drives and created a
spanned volume across all the drives. Everything was working fine, but I had to disconnect
the server, so I disconnected the iSCSI interface, and when I tried to reconnect, my volume
was unusable and the drives seemed stuck. I ended up rebooting each cluster node and then,
since I still couldn't use my images, removed and recreated all of them.
On this second run all was good: I had robocopy syncing files to my Ceph cluster for almost
a week and had already copied more than 5TB of data when my Windows Server got stuck. I'm
still not sure why: some services like FTP were responding, but others, including login,
were not. So I reset the Windows server, and when it was back up, my spanned volume was bad
again. I've been trying to recover it for the last 2 days without success.
Right now all images are disconnected. I have no locks (I found some at some point and
removed them, though I'm not sure what had been holding them) and no watchers on any of the
images, but the 3 images that had data on them are corrupt or locked somehow. Nothing I try
works on them; the operation just gets stuck. I can edit the images' config, except for
these 3. I can create snapshots, except for these 3. I managed to mount the images via
iSCSI on a Linux box, but on these 3 the Linux tools (fdisk, parted) hang. The Ceph
dashboard shows stats like read and write rates for all images except these 3.
It seems something inside the images is broken or stuck, but as I said, no locks on them.
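For reference, this is roughly how I checked and cleaned up locks and watchers (pool and image names here are placeholders, not my actual names):

```shell
# List advisory locks on an image -- this is where I found the stale locks
rbd lock ls mypool/image01

# Remove a stale lock, using the lock ID and locker shown by "lock ls"
rbd lock rm mypool/image01 <lock-id> <locker>

# Show the image's watchers (clients with the header object open)
rbd status mypool/image01
```

On the 3 bad images, `rbd status` reports no watchers, yet operations against them still hang.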
I tried a lot of options along the way, and somehow my cluster now has some RGW pools that
I have no idea where they came from.
Any idea what I should do?
--
Salsa