hey folks,
I was deploying a new set of NVMe cards into my cluster, and while getting
the new devices ready, it seems the device names got mixed up, and I
managed to run "sgdisk --zap-all" and "dd if=/dev/zero of=/dev/sd
bs=1M count=100" on some of the active devices.
I was adding the new cards so I could migrate off the 2+2 (k=2, m=2)
erasure coded setup to a more redundant config, but in the mix-up I ran the
commands above on 3 of the 4 devices before the ceph status changed and I
noticed the mistake.
I managed to restore the LVM metadata (what I initially thought of as the
partition table) from backup, but that doesn't seem to be enough to get the
OSD running again... I only need to recover one of the 3 drives to save the
filesystem backing all of my VMs and Docker containers.
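
For reference, the restore was essentially the standard LVM metadata
recovery sequence, roughly along these lines (the VG/PV names and archive
file below are placeholders, not the exact ones from my boxes):

  # list the archived metadata for the OSD's volume group
  vgcfgrestore --list ceph-<vg-uuid>
  # recreate the PV with its old UUID from the archived metadata
  pvcreate --uuid <pv-uuid> --restorefile /etc/lvm/archive/ceph-<vg-uuid>_00001.vg /dev/nvme0n1
  # restore the VG metadata and reactivate the LVs
  vgcfgrestore -f /etc/lvm/archive/ceph-<vg-uuid>_00001.vg ceph-<vg-uuid>
  vgchange -ay ceph-<vg-uuid>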
I'm running on Kubernetes with Rook. After restoring the LVM metadata, the
OSD seems to start up OK, but then it hits a stack trace and the container
goes into the Error state:
https://pastebin.com/5wk1bKy9
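
In case more detail helps, I can grab the previous container logs, and I
was planning to check whether the BlueStore label survived the dd, with
something like this (the namespace is the Rook default, and the device/OSD
paths are guesses for my setup):

  # crash output from the failed OSD container
  kubectl -n rook-ceph logs <osd-pod-name> --previous
  # check the BlueStore superblock/label on the restored LV
  ceph-bluestore-tool show-label --dev /dev/ceph-<vg-uuid>/osd-block-<lv-uuid>
  # consistency check of the store itself
  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-<id>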
Any ideas on how to fix this, or how to somehow extract the data and put it
back together?
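
For instance, would exporting whatever PGs are still readable with
ceph-objectstore-tool and importing them into fresh OSDs be viable here?
Something along these lines (the OSD ids and pgid are placeholders):

  # export a PG from the damaged OSD's data dir (with the OSD stopped)
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --op export --pgid 2.1a --file /tmp/2.1a.export
  # import it into a healthy OSD (also stopped)
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 --op import --file /tmp/2.1a.export

No idea whether that's even possible when BlueStore is this damaged, though.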
--
Cheers,
Peter Sarossy
Technical Program Manager
Data Center Data Security - Google LLC.