getting rid of incomplete pg errors - ceph-users

28 Jan 2020

Hi.

before I descend into what happened and why it happened: I'm talking about a
test-cluster so I don't really care about the data in this case.

We've recently started upgrading from luminous to nautilus, and for us that
means we're retiring ceph-disk in favour of ceph-volume with lvm and
dmcrypt.

Our setup is in containers and we've got DBs separated from Data.
When testing our upgrade-path we discovered that running the host on
ubuntu-xenial and the containers on centos-7.7 leads to lvm inside the
containers not using lvmetad because it's too old. That in turn means that
not running `vgscan --cache` on the host before adding a LV to a VG
essentially zeros the metadata for all LVs in that VG.

That happened on two out of three hosts for a bunch of OSDs and those OSDs
are gone. I have no way of getting them back, they've been overwritten
multiple times trying to figure out what went wrong.

So now I have a cluster that's got 16 pgs in 'incomplete', 14 of them with 0
objects, 2 with about 150 objects each.

I have found a couple of howtos that tell me to use ceph-objectstore-tool to
find the pgs on the active osds and I've given that a try, but
ceph-objectstore-tool always tells me it can't find the pg I am looking for.

Can I tell ceph to re-init the pgs? Do I have to delete the pools and
recreate them?

There's no data I can't get back in there, I just don't feel like
scrapping and redeploying the whole cluster.

-- 
Cheers,
	Hardy