Quick update, in case anyone read my previous post.
No ideas were forthcoming on how to fix the assert that was flapping the
OSD (caused by deleting unfound objects).
The affected pg was readable, so we decided to recycle the OSD...
destroy the flapping primary OSD
# ceph osd destroy 443 --force
purge the lvm entry for this disk
# lvremove /dev/ceph-64b0010b-e397-49c2-ab01-6e43e6e5b41a/osd-block-fb824e45-d35f-486c-a4ca-05e5937eceae
zap the disk, it's the only way to be sure...
# ceph-volume lvm zap /dev/sdab
reuse the drive & OSD number
# ceph-volume lvm prepare --osd-id 443 --data /dev/sdab
activate the OSD
# ceph-volume lvm activate 443 6e252371-d158-4d16-ac31-fed8f7d0cb1f
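For anyone hitting the same assert, the steps above could be wrapped in a small helper script. This is only a sketch: "recycle_osd" and its arguments are my own naming, not a ceph tool, and the final activate is left as a comment because the new OSD fsid only becomes known from the `ceph-volume lvm prepare` output (or `ceph-volume lvm list`).

```shell
#!/usr/bin/env bash
# Hypothetical helper wrapping the recycle steps above; run on the OSD host.
set -euo pipefail

recycle_osd() {
    local osd_id=$1    # e.g. 443
    local device=$2    # e.g. /dev/sdab
    local lv_path=$3   # e.g. /dev/ceph-<vg-uuid>/osd-block-<lv-uuid>

    # destroy the flapping OSD but keep its id free for reuse
    ceph osd destroy "$osd_id" --force

    # drop the old logical volume, then wipe the disk
    lvremove -y "$lv_path"
    ceph-volume lvm zap "$device"

    # recreate the OSD with the same id on the same drive
    ceph-volume lvm prepare --osd-id "$osd_id" --data "$device"

    # prepare prints the new osd fsid; activate with it by hand, e.g.:
    #   ceph-volume lvm activate "$osd_id" <new-osd-fsid>
}
```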
Now watching to see if the cluster recovers...
best,
Jake
On 2/10/20 3:31 PM, Jake Grimmett wrote:
Dear All,
Following a clunky* cluster restart, we had
23 "objects unfound"
14 pg recovery_unfound
Seeing no way to recover the unfound objects, we decided to mark the
unfound objects in one pg as lost...
[root@ceph1 bad_oid]# ceph pg 5.f2f mark_unfound_lost delete
pg has 2 objects unfound and apparently lost marking
Unfortunately, this immediately crashed the primary OSD for this PG:
OSD log showing the osd crashing 3 times here: <http://p.ip.fi/gV8r>
the assert was :>
2020-02-10 13:38:45.003 7fa713ef3700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.6/rpm/el7/BUILD/ceph-14.2.6/src/osd/PrimaryLogPG.cc:
In function 'int PrimaryLogPG::recover_missing(const hobject_t&,
eversion_t, int, PGBackend::RecoveryHandle*)' thread 7fa713ef3700 time
2020-02-10 13:38:45.000875
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.6/rpm/el7/BUILD/ceph-14.2.6/src/osd/PrimaryLogPG.cc:
11550: FAILED ceph_assert(head_obc)
Questions...
1) Is it possible to recover the flapping OSD, or should we fail it out
and hope the cluster recovers?
2) We have 13 other pgs with unfound objects. Do we need to mark_unfound
these one at a time, then fail out each primary OSD (letting the
cluster recover before marking the next pg unfound and failing its
primary OSD)?
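If it does come down to walking the remaining pgs one at a time, the pg ids could be scraped from `ceph health detail`. A sketch only: the awk pattern assumes the usual wording "pg X.Y has N unfound objects", and `parse_unfound_pgs` is my own helper name.

```shell
#!/usr/bin/env bash
# Sketch: extract pg ids with unfound objects from `ceph health detail` text.
# Assumes lines of the form "pg 5.f2f has 2 unfound objects".
parse_unfound_pgs() {
    awk '$1 == "pg" && /unfound/ {print $2}'
}

# usage (one pg at a time, waiting for recovery in between):
#   for pg in $(ceph health detail | parse_unfound_pgs); do
#       ceph pg "$pg" mark_unfound_lost delete
#       ...wait for recovery before the next one...
#   done
```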
* thread describing the bad restart :>
<https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IRKCDRRAH7YZEVXN5CH4JT2NH4EWYRGI/#IRKCDRRAH7YZEVXN5CH4JT2NH4EWYRGI>
many thanks!
Jake
--
Dr Jake Grimmett
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.