Quick update, in case anyone read my previous post.
No ideas were forthcoming on how to fix the assert that was flapping the
OSD (caused by deleting unfound objects).
The affected pg was readable, so we decided to recycle the OSD...
destroy the flapping primary OSD
# ceph osd destroy 443 --force
purge the lvm entry for this disk
# lvremove /dev/ceph-64b0010b-e397-49c2-ab01-6e43e6e5b41a/osd-block-fb824e45-d35f-486c-a4ca-05e5937eceae
zap the disk, it's the only way to be sure...
# ceph-volume lvm zap /dev/sdab
reuse the drive & OSD number
# ceph-volume lvm prepare --osd-id 443 --data /dev/sdab
activate the OSD
# ceph-volume lvm activate 443 6e252371-d158-4d16-ac31-fed8f7d0cb1f
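For anyone hitting the same assert, the steps above could be wrapped in a small helper script. This is only a sketch: "recycle_osd" and its arguments are my own naming, not a ceph tool, and the final activate is left as a comment because the new OSD fsid only becomes known from the `ceph-volume lvm prepare` output (or `ceph-volume lvm list`).

```shell
#!/usr/bin/env bash
# Hypothetical helper wrapping the recycle steps above; run on the OSD host.
set -euo pipefail

recycle_osd() {
    local osd_id=$1    # e.g. 443
    local device=$2    # e.g. /dev/sdab
    local lv_path=$3   # e.g. /dev/ceph-<vg-uuid>/osd-block-<lv-uuid>

    # destroy the flapping OSD but keep its id free for reuse
    ceph osd destroy "$osd_id" --force

    # drop the old logical volume, then wipe the disk
    lvremove -y "$lv_path"
    ceph-volume lvm zap "$device"

    # recreate the OSD with the same id on the same drive
    ceph-volume lvm prepare --osd-id "$osd_id" --data "$device"

    # prepare prints the new osd fsid; activate with it by hand, e.g.:
    #   ceph-volume lvm activate "$osd_id" <new-osd-fsid>
}
```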
Now watching to see if the cluster recovers...
best,
Jake
On 2/10/20 3:31 PM, Jake Grimmett wrote:
Dear All,
Following a clunky* cluster restart, we had
23 "objects unfound"
14 pg recovery_unfound
Seeing no way to recover the unfound objects, we decided to mark the
unfound objects in one pg as lost...
[root@ceph1 bad_oid]# ceph pg 5.f2f mark_unfound_lost delete
pg has 2 objects unfound and apparently lost marking
Unfortunately, this immediately crashed the primary OSD for this PG:
OSD log showing the osd crashing 3 times here: <http://p.ip.fi/gV8r>
the assert was :>
2020-02-10 13:38:45.003 7fa713ef3700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.6/rpm/el7/BUILD/ceph-14.2.6/src/osd/PrimaryLogPG.cc:
In function 'int PrimaryLogPG::recover_missing(const hobject_t&,
eversion_t, int, PGBackend::RecoveryHandle*)' thread 7fa713ef3700 time
2020-02-10 13:38:45.000875
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.6/rpm/el7/BUILD/ceph-14.2.6/src/osd/PrimaryLogPG.cc:
11550: FAILED ceph_assert(head_obc)
Questions...
1) Is it possible to recover the flapping OSD, or should we fail it out
and hope the cluster recovers?
2) We have 13 other pgs with unfound objects. Do we need to mark_unfound
these one at a time, then fail out each primary OSD (letting the
cluster recover before marking the next pg unfound and failing its
primary OSD)?
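If it does come down to walking the remaining pgs one at a time, the pg ids could be scraped from `ceph health detail`. A sketch only: the awk pattern assumes the usual wording "pg X.Y has N unfound objects", and `parse_unfound_pgs` is my own helper name.

```shell
#!/usr/bin/env bash
# Sketch: extract pg ids with unfound objects from `ceph health detail` text.
# Assumes lines of the form "pg 5.f2f has 2 unfound objects".
parse_unfound_pgs() {
    awk '$1 == "pg" && /unfound/ {print $2}'
}

# usage (one pg at a time, waiting for recovery in between):
#   for pg in $(ceph health detail | parse_unfound_pgs); do
#       ceph pg "$pg" mark_unfound_lost delete
#       ...wait for recovery before the next one...
#   done
```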
* thread describing the bad restart :>
<https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IRKCDRRAH7YZEVXN5CH4JT2NH4EWYRGI/#IRKCDRRAH7YZEVXN5CH4JT2NH4EWYRGI>
many thanks!
Jake
--
Dr Jake Grimmett
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.