Hi Paul,
Many thanks for your helpful suggestions.
Yes, we have 13 pgs with "might_have_unfound" entries.
(plus 1 pg without any "might_have_unfound" entries that is stuck in
the active+recovery_unfound+degraded+repair state)
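In case it's useful, this is how I'm enumerating the affected pgs (a
rough sketch: it keys off the "has N unfound objects" wording in the
health output below, and assumes the first recovery_state entry is the
Started/Primary/Active one, as in the query further down):

# for each pg with unfound objects, dump its might_have_unfound list
for pg in $(ceph health detail | awk '/has .* unfound objects/ {print $2}'); do
    echo "== $pg =="
    ceph pg "$pg" query | jq '.recovery_state[0].might_have_unfound'
done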
Taking one pg with unfound objects:
[root@ceph1 ~]# ceph health detail | grep 5.5c9
pg 5.5c9 has 2 unfound objects
pg 5.5c9 is active+recovery_unfound+degraded, acting
[347,442,381,215,91,260,31,94,178,302], 2 unfound
pg 5.5c9 is active+recovery_unfound+degraded, acting
[347,442,381,215,91,260,31,94,178,302], 2 unfound
pg 5.5c9 not deep-scrubbed since 2020-01-16 08:05:43.119336
pg 5.5c9 not scrubbed since 2020-01-16 08:05:43.119336
Checking the state:
[root@ceph1 ~]# ceph pg 5.5c9 query | jq .recovery_state
[
  {
    "name": "Started/Primary/Active",
    "enter_time": "2020-02-03 09:57:30.982038",
    "might_have_unfound": [
      {
        "osd": "31(6)",
        "status": "already probed"
      },
      {
        "osd": "91(4)",
        "status": "already probed"
      },
      {
        "osd": "94(7)",
        "status": "already probed"
      },
      {
        "osd": "178(8)",
        "status": "already probed"
      },
      {
        "osd": "215(3)",
        "status": "already probed"
      },
      {
        "osd": "260(5)",
        "status": "already probed"
      },
      {
        "osd": "302(9)",
        "status": "already probed"
      },
      {
        "osd": "381(2)",
        "status": "already probed"
      },
      {
        "osd": "442(1)",
        "status": "already probed"
      }
    ],
    "recovery_progress": {
      "backfill_targets": [],
      "waiting_on_backfill": [],
      "last_backfill_started": "MIN",
      "backfill_info": {
        "begin": "MIN",
        "end": "MIN",
        "objects": []
      },
      "peer_backfill_info": [],
      "backfills_in_flight": [],
      "recovering": [],
      "pg_backend": {
        "recovery_ops": [],
        "read_ops": []
      }
    },
    "scrub": {
      "scrubber.epoch_start": "0",
      "scrubber.active": false,
      "scrubber.state": "INACTIVE",
      "scrubber.start": "MIN",
      "scrubber.end": "MIN",
      "scrubber.max_end": "MIN",
      "scrubber.subset_last_update": "0'0",
      "scrubber.deep": false,
      "scrubber.waiting_on_whom": []
    }
  },
  {
    "name": "Started",
    "enter_time": "2020-02-03 09:57:29.788310"
  }
]
-----------------------------------------------------
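Incidentally, every OSD in "might_have_unfound" reports "already
probed", so a quick filter for anything not yet probed (a sketch, based
on the query output above) comes back empty:

ceph pg 5.5c9 query | jq '.recovery_state[0].might_have_unfound[] | select(.status != "already probed")'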
Taking your advice, I marked the primary OSD for this pg down:
[root@ceph1 ~]# ceph osd down 347
This doesn't change the output of "ceph pg 5.5c9 query", apart from
updating the "Started" enter_time, and ceph health still shows unfound
objects.
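If it helps anyone else, the same can be scripted for every affected pg
(a rough sketch: it assumes the primary is the first entry in the pg's
acting set, i.e. ".acting[0]" in the query output):

# mark the primary of each pg with unfound objects down, forcing it to
# re-peer and re-probe the other OSDs in the acting set
for pg in $(ceph health detail | awk '/has .* unfound objects/ {print $2}'); do
    primary=$(ceph pg "$pg" query | jq '.acting[0]')
    echo "marking osd.$primary down (pg $pg)"
    ceph osd down "$primary"
done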
To fix this, do we need to issue a scrub (or deep scrub) so that the
objects can be found?
Just in case, I've issued a manual scrub:
[root@ceph1 ~]# ceph pg scrub 5.5c9
instructing pg 5.5c9s0 on osd.347 to scrub
The cluster is currently busy deleting snapshots, so it may take a while
before the scrub starts.
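If the scrub never gets scheduled, one possible reason is that
scrubbing is skipped while recovery is pending
(osd_scrub_during_recovery defaults to false); I'm not sure whether an
operator-requested scrub bypasses that check. Worth verifying, and
there is a deep variant too:

ceph config get osd osd_scrub_during_recovery
ceph pg deep-scrub 5.5c9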
best regards,
Jake
On 2/3/20 6:31 PM, Paul Emmerich wrote:
> This might be related to recent problems with OSDs not being queried
> for unfound objects properly in some cases (which I think was fixed
> in master?)
>
> Anyways: run ceph pg <pg> query on the affected PGs, check for "might
> have unfound" and try restarting the OSDs mentioned there. Probably
> also sufficient to just run "ceph osd down" on the primaries on the
> affected PGs to get them to re-check.
>
> Paul
--
Jake Grimmett
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.