Hi Paul,
Many thanks for your helpful suggestions.
Yes, we have 13 pgs with "might_have_unfound" entries.
(plus 1 pg without any "might_have_unfound" entries that is stuck in
the active+recovery_unfound+degraded+repair state)
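In case it's useful, this is how I'm enumerating the affected pgs (a
rough sketch: it keys off the "has N unfound objects" wording in the
health output below, and assumes the first recovery_state entry is the
Started/Primary/Active one, as in the query further down):

# for each pg with unfound objects, dump its might_have_unfound list
for pg in $(ceph health detail | awk '/has .* unfound objects/ {print $2}'); do
    echo "== $pg =="
    ceph pg "$pg" query | jq '.recovery_state[0].might_have_unfound'
done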
Taking one pg with unfound objects:
[root@ceph1 ~]# ceph health detail | grep 5.5c9
pg 5.5c9 has 2 unfound objects
pg 5.5c9 is active+recovery_unfound+degraded, acting
[347,442,381,215,91,260,31,94,178,302], 2 unfound
pg 5.5c9 is active+recovery_unfound+degraded, acting
[347,442,381,215,91,260,31,94,178,302], 2 unfound
pg 5.5c9 not deep-scrubbed since 2020-01-16 08:05:43.119336
pg 5.5c9 not scrubbed since 2020-01-16 08:05:43.119336
Checking the state:
[root@ceph1 ~]# ceph pg 5.5c9 query | jq .recovery_state
[
  {
    "name": "Started/Primary/Active",
    "enter_time": "2020-02-03 09:57:30.982038",
    "might_have_unfound": [
      {
        "osd": "31(6)",
        "status": "already probed"
      },
      {
        "osd": "91(4)",
        "status": "already probed"
      },
      {
        "osd": "94(7)",
        "status": "already probed"
      },
      {
        "osd": "178(8)",
        "status": "already probed"
      },
      {
        "osd": "215(3)",
        "status": "already probed"
      },
      {
        "osd": "260(5)",
        "status": "already probed"
      },
      {
        "osd": "302(9)",
        "status": "already probed"
      },
      {
        "osd": "381(2)",
        "status": "already probed"
      },
      {
        "osd": "442(1)",
        "status": "already probed"
      }
    ],
    "recovery_progress": {
      "backfill_targets": [],
      "waiting_on_backfill": [],
      "last_backfill_started": "MIN",
      "backfill_info": {
        "begin": "MIN",
        "end": "MIN",
        "objects": []
      },
      "peer_backfill_info": [],
      "backfills_in_flight": [],
      "recovering": [],
      "pg_backend": {
        "recovery_ops": [],
        "read_ops": []
      }
    },
    "scrub": {
      "scrubber.epoch_start": "0",
      "scrubber.active": false,
      "scrubber.state": "INACTIVE",
      "scrubber.start": "MIN",
      "scrubber.end": "MIN",
      "scrubber.max_end": "MIN",
      "scrubber.subset_last_update": "0'0",
      "scrubber.deep": false,
      "scrubber.waiting_on_whom": []
    }
  },
  {
    "name": "Started",
    "enter_time": "2020-02-03 09:57:29.788310"
  }
]
-----------------------------------------------------
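Incidentally, every OSD in "might_have_unfound" reports "already
probed", so a quick filter for anything not yet probed (a sketch, based
on the query output above) comes back empty:

ceph pg 5.5c9 query | jq '.recovery_state[0].might_have_unfound[] | select(.status != "already probed")'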
Taking your advice, I marked the primary OSD for this pg down:
[root@ceph1 ~]# ceph osd down 347
This doesn't change the output of "ceph pg 5.5c9 query", apart from
updating the "Started" enter_time, and ceph health still shows unfound
objects.
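If it helps anyone else, the same can be scripted for every affected pg
(a rough sketch: it assumes the primary is the first entry in the pg's
acting set, i.e. ".acting[0]" in the query output):

# mark the primary of each pg with unfound objects down, forcing it to
# re-peer and re-probe the other OSDs in the acting set
for pg in $(ceph health detail | awk '/has .* unfound objects/ {print $2}'); do
    primary=$(ceph pg "$pg" query | jq '.acting[0]')
    echo "marking osd.$primary down (pg $pg)"
    ceph osd down "$primary"
done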
To fix this, do we need to issue a scrub (or deep scrub) so that the
objects can be found?
Just in case, I've issued a manual scrub:
[root@ceph1 ~]# ceph pg scrub 5.5c9
instructing pg 5.5c9s0 on osd.347 to scrub
The cluster is currently busy deleting snapshots, so it may take a while
before the scrub starts.
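If the scrub never gets scheduled, one possible reason is that
scrubbing is skipped while recovery is pending
(osd_scrub_during_recovery defaults to false); I'm not sure whether an
operator-requested scrub bypasses that check. Worth verifying, and
there is a deep variant too:

ceph config get osd osd_scrub_during_recovery
ceph pg deep-scrub 5.5c9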
best regards,
Jake
On 2/3/20 6:31 PM, Paul Emmerich wrote:
> This might be related to recent problems with OSDs not being queried
> for unfound objects properly in some cases (which I think was fixed
> in master?)
>
> Anyways: run ceph pg <pg> query on the affected PGs, check for "might
> have unfound" and try restarting the OSDs mentioned there. Probably
> also sufficient to just run "ceph osd down" on the primaries on the
> affected PGs to get them to re-check.
>
> Paul
--
Jake Grimmett
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.