Dear all,
We're running Ceph Luminous and we've recently hit an issue with some OSDs
(OSDs auto-marked out, I/O and CPU overload) which unfortunately left one
placement group in the state "stale+active+clean". It's a placement group
from the .rgw.root pool:
1.15 0 0 0 0 0 0 1 1
stale+active+clean 2020-05-11 23:22:51.396288 40'1
2142:152 [3,2,6] 3 [3,2,6] 3 40'1 2020-04-22
00:46:05.904418 40'1 2020-04-20 20:18:13.371396 0
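For reference, the row above is this PG's line from "ceph pg dump"; the
standard Luminous commands to reproduce it and check the current mapping
are:

    ceph pg dump pgs | grep '^1.15 '   # the row quoted above
    ceph pg map 1.15                   # current up/acting sets
    ceph pg 1.15 query                 # may hang while the PG is stale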
I guess there is no active replica of that placement group anywhere in the
cluster. Restarting the osd.3, osd.2 or osd.6 daemons does not help.
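These were plain daemon restarts, e.g. (assuming a systemd deployment):

    systemctl restart ceph-osd@3   # likewise for osd.2 and osd.6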
I've used ceph-objectstore-tool and successfully exported the placement
group from osd.3, osd.2 and osd.6. The exports differ slightly in file
size; the one from osd.3, which was the latest primary, is the largest, so
I tried importing that one onto a completely different OSD. When that OSD
starts up I see the following (this is from osd.1):
2020-05-14 21:43:19.779740 7f7880ac3700 1 osd.1 pg_epoch: 2459 pg[1.15( v 40'1
(0'0,40'1] local-lis/les=2073/2074 n=0 ec=73/39 lis/c 2073/2073 les/c/f
2074/2074/633 2145/39/2145) [] r=-1 lpr=2455 crt=40'1 lcod 0'0 unknown NOTIFY]
state<Start>: transitioning to Stray
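For completeness, the export/import was done roughly as follows. The data
paths and export file name below are illustrative (the default locations;
adjust for your deployment), and ceph-objectstore-tool requires the OSD in
question to be stopped while it runs:

    # on the node holding osd.3 (the source):
    systemctl stop ceph-osd@3
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
        --pgid 1.15 --op export --file /tmp/pg1.15.osd3.export
    systemctl start ceph-osd@3

    # on the node holding osd.1 (the target):
    systemctl stop ceph-osd@1
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
        --op import --file /tmp/pg1.15.osd3.export
    systemctl start ceph-osd@1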
From previous pg dumps (taken several weeks earlier, while the PG was still
active+clean) I see it held 115 bytes and zero objects, but I am not sure
how to interpret that.
As this is a PG from the .rgw.root pool, I cannot get any response from the
cluster when accessing it (every request times out).
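For example, even a plain listing of the pool hangs:

    rados -p .rgw.root ls   # never returns while pg 1.15 is unavailable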
What is the correct course of action with this PG?
Any help would be greatly appreciated.
Thanks,
Tomislav