Dear all,
We're running Ceph Luminous and we've recently hit an issue with some OSDs
(OSDs auto-marked out, I/O and CPU overload) which unfortunately left one
placement group in the state "stale+active+clean". It's a placement group
from the .rgw.root pool:
1.15 0 0 0 0 0 0 1 1
stale+active+clean 2020-05-11 23:22:51.396288 40'1
2142:152 [3,2,6] 3 [3,2,6] 3 40'1 2020-04-22
00:46:05.904418 40'1 2020-04-20 20:18:13.371396 0
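For reference, the row above is this PG's line from "ceph pg dump"; the
standard Luminous commands to reproduce it and check the current mapping
are:

    ceph pg dump pgs | grep '^1.15 '   # the row quoted above
    ceph pg map 1.15                   # current up/acting sets
    ceph pg 1.15 query                 # may hang while the PG is stale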
I guess there is no active replica of that placement group anywhere in the
cluster. Restarting the osd.3, osd.2 or osd.6 daemons does not help.
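These were plain daemon restarts, e.g. (assuming a systemd deployment):

    systemctl restart ceph-osd@3   # likewise for osd.2 and osd.6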
I've used ceph-objectstore-tool and successfully exported the placement
group from osd.3, osd.2 and osd.6. The exports differ slightly in file
size; the one from osd.3, which was the latest primary, is the largest, so
I tried importing that one onto a completely different OSD. When that OSD
starts up I see the following (this is from osd.1):
2020-05-14 21:43:19.779740 7f7880ac3700 1 osd.1 pg_epoch: 2459 pg[1.15( v 40'1
(0'0,40'1] local-lis/les=2073/2074 n=0 ec=73/39 lis/c 2073/2073 les/c/f
2074/2074/633 2145/39/2145) [] r=-1 lpr=2455 crt=40'1 lcod 0'0 unknown NOTIFY]
state<Start>: transitioning to Stray
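For completeness, the export/import was done roughly as follows. The data
paths and export file name below are illustrative (the default locations;
adjust for your deployment), and ceph-objectstore-tool requires the OSD in
question to be stopped while it runs:

    # on the node holding osd.3 (the source):
    systemctl stop ceph-osd@3
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
        --pgid 1.15 --op export --file /tmp/pg1.15.osd3.export
    systemctl start ceph-osd@3

    # on the node holding osd.1 (the target):
    systemctl stop ceph-osd@1
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
        --op import --file /tmp/pg1.15.osd3.export
    systemctl start ceph-osd@1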
From previous pg dumps (taken several weeks earlier, while the PG was still
active+clean) I see it held 115 bytes and zero objects, but I am not sure
how to interpret that.
As this is a PG from the .rgw.root pool, I cannot get any response from the
cluster when accessing it (every request times out).
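For example, even a plain listing of the pool hangs:

    rados -p .rgw.root ls   # never returns while pg 1.15 is unavailable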
What is the correct course of action with this PG?
Any help would be greatly appreciated.
Thanks,
Tomislav