Good day
I'm currently decommissioning a cluster that runs EC 3+1 (rack failure
domain, 5 racks); however, the cluster still has some production data on it,
since I'm in the process of moving it to our new EC 8+2 cluster.
It's running Luminous 12.2.13 on Ubuntu 16.04 HWE, containerized with
ceph-ansible 3.2.
I'm currently getting the following errors after we lost one OSD (osd.195).
I've tried repairing, scrubbing, deep scrubbing, restarting OSDs, and
everything else mentioned in the troubleshooting docs and suggested on IRC,
but I cannot for the life of me get it to recover.
What I'm seeing is that pg 9.3dd (volume_images) reports one OSD with
status "osd is down" (osd.195, which I know about), but another, osd.316,
shows "not queried". Also, pg 9.3dd has four functioning OSDs in its "up"
set, yet its "acting" set still references the missing OSD as 2147483647
(Ceph's placeholder value for "no OSD").
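To make the peer situation easier to see at a glance, here is the quick sketch (plain Python, sample data abridged from the `ceph pg 9.3dd query` output pasted below) I've been using to filter the `might_have_unfound` entries down to the peers that haven't been successfully probed:

```python
import json

# Abridged `ceph pg 9.3dd query` output (recovery_state section only).
pg_query = json.loads("""
{
  "recovery_state": [
    {
      "name": "Started/Primary/Active",
      "might_have_unfound": [
        {"osd": "64(2)",  "status": "already probed"},
        {"osd": "139(1)", "status": "already probed"},
        {"osd": "195(1)", "status": "osd is down"},
        {"osd": "316(2)", "status": "not queried"},
        {"osd": "367(3)", "status": "already probed"}
      ]
    }
  ]
}
""")

# Collect every peer whose status is anything other than "already probed".
problem_peers = [
    entry["osd"]
    for state in pg_query["recovery_state"]
    for entry in state.get("might_have_unfound", [])
    if entry["status"] != "already probed"
]
print(problem_peers)  # ['195(1)', '316(2)']
```

Which is exactly the pair I mentioned above: osd.195 (down, expected) and osd.316 (not queried, unexpected).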
Regards
OBJECT_UNFOUND 1/501815192 objects unfound (0.000%)
pg 9.3dd has 1 unfound objects
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 9.3d1 is active+clean+inconsistent, acting [347,316,307,249]
PG_DEGRADED Degraded data redundancy: 1219/2001837265 objects degraded
(0.000%), 1 pg degraded, 1 pg undersized
pg 9.3dd is stuck undersized for 55486.439002, current state
active+recovery_wait+forced_recovery+undersized+degraded+remapped, last
acting [355,2147483647,64,367]
ceph pg 9.3dd query
"up": [
355,
139,
64,
367
],
"acting": [
355,
2147483647,
64,
367
],
"recovery_state": [
{
"name": "Started/Primary/Active",
"enter_time": "2021-03-08 16:07:51.239010",
"might_have_unfound": [
{
"osd": "64(2)",
"status": "already probed"
},
{
"osd": "139(1)",
"status": "already probed"
},
{
"osd": "195(1)",
"status": "osd is down"
},
{
"osd": "316(2)",
"status": "not queried"
},
{
"osd": "367(3)",
"status": "already probed"
}
],
"recovery_progress": {
"backfill_targets": [
"139(1)"
ceph pg 9.3d1 query
{
"state": "active+clean+inconsistent",
"snap_trimq": "[]",
"snap_trimq_len": 0,
"epoch": 168488,
"up": [
347,
316,
307,
249
],
"acting": [
347,
316,
307,
249
],
"actingbackfill": [
"249(3)",
"307(2)",
"316(1)",
"347(0)"
--
CLUSTER STATS:
cluster:
id: 1ea59fbe-46a4-474e-8225-a66b32ca86b7
health: HEALTH_ERR
1/490166525 objects unfound (0.000%)
1 scrub errors
Possible data damage: 1 pg inconsistent
Degraded data redundancy: 1259/1956055233 objects degraded
(0.000%), 1 pg degraded, 1 pg undersized
services:
mon: 3 daemons, quorum B-04-11-cephctl,B-05-11-cephctl,B-03-11-cephctl
mgr: B-03-11-cephctl(active), standbys: B-04-11-cephctl, B-05-11-cephctl
mds: cephfs-1/1/1 up {0=B-04-11-cephctl=up:active}, 2 up:standby
osd: 384 osds: 383 up, 383 in; 1 remapped pgs
data:
pools: 11 pools, 13440 pgs
objects: 490.17M objects, 1.35PiB
usage: 1.88PiB used, 2.33PiB / 4.21PiB avail
pgs: 1259/1956055233 objects degraded (0.000%)
1/490166525 objects unfound (0.000%)
13332 active+clean
96 active+clean+scrubbing+deep
10 active+clean+scrubbing
1 active+clean+inconsistent
1     active+recovery_wait+forced_recovery+undersized+degraded+remapped
*Jeremi-Ernst Avenant, Mr.*
Cloud Infrastructure Specialist
Inter-University Institute for Data Intensive Astronomy
5th Floor, Department of Physics and Astronomy,
University of Cape Town
Rondebosch, Cape Town, 7600, South Africa
Tel: 021 959 4137
Web: www.idia.ac.za
E-mail (IDIA): jeremi(a)idia.ac.za