We've also seen this bug several times since Mimic. It seems to happen
whenever a backfill target goes down. It always resolves itself, but it
is still annoying.
The original fix (making this a warning instead of an error)
unfortunately doesn't help on Nautilus, because we often have clusters
that would be HEALTH_OK without this bug (i.e., some PGs in
remapped+backfill*), but with the fix they show up as HEALTH_WARN
(and as HEALTH_ERR without it).
Paul
On Wed, Aug 14, 2019 at 11:44 PM Bryan Stillwell <bstillwell(a)godaddy.com> wrote:
We've run into this issue on the first two clusters after upgrading them to Nautilus
(14.2.2).
When marking a single OSD back in to the cluster, some PGs switch to the
active+remapped+backfill_wait+backfill_toofull state for a while, and the state then
clears after some of the other PGs finish backfilling. This is rather odd, because all
the data on the cluster could fit on a single drive, and we have over 100 drives:
# ceph -s
  cluster:
    id:     XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
    health: HEALTH_ERR
            Degraded data redundancy (low space): 1 pg backfill_toofull

  services:
    mon: 3 daemons, quorum a1cephmon002,a1cephmon003,a1cephmon004 (age 21h)
    mgr: a1cephmon002(active, since 21h), standbys: a1cephmon003, a1cephmon004
    mds: cephfs:2 {0=a1cephmon002=up:active,1=a1cephmon003=up:active} 1 up:standby
    osd: 143 osds: 142 up, 142 in; 106 remapped pgs
    rgw: 11 daemons active (radosgw.a1cephrgw008, radosgw.a1cephrgw009,
         radosgw.a1cephrgw010, radosgw.a1cephrgw011, radosgw.a1tcephrgw002,
         radosgw.a1tcephrgw003, radosgw.a1tcephrgw004, radosgw.a1tcephrgw005,
         radosgw.a1tcephrgw006, radosgw.a1tcephrgw007, radosgw.a1tcephrgw008)

  data:
    pools:   19 pools, 5264 pgs
    objects: 1.45M objects, 148 GiB
    usage:   658 GiB used, 436 TiB / 437 TiB avail
    pgs:     44484/4351770 objects misplaced (1.022%)
             5158 active+clean
             104  active+remapped+backfill_wait
             1    active+remapped+backfilling
             1    active+remapped+backfill_wait+backfill_toofull

  io:
    client:   19 MiB/s rd, 13 MiB/s wr, 431 op/s rd, 509 op/s wr
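
As a rough sanity check on the figures in the status output above, the arithmetic can be sketched in a few lines of Python. The numbers are copied from that output. The 0.90 backfillfull threshold is an assumption (the commonly cited default), since the cluster's actual setting is not shown here, and cluster-wide usage is only a proxy, because the toofull check is expected to apply per OSD rather than to the cluster as a whole.

```python
# Figures copied from the `ceph -s` output above.
used_gib = 658
total_tib = 437
misplaced_objects = 44484
total_object_copies = 4351770
pg_states = {
    "active+clean": 5158,
    "active+remapped+backfill_wait": 104,
    "active+remapped+backfilling": 1,
    "active+remapped+backfill_wait+backfill_toofull": 1,
}

# Assumed threshold: not taken from this cluster's configuration.
ASSUMED_BACKFILLFULL_RATIO = 0.90

# Cluster-wide raw usage as a fraction of total capacity (1 TiB = 1024 GiB).
used_fraction = used_gib / (total_tib * 1024)

# Share of object copies that are currently misplaced, in percent.
misplaced_pct = 100 * misplaced_objects / total_object_copies

# The per-state PG counts should add up to the pool total (5264 pgs).
total_pgs = sum(pg_states.values())

print(f"raw usage:  {used_fraction:.3%} of capacity")
print(f"misplaced:  {misplaced_pct:.3f}%")
print(f"total PGs:  {total_pgs}")
print(f"below assumed backfillfull ratio: {used_fraction < ASSUMED_BACKFILLFULL_RATIO}")
```

With well under 1% of raw capacity in use, the cluster as a whole is nowhere near any plausible fullness threshold, which is why the backfill_toofull flag looks spurious here.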
I searched the archives, but most of the other reports came from much fuller clusters,
where this state could sometimes be valid. This bug report seems similar, but the fix
was just to make it a warning instead of an error:
https://tracker.ceph.com/issues/39555
So I've created a new tracker ticket to troubleshoot this issue:
https://tracker.ceph.com/issues/4125
Let me know what you guys think,
Bryan
croit GmbH
Freseniusstr. 31h
81247 München