Ceph does not recover from OSD restart - ceph-users

3 Aug 2020

Dear cephers,

I have a serious issue with degraded objects after an OSD restart. The cluster was in a
state of re-balancing after adding disks to each host. Before restart I had "X/Y
objects misplaced". Apart from that, health was OK. I now restarted all OSDs of one
host and the cluster does not recover from that:

  cluster:
    id:     xxx
    health: HEALTH_ERR
            45813194/1492348700 objects misplaced (3.070%)
            Degraded data redundancy: 6798138/1492348700 objects degraded (0.456%), 85 pgs
degraded, 86 pgs undersized
            Degraded data redundancy (low space): 17 pgs backfill_toofull
            1 pools nearfull

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 297 osds: 272 up, 272 in; 307 remapped pgs

  data:
    pools:   11 pools, 3215 pgs
    objects: 177.3 M objects, 489 TiB
    usage:   696 TiB used, 1.2 PiB / 1.9 PiB avail
    pgs:     6798138/1492348700 objects degraded (0.456%)
             45813194/1492348700 objects misplaced (3.070%)
             2903 active+clean
             209  active+remapped+backfill_wait
             73   active+undersized+degraded+remapped+backfill_wait
             9    active+remapped+backfill_wait+backfill_toofull
             8    active+undersized+degraded+remapped+backfill_wait+backfill_toofull
             4    active+undersized+degraded+remapped+backfilling
             3    active+remapped+backfilling
             3    active+clean+scrubbing+deep
             1    active+clean+scrubbing
             1    active+undersized+remapped+backfilling
             1    active+clean+snaptrim

  io:
    client:   47 MiB/s rd, 61 MiB/s wr, 732 op/s rd, 792 op/s wr
    recovery: 195 MiB/s, 48 objects/s

After restarting there should only be a small number of degraded objects, the ones that
received writes during OSD restart. What I see, however, is that the cluster seems to have
lost track of a huge amount of objects, the 0.456% degraded are 1-2 days worth of I/O. I
did reboots before and saw only a few thousand objects degraded at most. The output of
ceph health detail shows a lot of lines like these:

[root@gnosis ~]# ceph health detail
HEALTH_ERR 45804316/1492356704 objects misplaced (3.069%); Degraded data redundancy:
6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 pgs undersized; Degraded
data redundancy (low space): 17 pgs backfill_toofull; 1 pools nearfull
OBJECT_MISPLACED 45804316/1492356704 objects misplaced (3.069%)
PG_DEGRADED Degraded data redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs
degraded, 86 pgs undersized
    pg 11.9 is stuck undersized for 815.188981, current state
active+undersized+degraded+remapped+backfill_wait, last acting
[60,148,2147483647,263,76,230,87,169]
8...9
    pg 11.48 is active+undersized+degraded+remapped+backfill_wait, acting
[159,60,180,263,237,3,2147483647,72]
    pg 11.4a is stuck undersized for 851.162862, current state
active+undersized+degraded+remapped+backfill_wait, last acting
[182,233,87,228,2,180,63,2147483647]
[...]
    pg 11.22e is stuck undersized for 851.162402, current state
active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting
[234,183,239,2147483647,170,229,1,86]
PG_DEGRADED_FULL Degraded data redundancy (low space): 17 pgs backfill_toofull
    pg 11.24 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting
[230,259,2147483647,1,144,159,233,146]
[...]
    pg 11.1d9 is active+remapped+backfill_wait+backfill_toofull, acting
[84,259,183,170,85,234,233,2]
    pg 11.225 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull,
acting [236,183,1,2147483647,2147483647,169,229,230]
    pg 11.22e is active+undersized+degraded+remapped+backfill_wait+backfill_toofull,
acting [234,183,239,2147483647,170,229,1,86]
POOL_NEAR_FULL 1 pools nearfull
    pool 'sr-rbd-data-one-hdd' has 164 TiB (max 200 TiB)

It looks like a lot of PGs are not receiving theire complete crush map placement, as if
the peering is incomplete. This is a serious issue, it looks like the cluster will see a
total storage loss if just 2 more hosts reboot - without actually having lost any storage.
The pool in question is a 6+2 EC pool.

What is going on here? Why are the PG-maps not restored to their values from before the
OSD reboot? The degraded PGs should receive the missing OSD IDs, everything is up exactly
as it was before the reboot.

Thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14