Doubling the capacity in one shot was a big topology change, hence the ~53% of objects misplaced.
OSD fullness naturally follows a bell curve, so there will always be a tail of under-full and
over-full OSDs. Had you not said that your cluster was very full before the expansion, I
would have predicted it from the full / nearfull OSDs.
Think of CRUSH as a hash function that can experience collisions. When you change the
topology, some collisions are removed, and sometimes PGs newly land on OSDs from which they
were previously redirected, which can result in additional fullness. This can also occur
simply because data moves onto a given OSD before it's moved off: during a move, Ceph
copies data to the new location before deleting the old, to maintain full redundancy along
the way.
To list your OSDs sorted by utilization and find the most-full outliers:
`ceph osd df | sort -nk8`
There are a couple of ways to recover, depending on which release you're running (you
didn't specify). Going forward, you'll need to squeeze the most-full outliers down on a
continual basis:
* Balance OSDs with the ceph-mgr pg-upmap balancer (if all clients are Luminous or
later)
* Balance OSDs with reweight-by-utilization
* Balance OSDs with override weights `ceph osd reweight osd.666 0.xx`
* Raise the OSD full ratio and backfillfull ratio a few percentage points to let the
affected OSDs drain. You may need to restart them serially for the new settings to take
effect.
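A rough sketch of what those options look like on the command line, assuming Nautilus
(14.2.x) as in your report below. The OSD id and the threshold/ratio values here are
placeholders, not recommendations; preview and adjust for your cluster:

```shell
# Option 1: pg-upmap balancer (requires all clients >= Luminous)
ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on

# Option 2: reweight-by-utilization; 110 means "adjust OSDs more than 10%
# above mean utilization". Preview first with the dry-run variant:
ceph osd test-reweight-by-utilization 110
ceph osd reweight-by-utilization 110

# Option 3: manually lower the override weight (0 < w <= 1) on a
# most-full OSD; id 666 is a placeholder
ceph osd reweight 666 0.95

# Option 4: raise the full / backfillfull ratios a few points so the
# backfill_toofull PGs can drain; revert once the cluster is balanced
ceph osd set-full-ratio 0.97
ceph osd set-backfillfull-ratio 0.93
```

These commands require a live cluster and admin keyring, so run the test/dry-run
variants first where they exist.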
On Aug 27, 2020, at 8:28 AM, Dallas Jones
<djones(a)tech4learning.com> wrote:
My 3-node Ceph cluster (14.2.4) has been running fine for months. However,
my data pool became close to full a couple of weeks ago, so I added 12 new
OSDs, roughly doubling the capacity of the cluster. However, the pool size
has not changed, and the health of the cluster has changed for the worse.
The dashboard shows the following cluster status:
- PG_DEGRADED_FULL: Degraded data redundancy (low space): 2 pgs
backfill_toofull
- POOL_NEARFULL: 6 pool(s) nearfull
- OSD_NEARFULL: 1 nearfull osd(s)
Output from ceph -s:
  cluster:
    id:     e5a47160-a302-462a-8fa4-1e533e1edd4e
    health: HEALTH_ERR
            1 nearfull osd(s)
            6 pool(s) nearfull
            Degraded data redundancy (low space): 2 pgs backfill_toofull

  services:
    mon: 3 daemons, quorum ceph01,ceph02,ceph03 (age 5w)
    mgr: ceph01(active, since 4w), standbys: ceph03, ceph02
    mds: cephfs:1 {0=ceph01=up:active} 2 up:standby
    osd: 33 osds: 33 up (since 43h), 33 in (since 43h); 1094 remapped pgs
    rgw: 3 daemons active (ceph01, ceph02, ceph03)

  data:
    pools:   6 pools, 1632 pgs
    objects: 134.50M objects, 7.8 TiB
    usage:   42 TiB used, 81 TiB / 123 TiB avail
    pgs:     213786007/403501920 objects misplaced (52.983%)
             1088 active+remapped+backfill_wait
             538  active+clean
             4    active+remapped+backfilling
             2    active+remapped+backfill_wait+backfill_toofull

  io:
    recovery: 477 KiB/s, 330 keys/s, 29 objects/s
Can someone steer me in the right direction for how to get my cluster
healthy again?
Thanks in advance!
-Dallas
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io