Hi,
We've let our Ceph pool (Octopus) get into a bad state; it's around 90%
full:
# ceph health
HEALTH_ERR 1/4 mons down, quorum
angussyd-kvm01,angussyd-kvm02,angussyd-kvm03; 3 backfillfull osd(s); 1 full
osd(s); 14 nearfull osd(s); Low space hindering backfill (add storage if
this doesn't resolve itself): 580 pgs backfill_toofull; Degraded data
redundancy: 1860769/9916650 objects degraded (18.764%), 597 pgs degraded,
580 pgs undersized; 323 pgs not deep-scrubbed in time; 189 pgs not scrubbed
in time; Full OSDs blocking recovery: 17 pgs recovery_toofull; 4 pool(s)
full; 1 pools have too many placement groups
At this point, even trying to run "rbd rm" or "rbd du" seems to time
out.
(I am, however, able to run "rbd ls -l", which shows me the rbd image
sizes - I assume those are provisioned sizes, before taking thin
provisioning into account.)
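For reference, here's roughly what I'm running ("<pool>" and "<image>"
below are placeholders, not our real names):

# rbd du <pool>/<image>      <- hangs / times out
# rbd rm <pool>/<image>      <- hangs / times out
# rbd ls -l <pool>           <- works; the SIZE column shows the
                                provisioned size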
Is there any way to rescue this pool, or at least some way to
force-delete some of the large images?
Regards,
Victor