Hi Wout,
None of the OSDs are more than 20% full. However, only 1 PG is
backfilling at a time, while the others are in backfill_wait. I
recently added a large amount of data to the Ceph cluster, which may
have caused the number of PGs to increase, triggering the need to
rebalance or move objects.
It appears that I could increase the number of backfill operations
that happen simultaneously by raising `osd_max_backfills` and/or
`osd_recovery_max_active`. Increasing the maximum number of concurrent
backfills seems worth considering, because the overall I/O during the
backfill is quite small.
Does this seem reasonable? If so, with Ceph Octopus/cephadm, how can I
adjust these parameters?
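From the docs, I'm guessing the runtime override would look something
like the following (the values here are placeholders I picked for
illustration, not recommendations):

```shell
# Raise the number of concurrent backfills per OSD (Octopus default is 1)
ceph config set osd osd_max_backfills 4

# Optionally raise concurrent recovery ops per OSD as well
ceph config set osd osd_recovery_max_active 5

# Verify the values took effect
ceph config get osd osd_max_backfills
```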
Thanks,
Matt
On Mon, Sep 21, 2020 at 2:21 PM Wout van Heeswijk <wout(a)42on.com> wrote:
Hi Matt,
The mon data can grow while PGs are stuck unclean. Don't restart the mons.
You need to find out why your placement groups are "backfill_wait". Likely some
of your OSDs are (near)full.
If you have space elsewhere you can use the ceph balancer module or reweighting of OSDs
to rebalance data.
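If you go the balancer route, a minimal sequence would be something
like this (upmap mode assumes all clients are Luminous or newer):

```shell
ceph balancer status        # check current balancer state
ceph balancer mode upmap    # or crush-compat for older clients
ceph balancer on            # enable automatic balancing
```

Alternatively, `ceph osd reweight-by-utilization` does a one-shot
reweight of the most-utilized OSDs.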
Scrubbing will continue once the PGs are "active+clean".
Kind regards,
Wout
42on
________________________________________
From: Matt Larson <larsonmattr(a)gmail.com>
Sent: Monday, September 21, 2020 6:22 PM
To: ceph-users(a)ceph.io
Subject: [ceph-users] Troubleshooting stuck unclean PGs?
Hi,
Our Ceph cluster is reporting several PGs that have not been scrubbed
or deep-scrubbed in time; it has been over a week since these PGs were
last scrubbed. `ceph health detail` shows 29 pgs not deep-scrubbed in
time and 22 pgs not scrubbed in time. I tried to manually start a
scrub on the PGs, but it appears they are actually in an unclean state
that needs to be resolved first.
This is a cluster running:
ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)
Following the information at [Troubleshooting PGs](https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-…),
I checked for PGs that are stuck stale | inactive | unclean. There
were no PGs that are stale or inactive, but there are several that are
stuck unclean:
```
PG_STAT STATE                          UP                               UP_PRIMARY ACTING                           ACTING_PRIMARY
8.3c    active+remapped+backfill_wait  [124,41,108,8,87,16,79,157,49]   124        [139,57,16,125,154,65,109,86,45] 139
8.3e    active+remapped+backfill_wait  [108,2,58,146,130,29,37,66,118]  108        [127,92,24,50,33,6,130,66,149]   127
8.3f    active+remapped+backfill_wait  [19,34,86,132,59,78,153,99,6]    19         [90,45,147,4,105,61,30,66,125]   90
8.40    active+remapped+backfill_wait  [19,131,80,76,42,101,61,3,144]   19         [28,106,132,3,151,36,65,60,83]   28
8.3a    active+remapped+backfilling    [32,72,151,30,103,131,62,84,120] 32         [91,60,7,133,101,117,78,20,158]  91
8.7e    active+remapped+backfill_wait  [108,2,58,146,130,29,37,66,118]  108        [127,92,24,50,33,6,130,66,149]   127
8.3b    active+remapped+backfill_wait  [34,113,148,63,18,95,70,129,13]  34         [66,17,132,90,14,52,101,47,115]  66
8.7f    active+remapped+backfill_wait  [19,34,86,132,59,78,153,99,6]    19         [90,45,147,4,105,61,30,66,125]   90
8.78    active+remapped+backfill_wait  [96,113,159,63,29,133,73,8,89]   96         [138,121,15,103,55,41,146,69,18] 138
8.7d    active+remapped+backfilling    [0,90,60,124,159,19,71,101,135]  0          [150,72,124,129,63,10,94,29,41]  150
8.7c    active+remapped+backfill_wait  [124,41,108,8,87,16,79,157,49]   124        [139,57,16,125,154,65,109,86,45] 139
8.79    active+remapped+backfill_wait  [59,15,41,82,131,20,73,156,113]  59         [13,51,120,102,29,149,42,79,132] 13
```
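For reference, the listing above came from roughly this command:

```shell
# Dump PGs stuck in the "unclean" state
ceph pg dump_stuck unclean
```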
If I query one of the PGs that is backfilling, 8.3a, it shows its state as:
```
"recovery_state": [
    {
        "name": "Started/Primary/Active",
        "enter_time": "2020-09-19T20:45:44.027759+0000",
        "might_have_unfound": [],
        "recovery_progress": {
            "backfill_targets": [
                "30(3)",
                "32(0)",
                "62(6)",
                "72(1)",
                "84(7)",
                "103(4)",
                "120(8)",
                "131(5)",
                "151(2)"
            ],
```
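The recovery_state above is an excerpt from the output of:

```shell
# Query detailed state for a single PG
ceph pg 8.3a query
```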
Q1: Is there anything that I should check/fix to enable the PGs to
resolve from the `unclean` state?
Q2: I have also seen that the podman containers on one of our OSD
servers are taking large amounts of disk space. Is there a way to
limit the growth of disk space for podman containers, when
administering a Ceph cluster using `cephadm` tools? At last check, a
server running 16 OSDs and 1 MON is using 39G of disk space for its
running containers. Can restarting containers help to start with a
fresh slate or reduce the disk use?
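I was considering something like the following to see where the space
goes and to reclaim unused image layers, though I'm not sure how safe
pruning is on a cephadm-managed host:

```shell
# Break down podman's disk usage by images, containers, and volumes
podman system df

# Remove unused images (caution: cephadm may re-pull on redeploy)
podman image prune -a
```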
Thanks,
Matt
------------------------
Matt Larson
Associate Scientist
Computer Scientist/System Administrator
UW-Madison Cryo-EM Research Center
433 Babcock Drive, Madison, WI 53706
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io