[ceph-users] Re: pgs stuck backfill_toofull

29 Oct 2020

Hi Mark,

it looks like you have some very large PGs. You are also running with a quite low PG
count, in particular for the large pool. Please post the output of "ceph df" and
"ceph osd pool ls detail" so we can see how much data is in each pool, along with some
pool info. I guess you need to increase the PG count of the large pool to split the PGs up
and also reduce the impact of imbalance. When I look at this:

 3 1.37790  0.45013  1410G  1079G   259G 76.49 1.39  21
 4 1.37790  0.95001  1410G  1086G   253G 76.98 1.40  44

I would conclude that the PGs are too large; the reweight of 0.45 with little effect on
utilization indicates that. This reweight value will need to be rectified as well at some
point.

You should be able to run with 100-200 PGs per OSD. Please be aware that PG planning
requires caution as you cannot reduce the PG count of a pool in your version. You need to
know how much data is in the pools right now and what the future plan is.
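
To put a rough number on this, the PGS column of the "ceph osd df" output quoted further
down can be averaged and compared with the 100-200 range. The following is only a Python
sketch: the values are copied by hand from that output, and the variable names are made up
for illustration.

```python
# PGS column from the "ceph osd df" output in the quoted message (20 OSDs).
# Values are copied by hand; nothing here talks to a real cluster.
pgs_per_osd = [33, 21, 44, 43, 39, 40, 52, 37, 41, 32,
               43, 44, 38, 37, 35, 43, 50, 26, 23, 42]

# The 100-200 PGs per OSD range mentioned above.
target_low, target_high = 100, 200

total_pg_copies = sum(pgs_per_osd)
mean_pgs = total_pg_copies / len(pgs_per_osd)

print(f"OSDs: {len(pgs_per_osd)}, PG copies in total: {total_pg_copies}")
print(f"mean PGs per OSD: {mean_pgs:.1f} (target {target_low}-{target_high})")
print(f"factor to reach the lower end of the target: {target_low / mean_pgs:.1f}x")
```

The mean comes out at roughly 38 PGs per OSD, well below the range above, which is
consistent with a low PG count.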

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Mark Johnson <markj(a)iovox.com>
Sent: 29 October 2020 06:55:55
To: ceph-users(a)ceph.io
Subject: [ceph-users] pgs stuck backfill_toofull

I've been struggling with this one for a few days now.  We had an OSD report as near
full a few days ago.  Had this happen a couple of times before and a
reweight-by-utilization has sorted it out in the past.  Tried the same again but this time
we ended up with a couple of pgs in a state of backfill_toofull and a handful of misplaced
objects as a result.

Tried doing the reweight a few more times and it's been moving data around.  We did
have another osd trigger the near full alert but running the reweight a couple more times
seems to have moved some of that data around a bit better.  However, the original
near_full osd doesn't seem to have changed much and the backfill_toofull pgs are still
there.  I'd keep doing the reweight-by-utilization but I'm not sure if I'm
heading down the right path and if it will eventually sort it out.

We have 14 pools, but the vast majority of data resides in just one of those pools (pool
20).  The pgs in the backfill state are in pool 2 (as far as I can tell).  That particular
pool is used for some cephfs stuff and has a handful of large files in there (not sure if
this is significant to the problem).

All up, our utilization is showing as 55.13% but some of our OSDs are showing as 76% in
use with this one problem sitting at 85.02%.  Right now, I'm just not sure what the
proper corrective action is.  The last couple of reweights I've run have been a bit
more targeted in that I've set it to only function on two OSDs at a time.  If I run a
test-reweight targeting only one osd, it does say it will reweight OSD 9 (the one at
85.02%).  I gather this will move data away from this OSD and potentially get it below the
threshold.  However, at one point in the past couple of days, it showed no OSDs in
a near full state, yet the two pgs in backfill_toofull didn't change.  So, that's
why I'm not sure continually reweighting is going to solve this issue.

I'm a long way from knowledgeable on Ceph so I'm not really sure what information
is useful here.  Here's a bit of info on what I'm seeing.  Can provide anything
else that might help.

Basically, we have a three node cluster but only two have OSDs.  The third is there simply
to enable a quorum to be established.  The OSDs are evenly spread across these two nodes
and the configuration of each is identical.  We are running Jewel and are not in a
position to upgrade at this stage.

# ceph --version
ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)

# ceph health detail
HEALTH_WARN 2 pgs backfill_toofull; 2 pgs stuck unclean; recovery 33/62099566 objects misplaced (0.000%); 1 near full osd(s)
pg 2.52 is stuck unclean for 201822.031280, current state active+remapped+backfill_toofull, last acting [17,3]
pg 2.18 is stuck unclean for 202114.617682, current state active+remapped+backfill_toofull, last acting [18,2]
pg 2.18 is active+remapped+backfill_toofull, acting [18,2]
pg 2.52 is active+remapped+backfill_toofull, acting [17,3]
recovery 33/62099566 objects misplaced (0.000%)
osd.9 is near full at 85%

# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
 2 1.37790  1.00000  1410G   842G   496G 59.75 1.08  33
 3 1.37790  0.45013  1410G  1079G   259G 76.49 1.39  21
 4 1.37790  0.95001  1410G  1086G   253G 76.98 1.40  44
 5 1.37790  1.00000  1410G   617G   722G 43.74 0.79  43
 6 1.37790  0.65009  1410G   616G   722G 43.69 0.79  39
 7 1.37790  0.95001  1410G   495G   844G 35.10 0.64  40
 8 1.37790  1.00000  1410G   732G   606G 51.93 0.94  52
 9 1.37790  0.70007  1410G  1199G   139G 85.02 1.54  37
10 1.37790  1.00000  1410G   611G   727G 43.35 0.79  41
11 1.37790  0.75006  1410G   495G   843G 35.11 0.64  32
 0 1.37790  1.00000  1410G   731G   608G 51.82 0.94  43
12 1.37790  1.00000  1410G   851G   487G 60.36 1.09  44
13 1.37790  1.00000  1410G   378G   960G 26.82 0.49  38
14 1.37790  1.00000  1410G   969G   370G 68.68 1.25  37
15 1.37790  1.00000  1410G   724G   614G 51.35 0.93  35
16 1.37790  1.00000  1410G   491G   847G 34.84 0.63  43
17 1.37790  1.00000  1410G   862G   476G 61.16 1.11  50
18 1.37790  0.80005  1410G  1083G   255G 76.78 1.39  26
19 1.37790  0.65009  1410G   963G   375G 68.29 1.24  23
20 1.37790  1.00000  1410G   724G   614G 51.38 0.93  42
              TOTAL 28219G 15557G 11227G 55.13
MIN/MAX VAR: 0.49/1.54  STDDEV: 15.57
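
For reference, the VAR column above appears to be each OSD's %USE divided by the
cluster-wide %USE (55.13), e.g. 85.02 / 55.13 is about 1.54 for osd.9. The Python sketch
below recomputes it from values copied out of the table and lists the OSDs sitting above
an overload threshold of 120% of the average. That threshold is an assumption on my part
(I believe it is the default for reweight-by-utilization, but please check it for your
version), and the variable names are purely illustrative.

```python
# %USE per OSD id, copied by hand from the "ceph osd df" output above.
osd_use = {2: 59.75, 3: 76.49, 4: 76.98, 5: 43.74, 6: 43.69, 7: 35.10,
           8: 51.93, 9: 85.02, 10: 43.35, 11: 35.11, 0: 51.82, 12: 60.36,
           13: 26.82, 14: 68.68, 15: 51.35, 16: 34.84, 17: 61.16, 18: 76.78,
           19: 68.29, 20: 51.38}

avg_use = 55.13  # %USE from the TOTAL line
oload = 1.20     # ASSUMPTION: overload threshold of 120% of the average

# VAR = utilization of the OSD relative to the cluster average.
var = {osd: use / avg_use for osd, use in osd_use.items()}

# OSDs whose relative utilization exceeds the assumed threshold.
overloaded = sorted(osd for osd, v in var.items() if v > oload)

for osd in overloaded:
    print(f"osd.{osd}: %USE={osd_use[osd]:.2f} VAR={var[osd]:.2f}")
```

With the numbers above this lists OSDs 3, 4, 9, 14, 18 and 19.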

# ceph pg ls backfill_toofull
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
2.18 9 0 0 18 0 0 3653 3653 active+remapped+backfill_toofull 2020-10-29 05:31:20.429912 610'549153 656:390372 [9,12] 9 [18,2] 18 594'547482 2020-10-25 20:28:39.680744 594'543841 2020-10-21 21:21:33.092868
2.52 15 0 0 15 0 0 4883 4883 active+remapped+backfill_toofull 2020-10-29 05:31:28.277898 652'502085 656:367288 [17,9] 17 [17,3] 17 594'499108 2020-10-26 11:06:48.417825 594'499108 2020-10-26 11:06:48.417825
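
One way to read the up and acting sets above: OSDs that are in "up" but not in "acting"
still have to receive the data, i.e. they are the backfill targets. The Python sketch
below checks those targets against a backfill-full limit of 85%. That limit is an
assumption (I believe osd_backfill_full_ratio defaults to 0.85 on this release, but have
not verified it for this cluster), and the function name is made up for illustration.

```python
# up/acting sets copied from the "ceph pg ls backfill_toofull" output above.
pgs = {
    "2.18": {"up": [9, 12], "acting": [18, 2]},
    "2.52": {"up": [17, 9], "acting": [17, 3]},
}

# %USE of the OSDs involved, copied from the "ceph osd df" output.
use_pct = {2: 59.75, 3: 76.49, 9: 85.02, 12: 60.36, 17: 61.16, 18: 76.78}

# ASSUMPTION: backfill is refused once the target OSD is at or above this %USE.
BACKFILL_FULL_PCT = 85.0


def blocked_targets(pgs, use_pct, limit):
    """Return, per PG, the backfill targets that are at or above the limit."""
    result = {}
    for pgid, sets in pgs.items():
        # OSDs in "up" but not in "acting" are the ones that must be backfilled.
        targets = [osd for osd in sets["up"] if osd not in sets["acting"]]
        result[pgid] = [osd for osd in targets if use_pct[osd] >= limit]
    return result


blocked = blocked_targets(pgs, use_pct, BACKFILL_FULL_PCT)
for pgid, osds in blocked.items():
    print(f"pg {pgid}: backfill targets that are too full: {osds}")
```

Under that assumption both stuck PGs are waiting on osd.9, which is the OSD reported at
85.02%.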

pool   :  17  18  19  11  20  21  12  13   0  14   1  15   2  16 | SUM
----------------------------------------------------------------------
osd.4      3   0   0   0   9   2   0   0  12   1   9   0   7   1 |  44
osd.17     1   0   0   0   7   3   1   0   8   1  17   1  11   0 |  50
osd.18     0   0   0   0   9   0   0   0   4   0   7   0   5   0 |  25
osd.5      0   0   0   2   5   1   1   0   5   0  16   0  11   2 |  43
osd.6      0   1   0   1   5   2   0   0   9   0  13   1   7   0 |  39
osd.19     0   0   1   0   8   2   0   1   2   0   6   0   3   0 |  23
osd.7      0   0   0   0   4   1   1   0   3   0  12   0  19   0 |  40
osd.8      0   1   0   0   6   3   0   2  10   1  13   1  15   0 |  52
osd.9      1   0   2   0  10   2   0   0   4   1   6   1  10   0 |  37
osd.10     0   0   1   1   5   2   0   1   7   0  12   0  11   1 |  41
osd.20     1   0   0   0   6   1   0   1   7   0   8   1  17   0 |  42
osd.11     0   0   0   0   4   1   1   1   5   0  11   0   9   0 |  32
osd.12     0   0   1   1   7   1   0   0   5   1  12   1  14   1 |  44
osd.13     0   2   0   0   3   1   0   0  10   1  11   0  10   0 |  38
osd.0      0   1   0   1   6   3   0   1   7   0  11   0  13   0 |  43
osd.14     1   0   0   0   8   1   1   0   4   1  12   0   9   0 |  37
osd.15     1   0   2   1   6   1   1   0   8   0   7   0   6   2 |  35
osd.2      0   2   1   0   7   2   1   0   7   1   4   1   6   0 |  32
osd.3      0   0   0   0   9   0   0   0   2   0   4   0   5   0 |  20
osd.16     0   1   0   1   4   3   1   1   9   0   9   1  12   1 |  43
----------------------------------------------------------------------
SUM    :   8   8   8   8 128  32   8   8 128   8 200   8 200   8 |
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io
