Thanks again Frank. That gives me something to digest (and try to understand).
One question regarding maintenance mode: these are production systems that are required to
be available all the time. What, exactly, will happen if I issue this command for
maintenance mode?
Thanks,
Mark
On Thu, 2020-10-29 at 07:51 +0000, Frank Schilder wrote:
CephFS pools are uncritical, because CephFS splits very large files into chunks of
object size. The RGW pool is the problem, because as far as I know RGW does not. A few 1TB
uploads and you have a problem.
The calculation is confusing, because the term PG is used in two different meanings,
unfortunately. The pool PG count and OSD PG count are different things. A PG is a virtual
raid set distributed over some OSDs. The number of PGs in a pool is the count of such raid
sets. The PG count for an OSD is in fact the PG membership count, something completely
different: it says how many PGs an OSD is a member of. To create 100 PGs with
replication 3 you need 3x100=300 PG memberships. If you have 3 OSDs, these will have 100
PG memberships each. This is shown as PGs in the utilisation columns. If these terms were
used with a bit more precision, it would be less confusing.
If the data distribution will remain more or less the same in the near future, changing
the PG count as follows should help:
Assuming that you have 20 OSDs (OSD 1 seems to be gone), increasing the PG count for pool
20 from 64 to 512 will require 2x(512-64)=896 additional PG memberships. Distributed over
20 OSDs, this is on average 44.8 memberships per OSD. This will leave PG memberships
available for the future and should sort out your distribution problem.
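As a quick sanity check, the arithmetic above can be sketched in shell (numbers taken from this thread: pool 20 has replication size 2, and 20 OSDs are in service):

```shell
# PG-membership arithmetic from the paragraph above.
OLD_PG=64       # current pg_num of pool 20
NEW_PG=512      # proposed pg_num
SIZE=2          # replication size of pool 20
OSDS=20         # OSD 1 appears to be gone

ADDED=$(( SIZE * (NEW_PG - OLD_PG) ))   # additional PG memberships
echo "additional memberships: $ADDED"
echo "$ADDED $OSDS" | awk '{ printf "per OSD: %.1f\n", $1 / $2 }'
```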
If you want to follow this route, you can do the following:
- ceph osd set noout # maintenance mode
- ceph osd set norebalance # prevent immediate start of rebalancing
- increase pg_num and pgp_num of pool 20 to 512
- increase the reweight of osd.3 to, say, 0.8
- wait for peering to finish and any recovery to complete
- ceph osd unset noout # leave maintenance mode
- if everything is OK (all PGs active, no degraded objects, no recovery), do ceph osd unset norebalance
- once the rebalancing is finished, reweight the OSDs manually; the built-in reweight commands are a bit limited
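As a sketch (not a definitive runbook), the steps above might look like this on a Jewel cluster, assuming pool 20 is default.rgw.buckets.data as in the "ceph df" output further down:

```shell
# Sketch of the maintenance sequence above. Jewel may force you to raise
# pg_num in several smaller increments rather than 64 -> 512 in one step.
ceph osd set noout             # don't mark OSDs "out" if one restarts
ceph osd set norebalance       # hold back rebalancing for now

ceph osd pool set default.rgw.buckets.data pg_num 512
ceph osd pool set default.rgw.buckets.data pgp_num 512   # after pg_num

ceph osd reweight osd.3 0.8    # relax the 0.45 override somewhat

# wait for peering and any recovery to finish, then:
ceph osd unset noout
# only when all PGs are active, nothing degraded, no recovery running:
ceph osd unset norebalance
```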
is that just a matter of "ceph osd reweight osd.3 1"
Yes, that will do. However, increase it in less aggressive steps. You will need some
rebalancing, because you are running a bit low on available space.
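For illustration, a less aggressive reweight schedule might look like the following (the step sizes here are an assumption, not a recipe from this thread):

```shell
# Raise osd.3's reweight in smaller steps instead of jumping straight to 1.
ceph osd reweight osd.3 0.6
# watch "ceph -s" until rebalancing settles, then continue:
ceph osd reweight osd.3 0.8
# ...and eventually:
ceph osd reweight osd.3 1
```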
As a final note, running with size 2 min size 1 is a serious data redundancy risk. You
should get another server and upgrade to 3(2).
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Mark Johnson <markj@iovox.com>
Sent: 29 October 2020 08:19:01
To: ceph-users@ceph.io; Frank Schilder
Subject: Re: pgs stuck backfill_toofull
Thanks for your swift reply. Below is the requested information.
I understand the bit about not being able to reduce the pg count as we've come across
this issue once before. This is the reason I've been hesitant to make any changes
there without being 100% certain of getting it right and the impact of these changes.
That, and the more I read about how to calculate this, the more confused I get. As for
the reweight, is that just a matter of "ceph osd reweight osd.3 1" once the
other issues are sorted out (or perhaps start with a less dramatic change and work up)?
Also, presuming I need to change the pg/pgp num, would you be suggesting on pool 2 based
on the below info (the pool with a few large files) or on pool 20 (the pool with the most
data but an average of about 250KB file size)? I'm just completely confused as to
what's caused this issue in the first place and how to go about fixing it. On top of
that, am I going to be able to increase the pg/pgp count with the cluster in a state of
health_warn? Just some posts I've read seem to indicate that the health state needs
to be OK before this sort of thing can be changed (but I could be misunderstanding what
I'm reading).
Anyway, here's the info:
# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
28219G 11227G 15558G 55.13
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
rbd 0 0 0 690G 0
KUBERNETES 1 122G 15.11 690G 34188
KUBERNETES_METADATA 2 49310k 0 690G 1426
default.rgw.control 11 0 0 690G 8
default.rgw.data.root 12 20076k 0 690G 54412
default.rgw.gc 13 0 0 690G 32
default.rgw.log 14 0 0 690G 127
default.rgw.users.uid 15 4942 0 690G 15
default.rgw.users.keys 16 126 0 690G 4
default.rgw.users.swift 17 252 0 690G 8
default.rgw.buckets.index 18 0 0 690G 27206
.rgw.root 19 1588 0 690G 4
default.rgw.buckets.data 20 7402G 91.47 690G 30931617
default.rgw.users.email 21 0 0 690G 0
# ceph osd pool ls detail
pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins
pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 1 'KUBERNETES' replicated size 2 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 100 pgp_num 100 last_change 17 flags hashpspool crash_replay_interval 45
stripe_width 0
pool 2 'KUBERNETES_METADATA' replicated size 2 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 100 pgp_num 100 last_change 16 flags hashpspool stripe_width
0
pool 11 'default.rgw.control' replicated size 2 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 4 pgp_num 4 last_change 68 flags hashpspool stripe_width 0
pool 12 'default.rgw.data.root' replicated size 2 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 4 pgp_num 4 last_change 69 flags hashpspool stripe_width 0
pool 13 'default.rgw.gc' replicated size 2 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 4 pgp_num 4 last_change 70 flags hashpspool stripe_width 0
pool 14 'default.rgw.log' replicated size 2 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 4 pgp_num 4 last_change 71 flags hashpspool stripe_width 0
pool 15 'default.rgw.users.uid' replicated size 2 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 4 pgp_num 4 last_change 72 flags hashpspool stripe_width 0
pool 16 'default.rgw.users.keys' replicated size 2 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 4 pgp_num 4 last_change 73 flags hashpspool stripe_width 0
pool 17 'default.rgw.users.swift' replicated size 2 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 4 pgp_num 4 last_change 74 flags hashpspool stripe_width 0
pool 18 'default.rgw.buckets.index' replicated size 2 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 4 pgp_num 4 last_change 75 flags hashpspool stripe_width 0
pool 19 '.rgw.root' replicated size 2 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 4 pgp_num 4 last_change 76 flags hashpspool stripe_width 0
pool 20 'default.rgw.buckets.data' replicated size 2 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 64 pgp_num 64 last_change 442 flags hashpspool stripe_width
0
pool 21 'default.rgw.users.email' replicated size 2 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 16 pgp_num 16 last_change 260 flags hashpspool stripe_width
0
On Thu, 2020-10-29 at 07:05 +0000, Frank Schilder wrote:
Hi Mark,
it looks like you have some very large PGs. Also, you run with quite a low PG count, in
particular, for the large pool. Please post the output of "ceph df" and
"ceph osd pool ls detail" to see how much data is in each pool and some pool
info. I guess you need to increase the PG count of the large pool to split PGs up and also
reduce the impact of imbalance. When I look at this:
3 1.37790 0.45013 1410G 1079G 259G 76.49 1.39 21
4 1.37790 0.95001 1410G 1086G 253G 76.98 1.40 44
I would conclude that the PGs are too large; the reweight of 0.45 with little effect on
utilisation indicates that. This weight will need to be rectified as well at some point.
You should be able to run with 100-200 PGs per OSD. Please be aware that PG planning
requires caution as you cannot reduce the PG count of a pool in your version. You need to
know how much data is in the pools right now and what the future plan is.
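For what it's worth, the usual rule-of-thumb for PG planning (a generic formula, not something stated in this thread) can be checked like this:

```shell
# Generic PG planning rule of thumb (an assumption, not from this thread):
# total PGs across all pools ~= (target per OSD * OSDs) / replication size,
# rounded up to a power of two, then divided among pools by expected data share.
TARGET_PER_OSD=100
OSDS=20
SIZE=2

RAW=$(( TARGET_PER_OSD * OSDS / SIZE ))
POW=1
while [ "$POW" -lt "$RAW" ]; do POW=$(( POW * 2 )); done
echo "raw target: $RAW, rounded up: $POW"
```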
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Mark Johnson <markj@iovox.com>
Sent: 29 October 2020 06:55:55
To: ceph-users@ceph.io
Subject: [ceph-users] pgs stuck backfill_toofull
I've been struggling with this one for a few days now. We had an OSD report as near
full a few days ago. Had this happen a couple of times before and a
reweight-by-utilization has sorted it out in the past. Tried the same again but this time
we ended up with a couple of pgs in a state of backfill_toofull and a handful of misplaced
objects as a result.
Tried doing the reweight a few more times and it's been moving data around. We did
have another osd trigger the near full alert but running the reweight a couple more times
seems to have moved some of that data around a bit better. However, the original
near_full osd doesn't seem to have changed much and the backfill_toofull pgs are still
there. I'd keep doing the reweight-by-utilization but I'm not sure if I'm
heading down the right path and if it will eventually sort it out.
We have 14 pools, but the vast majority of data resides in just one of those pools (pool
20). The pgs in the backfill state are in pool 2 (as far as I can tell). That particular
pool is used for some cephfs stuff and has a handful of large files in there (not sure if
this is significant to the problem).
All up, our utilization is showing as 55.13% but some of our OSDs are showing as 76% in
use, with this one problem OSD sitting at 85.02%. Right now, I'm just not sure what the
proper corrective action is. The last couple of reweights I've run have been a bit
more targeted in that I've set it to only function on two OSDs at a time. If I run a
test-reweight targeting only one OSD, it does say it will reweight OSD 9 (the one at
85.02%). I gather this will move data away from this OSD and potentially get it below the
threshold. However, at one point in the past couple of days, it's shown as no OSDs in
a near full state, yet the two pgs in backfill_toofull didn't change. So, that's
why I'm not sure continually reweighting is going to solve this issue.
I'm a long way from knowledgeable on Ceph so I'm not really sure what information
is useful here. Here's a bit of info on what I'm seeing. Can provide anything
else that might help.
Basically, we have a three node cluster but only two have OSDs. The third is there simply
to enable a quorum to be established. The OSDs are evenly spread across these two nodes
and the configuration of each is identical. We are running Jewel and are not in a
position to upgrade at this stage.
# ceph --version
ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
# ceph health detail
HEALTH_WARN 2 pgs backfill_toofull; 2 pgs stuck unclean; recovery 33/62099566 objects
misplaced (0.000%); 1 near full osd(s)
pg 2.52 is stuck unclean for 201822.031280, current state
active+remapped+backfill_toofull, last acting [17,3]
pg 2.18 is stuck unclean for 202114.617682, current state
active+remapped+backfill_toofull, last acting [18,2]
pg 2.18 is active+remapped+backfill_toofull, acting [18,2]
pg 2.52 is active+remapped+backfill_toofull, acting [17,3]
recovery 33/62099566 objects misplaced (0.000%)
osd.9 is near full at 85%
# ceph osd df
ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
2 1.37790 1.00000 1410G 842G 496G 59.75 1.08 33
3 1.37790 0.45013 1410G 1079G 259G 76.49 1.39 21
4 1.37790 0.95001 1410G 1086G 253G 76.98 1.40 44
5 1.37790 1.00000 1410G 617G 722G 43.74 0.79 43
6 1.37790 0.65009 1410G 616G 722G 43.69 0.79 39
7 1.37790 0.95001 1410G 495G 844G 35.10 0.64 40
8 1.37790 1.00000 1410G 732G 606G 51.93 0.94 52
9 1.37790 0.70007 1410G 1199G 139G 85.02 1.54 37
10 1.37790 1.00000 1410G 611G 727G 43.35 0.79 41
11 1.37790 0.75006 1410G 495G 843G 35.11 0.64 32
0 1.37790 1.00000 1410G 731G 608G 51.82 0.94 43
12 1.37790 1.00000 1410G 851G 487G 60.36 1.09 44
13 1.37790 1.00000 1410G 378G 960G 26.82 0.49 38
14 1.37790 1.00000 1410G 969G 370G 68.68 1.25 37
15 1.37790 1.00000 1410G 724G 614G 51.35 0.93 35
16 1.37790 1.00000 1410G 491G 847G 34.84 0.63 43
17 1.37790 1.00000 1410G 862G 476G 61.16 1.11 50
18 1.37790 0.80005 1410G 1083G 255G 76.78 1.39 26
19 1.37790 0.65009 1410G 963G 375G 68.29 1.24 23
20 1.37790 1.00000 1410G 724G 614G 51.38 0.93 42
TOTAL 28219G 15557G 11227G 55.13
MIN/MAX VAR: 0.49/1.54 STDDEV: 15.57
# ceph pg ls backfill_toofull
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up
up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
2.18 9 0 0 18 0 0 3653 3653 active+remapped+backfill_toofull 2020-10-29 05:31:20.429912
610'549153 656:390372 [9,12] 9 [18,2] 18 594'547482 2020-10-25 20:28:39.680744
594'543841 2020-10-21 21:21:33.092868
2.52 15 0 0 15 0 0 4883 4883 active+remapped+backfill_toofull 2020-10-29 05:31:28.277898
652'502085 656:367288 [17,9] 17 [17,3] 17 594'499108 2020-10-26 11:06:48.417825
594'499108 2020-10-26 11:06:48.417825
pool : 17 18 19 11 20 21 12 13 0 14 1 15 2 16 | SUM
--------------------------------------------------------------------------------------------------------------------------------
osd.4 3 0 0 0 9 2 0 0 12 1 9 0 7 1 | 44
osd.17 1 0 0 0 7 3 1 0 8 1 17 1 11 0 | 50
osd.18 0 0 0 0 9 0 0 0 4 0 7 0 5 0 | 25
osd.5 0 0 0 2 5 1 1 0 5 0 16 0 11 2 | 43
osd.6 0 1 0 1 5 2 0 0 9 0 13 1 7 0 | 39
osd.19 0 0 1 0 8 2 0 1 2 0 6 0 3 0 | 23
osd.7 0 0 0 0 4 1 1 0 3 0 12 0 19 0 | 40
osd.8 0 1 0 0 6 3 0 2 10 1 13 1 15 0 | 52
osd.9 1 0 2 0 10 2 0 0 4 1 6 1 10 0 | 37
osd.10 0 0 1 1 5 2 0 1 7 0 12 0 11 1 | 41
osd.20 1 0 0 0 6 1 0 1 7 0 8 1 17 0 | 42
osd.11 0 0 0 0 4 1 1 1 5 0 11 0 9 0 | 32
osd.12 0 0 1 1 7 1 0 0 5 1 12 1 14 1 | 44
osd.13 0 2 0 0 3 1 0 0 10 1 11 0 10 0 | 38
osd.0 0 1 0 1 6 3 0 1 7 0 11 0 13 0 | 43
osd.14 1 0 0 0 8 1 1 0 4 1 12 0 9 0 | 37
osd.15 1 0 2 1 6 1 1 0 8 0 7 0 6 2 | 35
osd.2 0 2 1 0 7 2 1 0 7 1 4 1 6 0 | 32
osd.3 0 0 0 0 9 0 0 0 2 0 4 0 5 0 | 20
osd.16 0 1 0 1 4 3 1 1 9 0 9 1 12 1 | 43
--------------------------------------------------------------------------------------------------------------------------------
SUM : 8 8 8 8 128 32 8 8 128 8 200 8 200 8 |
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io