On Tue, Nov 28, 2023 at 3:52 PM Anthony D'Atri <aad(a)dreamsnake.net> wrote:
> Very small and/or non-uniform clusters can be corner cases for many things,
> especially if they don’t have enough PGs. What is your failure domain — host
> or OSD?
Failure domain is host, and the PG counts should be fairly reasonable.
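If it helps anyone checking the same thing, the failure domain can be
confirmed from the CRUSH rule dump; a minimal check, grepping the rule JSON
for names and bucket types:

  # a chooseleaf step with "type": "host" means host is the failure domain
  # (the numeric "type" at the top of each rule is the rule type, not the
  # failure domain)
  ceph osd crush rule dump | grep -E '"rule_name"|"type"'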
> Are your OSDs sized uniformly? Please send the output of the following commands:
OSDs are definitely not uniform in size. That might be what's tripping up
the automation.
You asked for it, but I do apologize for the wall of text that follows...
> `ceph osd tree`
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 131.65762 root default
-25 16.46977 host k8s1
14 hdd 5.45799 osd.14 up 0.90002 1.00000
19 hdd 10.91409 osd.19 up 1.00000 1.00000
22 ssd 0.09769 osd.22 up 1.00000 1.00000
-13 25.56458 host k8s3
2 hdd 10.91409 osd.2 up 0.84998 1.00000
3 hdd 1.81940 osd.3 up 0.75002 1.00000
20 hdd 12.73340 osd.20 up 1.00000 1.00000
10 ssd 0.09769 osd.10 up 1.00000 1.00000
-14 12.83107 host k8s4
0 hdd 10.91399 osd.0 up 1.00000 1.00000
5 hdd 1.81940 osd.5 up 1.00000 1.00000
11 ssd 0.09769 osd.11 up 1.00000 1.00000
-2 14.65048 host k8s5
1 hdd 1.81940 osd.1 up 0.70001 1.00000
17 hdd 12.73340 osd.17 up 1.00000 1.00000
12 ssd 0.09769 osd.12 up 1.00000 1.00000
-6 14.65048 host k8s6
4 hdd 1.81940 osd.4 up 0.75000 1.00000
16 hdd 12.73340 osd.16 up 0.95001 1.00000
13 ssd 0.09769 osd.13 up 1.00000 1.00000
-3 23.74518 host k8s7
6 hdd 12.73340 osd.6 up 1.00000 1.00000
15 hdd 10.91409 osd.15 up 0.95001 1.00000
8 ssd 0.09769 osd.8 up 1.00000 1.00000
-9 23.74606 host k8s8
7 hdd 14.55269 osd.7 up 1.00000 1.00000
18 hdd 9.09569 osd.18 up 1.00000 1.00000
9 ssd 0.09769 osd.9 up 1.00000 1.00000
> so that we can see the topology.
> `ceph -s`
Note that this cluster is in the middle of re-creating all of its OSDs to
change the OSD allocation size. I have scrubbing disabled because I'm
rewriting just about everything in the cluster weekly right now; normally
it would be on.
  cluster:
    id:     ba455d73-116e-4f24-8a34-a45e3ba9f44c
    health: HEALTH_WARN
            noscrub,nodeep-scrub flag(s) set
            546 pgs not deep-scrubbed in time
            542 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum e,f,g (age 7d)
    mgr: a(active, since 7d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 22 osds: 22 up (since 5h), 22 in (since 33h); 101 remapped pgs
         flags noscrub,nodeep-scrub
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   13 pools, 617 pgs
    objects: 9.36M objects, 33 TiB
    usage:   67 TiB used, 65 TiB / 132 TiB avail
    pgs:     1778936/21708668 objects misplaced (8.195%)
             516 active+clean
             100 active+remapped+backfill_wait
               1 active+remapped+backfilling

  io:
    client:   371 KiB/s rd, 2.8 MiB/s wr, 2 op/s rd, 7 op/s wr
    recovery: 25 MiB/s, 6 objects/s

  progress:
    Global Recovery Event (7d)
      [=======================.....] (remaining: 36h)
> `ceph osd df`
Note that these are not in a steady state right now. OSD 6 in particular
was just re-created and is repopulating. A few of the reweights were set to
deal with some gross imbalances; when it all settles down I plan to
optimize them.
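Those overrides are plain `ceph osd reweight` values, e.g.:

  # temporarily weight osd.14 down to ~90% (matches its REWEIGHT below)
  ceph osd reweight 14 0.90002
  # reset to the default later so the balancer can take over
  ceph osd reweight 14 1.0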
ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL     %USE   VAR   PGS  STATUS
14  hdd     5.45799   0.90002  5.5 TiB  3.0 TiB  3.0 TiB  2.0 MiB   11 GiB  2.4 TiB   55.51  1.09   72  up
19  hdd    10.91409   1.00000   11 TiB  6.2 TiB  6.2 TiB  3.1 MiB   16 GiB  4.7 TiB   57.12  1.12  144  up
22  ssd     0.09769   1.00000  100 GiB  2.4 GiB  1.8 GiB  167 MiB  504 MiB   98 GiB    2.43  0.05   32  up
 2  hdd    10.91409   0.84998   11 TiB  4.5 TiB  4.5 TiB  5.0 MiB  9.7 GiB  6.4 TiB   41.11  0.81   99  up
 3  hdd     1.81940   0.75002  1.8 TiB  1.0 TiB  1.0 TiB  2.3 MiB  3.8 GiB  818 GiB   56.11  1.10   21  up
20  hdd    12.73340   1.00000   13 TiB  7.1 TiB  7.1 TiB  3.7 MiB   16 GiB  5.6 TiB   56.01  1.10  165  up
10  ssd     0.09769   1.00000  100 GiB  1.3 GiB  299 MiB  185 MiB  835 MiB   99 GiB    1.29  0.03   38  up
 0  hdd    10.91399   1.00000   11 TiB  6.5 TiB  6.5 TiB  3.7 MiB   15 GiB  4.4 TiB   59.41  1.17  144  up
 5  hdd     1.81940   1.00000  1.8 TiB  845 GiB  842 GiB  1.7 MiB  3.3 GiB  1018 GiB  45.36  0.89   23  up
11  ssd     0.09769   1.00000  100 GiB  3.1 GiB  1.3 GiB  157 MiB  1.6 GiB   97 GiB    3.09  0.06   33  up
 1  hdd     1.81940   0.70001  1.8 TiB  983 GiB  979 GiB  1.3 MiB  3.4 GiB  880 GiB   52.76  1.04   26  up
17  hdd    12.73340   1.00000   13 TiB  7.3 TiB  7.2 TiB  3.6 MiB   15 GiB  5.5 TiB   56.95  1.12  159  up
12  ssd     0.09769   1.00000  100 GiB  1.5 GiB  120 MiB   55 MiB  1.3 GiB   99 GiB    1.49  0.03   21  up
 4  hdd     1.81940   0.75000  1.8 TiB  1.0 TiB  1.0 TiB  2.5 MiB  3.0 GiB  820 GiB   55.98  1.10   24  up
16  hdd    12.73340   0.95001   13 TiB  7.6 TiB  7.5 TiB  7.9 MiB   16 GiB  5.2 TiB   59.32  1.17  171  up
13  ssd     0.09769   1.00000  100 GiB  2.4 GiB  528 MiB  196 MiB  1.7 GiB   98 GiB    2.38  0.05   33  up
 6  hdd    12.73340   1.00000   13 TiB  1.7 TiB  1.7 TiB  1.3 MiB  4.5 GiB   11 TiB   13.66  0.27   48  up
15  hdd    10.91409   0.95001   11 TiB  6.5 TiB  6.5 TiB  5.2 MiB   13 GiB  4.4 TiB   59.42  1.17  155  up
 8  ssd     0.09769   1.00000  100 GiB  1.9 GiB  1.1 GiB  116 MiB  788 MiB   98 GiB    1.95  0.04   26  up
 7  hdd    14.55269   1.00000   15 TiB  7.8 TiB  7.7 TiB  3.9 MiB   16 GiB  6.8 TiB   53.32  1.05  172  up
18  hdd     9.09569   1.00000  9.1 TiB  4.9 TiB  4.9 TiB  3.9 MiB   11 GiB  4.2 TiB   53.96  1.06  109  up
 9  ssd     0.09769   1.00000  100 GiB  2.2 GiB  391 MiB  264 MiB  1.6 GiB   98 GiB    2.25  0.04   40  up
                         TOTAL  132 TiB   67 TiB   67 TiB  1.2 GiB  164 GiB   65 TiB   50.82
MIN/MAX VAR: 0.03/1.17  STDDEV: 29.78
> `ceph osd dump | grep pool`
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 7 object_hash
rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 12539 flags
hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
read_balance_score 6.98
pool 2 'myfs-metadata' replicated size 3 min_size 2 crush_rule 25
object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on
last_change 32432 lfor 0/0/31 flags hashpspool stripe_width 0
pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application
cephfs read_balance_score 2.19
pool 3 'myfs-replicated' replicated size 2 min_size 1 crush_rule 26
object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn
last_change 32511 lfor 0/21361/21359 flags
hashpspool,selfmanaged_snaps stripe_width 0 application cephfs
read_balance_score 1.99
pool 4 'pvc-generic-pool' replicated size 3 min_size 2 crush_rule 17
object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on
last_change 32586 lfor 0/0/5211 flags hashpspool,selfmanaged_snaps
stripe_width 0 application rbd read_balance_score 3.26
pool 13 'myfs-eck2m2' erasure profile myfs-eck2m2_ecprofile size 4
min_size 3 crush_rule 8 object_hash rjenkins pg_num 128 pgp_num 128
autoscale_mode warn last_change 32511 lfor 0/8517/8518 flags
hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 8192
application cephfs
pool 22 'my-store.rgw.otp' replicated size 3 min_size 2 crush_rule 24
object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change
32431 flags hashpspool stripe_width 0 pg_num_min 8 application
rook-ceph-rgw read_balance_score 1.75
pool 23 'my-store.rgw.buckets.index' replicated size 3 min_size 2
crush_rule 22 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode
on last_change 32431 flags hashpspool stripe_width 0 pg_num_min 8
application rook-ceph-rgw read_balance_score 2.63
pool 24 'my-store.rgw.log' replicated size 3 min_size 2 crush_rule 23
object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change
32431 flags hashpspool stripe_width 0 pg_num_min 8 application
rook-ceph-rgw read_balance_score 2.63
pool 25 'my-store.rgw.control' replicated size 3 min_size 2 crush_rule
19 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on
last_change 32432 flags hashpspool stripe_width 0 pg_num_min 8
application rook-ceph-rgw read_balance_score 1.75
pool 26 '.rgw.root' replicated size 3 min_size 2 crush_rule 18
object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change
32432 flags hashpspool stripe_width 0 pg_num_min 8 application
rook-ceph-rgw read_balance_score 3.50
pool 27 'my-store.rgw.buckets.non-ec' replicated size 3 min_size 2
crush_rule 20 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode
on last_change 32431 flags hashpspool stripe_width 0 pg_num_min 8
application rook-ceph-rgw read_balance_score 2.62
pool 28 'my-store.rgw.meta' replicated size 3 min_size 2 crush_rule 21
object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change
32431 flags hashpspool stripe_width 0 pg_num_min 8 application
rook-ceph-rgw read_balance_score 1.74
pool 29 'my-store.rgw.buckets.data' erasure profile
my-store.rgw.buckets.data_ecprofile size 4 min_size 3 crush_rule 16
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on
last_change 32433 lfor 0/0/13673 flags hashpspool,ec_overwrites
stripe_width 8192 application rook-ceph-rgw
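In case the EC layout matters, the profiles referenced above can be dumped with:

  # shows k/m, crush-device-class, etc. for each erasure-coded pool
  ceph osd erasure-code-profile get myfs-eck2m2_ecprofile
  ceph osd erasure-code-profile get my-store.rgw.buckets.data_ecprofile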
> `ceph balancer status`
This does have normal output when the cluster isn't in the middle of recovery.
{
    "active": true,
    "last_optimize_duration": "0:00:00.000107",
    "last_optimize_started": "Tue Nov 28 22:11:56 2023",
    "mode": "upmap",
    "no_optimization_needed": true,
    "optimize_result": "Too many objects (0.081907 > 0.050000) are misplaced; try again later",
    "plans": []
}
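That 0.050000 threshold should be the mgr's target_max_misplaced_ratio. If
you wanted the balancer to keep optimizing through a backfill like this, you
could cautiously raise it (0.09 here is just an illustrative value):

  # allow optimization with up to 9% misplaced objects instead of the 5%
  # default; revert when recovery completes
  ceph config set mgr target_max_misplaced_ratio 0.09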
> `ceph osd pool autoscale-status`
No output for this, and I'm not sure why; it has given output in the past.
It might be due to being in the middle of recovery, or it might be a Reef
issue (I don't think I've looked at this since upgrading). In any case, the
PG counts are in the osd dump above, and I think I have the pools on the
hdd storage class set to warn.
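One thing I might check, though I'm not sure it's the cause here, is whether
any pools ended up with overlapping CRUSH roots, which is known to make the
autoscaler go quiet. The per-pool settings are still queryable either way:

  # print each pool's autoscale mode even when autoscale-status is empty
  for p in $(ceph osd pool ls); do
      echo -n "$p: "
      ceph osd pool get "$p" pg_autoscale_mode
  done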
> The balancer module can be confounded by certain complex topologies, like
> multiple device classes and/or CRUSH roots. Since you’re using Rook, I wonder
> if you might be hitting something that I’ve seen myself; the above commands
> will tell the tale.
Yeah, if it is designed for equally-sized OSDs then it isn't going to
work quite right for me. I do try to keep hosts reasonably balanced,
but not individual OSDs.
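For what it's worth, the upmap balancer works in PGs rather than bytes, so
non-uniform OSDs shouldn't break it by themselves; if it gets close but not
close enough, the knob to look at is the allowed per-OSD PG deviation
(default is 5, if I remember right):

  # tighten the balancer's tolerance to 1 PG per OSD
  ceph config set mgr mgr/balancer/upmap_max_deviation 1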
--
Rich