Partially answering my own question. I think it is possible to tweak the existing
parameters to achieve what I'm looking for on average. The main reason I want to use
the internal scheduler is the high number of PGs on some pools, which I actually intend to
increase even further. For such pools a simple calculation shows that it is impractical to
do manual scrubbing with cron, I simply cannot execute cron jobs often enough to achieve a
reasonable scrub distribution (looking at a script like
https://gist.github.com/ethaniel/5db696d9c78516308b235b0cb904e4ad).
Looking at scrub stamp distributions for specific pools, PGs with old deep-scrub time
stamps tend to correlate with PGs with old scrub stamps. The idea now is to
tweak scrub_min_interval such that the scrub scheduler is forced to select PGs out of the
20-30% with the oldest scrub stamps. This should imply that, after a reasonable time
interval, the age of the oldest deep-scrub stamps is reduced, because PGs that have not
been deep-scrubbed for a long time become much more likely to be scheduled for a deep-scrub.
This adjustment is done together with making osd_deep_scrub_randomize_ratio and
osd_scrub_backoff_ratio more aggressive, to cycle more frequently through the small list of
PGs eligible for scrubbing. This is a bit like the reverse calculation for achieving the
effect of a (not implemented) deep_scrub_min_interval.
I made the following global changes (and hope the parameters do something like what their
documentation says):
global advanced osd_deep_scrub_randomize_ratio 0.330000
global dev osd_scrub_backoff_ratio 0.500000
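For reference, I applied these via the centralized config database (this assumes a release that supports `ceph config set`; the values match the table above):

```shell
# make a larger fraction of scrubs deep-scrubs, and let the scheduler
# back off less often between scrub attempts
ceph config set global osd_deep_scrub_randomize_ratio 0.33
ceph config set global osd_scrub_backoff_ratio 0.5
```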
With this setting, about 33% of all scrubs should be deep-scrubs, meaning that on average
a PG is also deep-scrubbed after every 3 scrub events. This leads to the following
estimate for the expected deep-scrub interval:
given that the scrub interval is scrub_min_interval*[1, 1+osd_scrub_interval_randomize_ratio] =
scrub_min_interval*[1, 1.5], the expected deep-scrub interval is (assuming worst-case
realisation of the randomize ratio for the upper value):
scrub_min_interval*[1, 1.5]*3 = scrub_min_interval*[3, 4.5]
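As a quick sanity check of this estimate (a Monte-Carlo sketch only, not Ceph code): with scrub delays drawn uniformly from [1, 1.5] scrub_min_intervals and each scrub promoted to a deep-scrub with probability 0.33, the average deep-scrub interval should come out near 1.25/0.33 ≈ 3.8, inside the [3, 4.5] window:

```shell
# Monte-Carlo check: scrub delay uniform in [1, 1.5] (in units of
# scrub_min_interval), each scrub is a deep-scrub with probability 0.33.
awk 'BEGIN {
    srand(42)
    n = 200000; total = 0; deep = 0
    for (i = 0; i < n; ++i) {
        total += 1 + 0.5*rand()      # time until the next scrub
        if (rand() < 0.33) ++deep    # scrub promoted to deep-scrub
    }
    printf("%.2f\n", total/deep)     # expect about 1.25/0.33 ~ 3.8
}'
```

Note that the number of scrubs between deep-scrubs is geometric, so its tail decays exponentially; this suggests that really long tails would come from busy OSDs rather than from the randomization alone.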
I can't calculate what the tail will look like, but I hope it's not a fat tail. I will
report back what I observe; see below.
The idea now is to tune scrub_min_interval per pool such that only about 20-30% of PGs
have a scrub stamp older than scrub_min_interval. The scheduler will then cycle only through
these, and a bit faster than by default. As the stamp histograms included below indicate, the
distribution is probably very sensitive to changes of this interval. I have now changed these
values on some pools and already see that PGs with much older deep-scrub stamps are
selected for deep-scrubbing. I will observe what these settings converge to and report
back. It seems that this will lead to an improved stamp distribution, and one only needs to
issue manual deep-scrubs for the very few PGs that are outliers of the random number generator
(that's the tail I talked about above). My goal is to have a script schedule a
deep-scrub on the outliers no more often than daily.
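To pick such a per-pool scrub_min_interval, one needs the stamp age below which about 70% of PGs fall. A minimal sketch of that quantile computation (the function name and the toy input are made up; in practice the ages-in-hours would come from a jq extraction like the one in the report script below):

```shell
# Read one "scrub stamp age in hours" per line; print the age below which
# (1 - eligible_fraction) of the PGs fall. Use that as a candidate
# scrub_min_interval, so only eligible_fraction of PGs remain eligible.
eligible_cutoff() { # usage: ... | eligible_cutoff <eligible_fraction>
    sort -n | awk -v f="$1" '
        { a[NR] = $1 }
        END { print a[int(NR*(1-f)) + 1] }   # the (1-f) quantile of the ages
    '
}

# toy example: 10 PGs with scrub stamps 1..10 hours old
seq 1 10 | eligible_cutoff 0.3   # prints 8
```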
The reports below were pulled about 2-3h after the changed settings were applied.
There is already improvement in the right direction, but the original distribution issue
is still very pronounced.
Here are two per-pool scrub stamp distributions, for an SSD pool and an HDD pool, both with
a large number of PGs per OSD:
=== SSD pool:
Scrub info for pool sr-rbd-data-one (id=2): dumped pgs
Scrub report:
22% 941 PGs not scrubbed since 1 intervals ( 6h)
42% 795 PGs not scrubbed since 2 intervals ( 12h)
62% 829 PGs not scrubbed since 3 intervals ( 18h)
82% 823 PGs not scrubbed since 4 intervals ( 24h)
96% 576 PGs not scrubbed since 5 intervals ( 30h)
100% 132 PGs not scrubbed since 6 intervals ( 36h)
4096 PGs out of 4096 reported, 0 missing.
Deep-scrub report:
13% 545 PGs not deep-scrubbed since 1 intervals ( 24h)
25% 508 PGs not deep-scrubbed since 2 intervals ( 48h) 1 scrubbing+deep
45% 797 PGs not deep-scrubbed since 3 intervals ( 72h) 1 scrubbing+deep
59% 587 PGs not deep-scrubbed since 4 intervals ( 96h)
70% 463 PGs not deep-scrubbed since 5 intervals (120h)
78% 312 PGs not deep-scrubbed since 6 intervals (144h)
84% 263 PGs not deep-scrubbed since 7 intervals (168h)
89% 173 PGs not deep-scrubbed since 8 intervals (192h)
92% 151 PGs not deep-scrubbed since 9 intervals (216h)
95% 106 PGs not deep-scrubbed since 10 intervals (240h)
96% 55 PGs not deep-scrubbed since 11 intervals (264h)
97% 50 PGs not deep-scrubbed since 12 intervals (288h)
98% 44 PGs not deep-scrubbed since 13 intervals (312h)
99% 24 PGs not deep-scrubbed since 14 intervals (336h)
100% 18 PGs not deep-scrubbed since 15 intervals (360h)
4096 PGs out of 4096 reported, 0 missing.
PGs marked with a * are on busy OSDs and not eligible for scrubbing.
sr-rbd-data-one scrub_min_interval=0h
sr-rbd-data-one scrub_max_interval=0h
sr-rbd-data-one deep_scrub_interval=0h
===
Here we see that after 24h 82% of PGs are scrubbed, but we have quite a tail of not
deep-scrubbed PGs. It's long enough to trigger a warning with default parameters. In this
case, reducing scrub_min_interval to a value around 18-20h could shorten the tail enough.
The alternative is simply to schedule a deep-scrub on the oldest PGs manually (cron). Such
a scrub would start immediately, since no OSDs of this pool are allocated to scrubbing/recovery.
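Such a cron job could look roughly like this (an untested sketch; the pool id, the limit of 20 PGs per run and the use of jq are my choices, while `ceph pg deep-scrub` itself is a standard command):

```shell
#!/bin/bash
# Issue a manual deep-scrub for the N PGs of one pool with the oldest
# deep-scrub stamps. Intended to run at most once daily from cron.
POOL_ID=2        # numeric id of the pool (here: sr-rbd-data-one)
N_OUTLIERS=20    # how many outlier PGs to deep-scrub per run

ceph -f json pg dump pgs 2>/dev/null |
  jq -r --arg p "$POOL_ID" '
    .pg_stats[]
    | select(.pgid | startswith($p + "."))
    | [ .last_deep_scrub_stamp, .pgid ] | @tsv' |
  sort | head -n "$N_OUTLIERS" | cut -f2 |
  while read -r pgid; do
      ceph pg deep-scrub "$pgid"
  done
```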
=== HDD pool:
Scrub info for pool con-fs2-data2 (id=19): dumped pgs
Scrub report:
11% 939 PGs not scrubbed since 1 intervals ( 6h)
22% 936 PGs not scrubbed since 2 intervals ( 12h)
33% 874 PGs not scrubbed since 3 intervals ( 18h)
43% 821 PGs not scrubbed since 4 intervals ( 24h)
54% 931 PGs not scrubbed since 5 intervals ( 30h)
64% 766 PGs not scrubbed since 6 intervals ( 36h)
72% 646 PGs not scrubbed since 7 intervals ( 42h)
79% 559 PGs not scrubbed since 8 intervals ( 48h)
84% 411 PGs not scrubbed since 9 intervals ( 54h)
88% 346 PGs not scrubbed since 10 intervals ( 60h)
90% 222 PGs not scrubbed since 11 intervals ( 66h)
93% 213 PGs not scrubbed since 12 intervals ( 72h)
95% 160 PGs not scrubbed since 13 intervals ( 78h)
96% 87 PGs not scrubbed since 14 intervals ( 84h)
97% 77 PGs not scrubbed since 15 intervals ( 90h)
98% 57 PGs not scrubbed since 16 intervals ( 96h) 1 scrubbing
98% 42 PGs not scrubbed since 17 intervals (102h)
99% 32 PGs not scrubbed since 18 intervals (108h)
99% 19 PGs not scrubbed since 19 intervals (114h)
99% 19 PGs not scrubbed since 20 intervals (120h)
99% 10 PGs not scrubbed since 21 intervals (126h)
99% 1 PGs not scrubbed since 22 intervals (132h) 19.165f*
99% 5 PGs not scrubbed since 24 intervals (138h) 19.412* 19.75c* 19.140f* 19.134c* 19.fb7*
99% 5 PGs not scrubbed since 25 intervals (144h) 19.1714* 19.148d* 19.1fa9* 19.1f05* 19.1cda*
99% 1 PGs not scrubbed since 26 intervals (150h) 19.a3f*
99% 1 PGs not scrubbed since 27 intervals (156h) 19.a01*
99% 3 PGs not scrubbed since 28 intervals (162h) 19.12f2* 19.1284* 19.c90*
99% 1 PGs not scrubbed since 29 intervals (168h)
99% 1 PGs not scrubbed since 30 intervals (174h) 19.f13*
99% 2 PGs not scrubbed since 32 intervals (180h) 19.1f87* 19.67b*
99% 2 PGs not scrubbed since 36 intervals (186h) 19.133f* 19.1318*
99% 2 PGs not scrubbed since 40 intervals (192h) 19.12f4* 19.248*
100% 1 PGs not scrubbed since 43 intervals (198h) 19.1984*
8192 PGs out of 8192 reported, 0 missing.
Deep-scrub report:
14% 1210 PGs not deep-scrubbed since 1 intervals ( 24h)
28% 1136 PGs not deep-scrubbed since 2 intervals ( 48h) 1 scrubbing+deep
40% 985 PGs not deep-scrubbed since 3 intervals ( 72h) 4 scrubbing+deep
51% 851 PGs not deep-scrubbed since 4 intervals ( 96h) 5 scrubbing+deep
59% 713 PGs not deep-scrubbed since 5 intervals (120h) 4 scrubbing+deep
63% 276 PGs not deep-scrubbed since 6 intervals (144h) 2 scrubbing+deep
70% 566 PGs not deep-scrubbed since 7 intervals (168h) 1 scrubbing+deep
76% 534 PGs not deep-scrubbed since 8 intervals (192h) 2 scrubbing+deep
82% 480 PGs not deep-scrubbed since 9 intervals (216h) 2 scrubbing+deep
87% 381 PGs not deep-scrubbed since 10 intervals (240h) 2 scrubbing+deep
90% 253 PGs not deep-scrubbed since 11 intervals (264h) 1 scrubbing+deep
92% 222 PGs not deep-scrubbed since 12 intervals (288h) 1 scrubbing+deep
94% 136 PGs not deep-scrubbed since 13 intervals (312h) 1 scrubbing+deep
96% 179 PGs not deep-scrubbed since 14 intervals (336h) 3 scrubbing+deep
98% 156 PGs not deep-scrubbed since 15 intervals (360h) 6 scrubbing+deep
99% 65 PGs not deep-scrubbed since 16 intervals (384h) 3 scrubbing+deep
99% 31 PGs not deep-scrubbed since 17 intervals (408h) 4 scrubbing+deep
99% 14 PGs not deep-scrubbed since 18 intervals (432h) 3 scrubbing+deep
99% 3 PGs not deep-scrubbed since 19 intervals (456h) 19.1d89* 19.fb7* 19.807*
100% 1 PGs not deep-scrubbed since 21 intervals (480h) 19.a01*
8192 PGs out of 8192 reported, 0 missing.
PGs marked with a * are on busy OSDs and not eligible for scrubbing.
con-fs2-data2 scrub_min_interval=42h
con-fs2-data2 scrub_max_interval=0h
con-fs2-data2 deep_scrub_interval=0h
===
This is the real deal, the pool I'm fighting with at the moment. I made a small change
of scrub_min_interval (pool setting) from 24h to 42h, which resulted in the much better
deep-scrub state allocation of the PGs in this pool. With scrub_min_interval=24h, basically
all scrubbing happened on PGs that had last been deep-scrubbed only 1-6 days ago. After
increasing this value to the time interval within which about 70% of PGs were scrubbed
(leaving 30% eligible), the allocation of deep-scrub states is much, much better. I expect
both tails to get shorter and the overall deep-scrub load to go down as well. I hope to
reach a state where I only need to issue a few deep-scrubs manually per day to get
everything scrubbed within 1 week and deep-scrubbed within 3-4 weeks.
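For reference, the per-pool change corresponds to something like the following (to my knowledge the pool-level scrub_min_interval is given in seconds, so 42h = 151200s; please verify against your release's documentation):

```shell
# raise this pool's scrub_min_interval from 24h to 42h (value in seconds)
ceph osd pool set con-fs2-data2 scrub_min_interval 151200
ceph osd pool get con-fs2-data2 scrub_min_interval
```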
For now I will wait and see what effect the global settings have on the SSD pools and what
the HDD pool converges to. This will need 1-2 months of observation and I will report back
when significant changes show up.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: Wednesday, November 15, 2023 11:14 AM
To: ceph-users(a)ceph.io
Subject: [ceph-users] How to configure something like osd_deep_scrub_min_interval?
Hi folks,
I am fighting a bit with odd deep-scrub behavior on HDDs and discovered a likely cause of
why the distribution of last_deep_scrub_stamps is so weird. I wrote a small script to
extract a histogram of scrubs by "days not scrubbed" (more precisely, intervals
not scrubbed; see code) to find out how (deep-) scrub times are distributed. Output
below.
What I expected is along the lines that HDD-OSDs try to scrub every 1-3 days, while they
try to deep-scrub every 7-14 days. In other words, OSDs that have been deep-scrubbed
within the last 7 days would *never* be in scrubbing+deep state. However, what I see is
completely different. There seems to be no distinction between scrub- and deep-scrub start
times. This is really unexpected as nobody would try to deep-scrub HDDs every day. Weekly
to bi-weekly is normal, specifically for large drives.
Is there a way to configure something like osd_deep_scrub_min_interval (no, I don't
want to run cron jobs for scrubbing yet)? In the output below, I would like to be able to
configure a minimum period of 1-2 weeks before the next deep-scrub happens. How can I do
that?
The observed behavior is very unusual for RAID systems (if it's not a bug in the report
script). With this behavior it's not surprising that people complain about "not
deep-scrubbed in time" messages and too-high deep-scrub IO load, when such a large
percentage of PGs is needlessly deep-scrubbed again after only 1-6 days.
Sample output:
# scrub-report
dumped pgs
Scrub report:
4121 PGs not scrubbed since 1 intervals (6h)
3831 PGs not scrubbed since 2 intervals (6h)
4012 PGs not scrubbed since 3 intervals (6h)
3986 PGs not scrubbed since 4 intervals (6h)
2998 PGs not scrubbed since 5 intervals (6h)
1488 PGs not scrubbed since 6 intervals (6h)
909 PGs not scrubbed since 7 intervals (6h)
771 PGs not scrubbed since 8 intervals (6h)
582 PGs not scrubbed since 9 intervals (6h) 2 scrubbing
431 PGs not scrubbed since 10 intervals (6h)
333 PGs not scrubbed since 11 intervals (6h) 1 scrubbing
265 PGs not scrubbed since 12 intervals (6h)
195 PGs not scrubbed since 13 intervals (6h)
116 PGs not scrubbed since 14 intervals (6h)
78 PGs not scrubbed since 15 intervals (6h) 1 scrubbing
72 PGs not scrubbed since 16 intervals (6h)
37 PGs not scrubbed since 17 intervals (6h)
      5 PGs not scrubbed since 18 intervals (6h) 14.237* 19.5cd* 19.12cc* 19.1233* 14.40e*
33 PGs not scrubbed since 20 intervals (6h)
23 PGs not scrubbed since 21 intervals (6h)
16 PGs not scrubbed since 22 intervals (6h)
12 PGs not scrubbed since 23 intervals (6h)
8 PGs not scrubbed since 24 intervals (6h)
2 PGs not scrubbed since 25 intervals (6h) 19.eef* 19.bb3*
4 PGs not scrubbed since 26 intervals (6h) 19.b4c* 19.10b8* 19.f13* 14.1ed*
      5 PGs not scrubbed since 27 intervals (6h) 19.43f* 19.231* 19.1dbe* 19.1788* 19.16c0*
6 PGs not scrubbed since 28 intervals (6h)
2 PGs not scrubbed since 30 intervals (6h) 19.10f6* 14.9d*
3 PGs not scrubbed since 31 intervals (6h) 19.1322* 19.1318* 8.a*
1 PGs not scrubbed since 32 intervals (6h) 19.133f*
1 PGs not scrubbed since 33 intervals (6h) 19.1103*
3 PGs not scrubbed since 36 intervals (6h) 19.19cc* 19.12f4* 19.248*
1 PGs not scrubbed since 39 intervals (6h) 19.1984*
1 PGs not scrubbed since 41 intervals (6h) 14.449*
1 PGs not scrubbed since 44 intervals (6h) 19.179f*
Deep-scrub report:
3723 PGs not deep-scrubbed since 1 intervals (24h)
4621 PGs not deep-scrubbed since 2 intervals (24h) 8 scrubbing+deep
3588 PGs not deep-scrubbed since 3 intervals (24h) 8 scrubbing+deep
2929 PGs not deep-scrubbed since 4 intervals (24h) 3 scrubbing+deep
1705 PGs not deep-scrubbed since 5 intervals (24h) 4 scrubbing+deep
1904 PGs not deep-scrubbed since 6 intervals (24h) 5 scrubbing+deep
1540 PGs not deep-scrubbed since 7 intervals (24h) 7 scrubbing+deep
1304 PGs not deep-scrubbed since 8 intervals (24h) 7 scrubbing+deep
923 PGs not deep-scrubbed since 9 intervals (24h) 5 scrubbing+deep
557 PGs not deep-scrubbed since 10 intervals (24h) 7 scrubbing+deep
501 PGs not deep-scrubbed since 11 intervals (24h) 2 scrubbing+deep
363 PGs not deep-scrubbed since 12 intervals (24h) 2 scrubbing+deep
377 PGs not deep-scrubbed since 13 intervals (24h) 1 scrubbing+deep
383 PGs not deep-scrubbed since 14 intervals (24h) 2 scrubbing+deep
252 PGs not deep-scrubbed since 15 intervals (24h) 2 scrubbing+deep
116 PGs not deep-scrubbed since 16 intervals (24h) 5 scrubbing+deep
47 PGs not deep-scrubbed since 17 intervals (24h) 2 scrubbing+deep
10 PGs not deep-scrubbed since 18 intervals (24h)
2 PGs not deep-scrubbed since 19 intervals (24h) 19.1c6c* 19.a01*
1 PGs not deep-scrubbed since 20 intervals (24h) 14.1ed*
2 PGs not deep-scrubbed since 21 intervals (24h) 19.1322* 19.10f6*
1 PGs not deep-scrubbed since 23 intervals (24h) 19.19cc*
1 PGs not deep-scrubbed since 24 intervals (24h) 19.179f*
PGs marked with a * are on busy OSDs and not eligible for scrubbing.
The script (pasted here because attaching doesn't work):
# cat bin/scrub-report
#!/bin/bash
# Compute last scrub interval count. Scrub interval 6h, deep-scrub interval 24h.
# Print how many PGs have not been (deep-)scrubbed since #intervals.
ceph -f json pg dump pgs 2>&1 > /root/.cache/ceph/pgs_dump.json
echo ""
T0="$(date +%s)"
scrub_info="$(jq --arg T0 "$T0" -rc '.pg_stats[] | [
.pgid,
(.last_scrub_stamp[:19]+"Z" | (($T0|tonumber) -
fromdateiso8601)/(60*60*6)|ceil),
(.last_deep_scrub_stamp[:19]+"Z" | (($T0|tonumber) -
fromdateiso8601)/(60*60*24)|ceil),
.state,
(.acting | join(" "))
] | @tsv
' /root/.cache/ceph/pgs_dump.json)"
# less <<<"$scrub_info"
# 1 2 3 4 5..NF
# pg_id scrub-ints deep-scrub-ints status acting[]
awk <<<"$scrub_info" '{
for(i=5; i<=NF; ++i) pg_osds[$1]=pg_osds[$1] " " $i
if($4 == "active+clean") {
si_mx=si_mx<$2 ? $2 : si_mx
dsi_mx=dsi_mx<$3 ? $3 : dsi_mx
pg_sn[$2]++
pg_sn_ids[$2]=pg_sn_ids[$2] " " $1
pg_dsn[$3]++
pg_dsn_ids[$3]=pg_dsn_ids[$3] " " $1
} else if($4 ~ /scrubbing\+deep/) {
deep_scrubbing[$3]++
for(i=5; i<=NF; ++i) osd[$i]="busy"
} else if($4 ~ /scrubbing/) {
scrubbing[$2]++
for(i=5; i<=NF; ++i) osd[$i]="busy"
} else {
unclean[$2]++
unclean_d[$3]++
si_mx=si_mx<$2 ? $2 : si_mx
dsi_mx=dsi_mx<$3 ? $3 : dsi_mx
pg_sn[$2]++
pg_sn_ids[$2]=pg_sn_ids[$2] " " $1
pg_dsn[$3]++
pg_dsn_ids[$3]=pg_dsn_ids[$3] " " $1
for(i=5; i<=NF; ++i) osd[$i]="busy"
}
}
END {
print "Scrub report:"
for(si=1; si<=si_mx; ++si) {
if(pg_sn[si]==0 && scrubbing[si]==0 && unclean[si]==0)
continue;
printf("%7d PGs not scrubbed since %2d intervals (6h)",
pg_sn[si], si)
if(scrubbing[si]) printf(" %d scrubbing", scrubbing[si])
if(unclean[si]) printf(" %d unclean", unclean[si])
if(pg_sn[si]<=5) {
split(pg_sn_ids[si], pgs)
for(pg in pgs) {
# reset the flag per PG; otherwise one busy PG stars all following PGs
osds_busy=0
split(pg_osds[pgs[pg]], osds)
for(o in osds) if(osd[osds[o]]=="busy") osds_busy=1
if(osds_busy) printf(" %s*", pgs[pg])
else printf(" %s", pgs[pg])
}
}
printf("\n")
}
print ""
print "Deep-scrub report:"
for(dsi=1; dsi<=dsi_mx; ++dsi) {
if(pg_dsn[dsi]==0 && deep_scrubbing[dsi]==0 &&
unclean_d[dsi]==0) continue;
printf("%7d PGs not deep-scrubbed since %2d intervals (24h)",
pg_dsn[dsi], dsi)
if(deep_scrubbing[dsi]) printf(" %d scrubbing+deep",
deep_scrubbing[dsi])
if(unclean_d[dsi]) printf(" %d unclean", unclean_d[dsi])
if(pg_dsn[dsi]<=5) {
split(pg_dsn_ids[dsi], pgs)
for(pg in pgs) {
# reset the flag per PG; otherwise one busy PG stars all following PGs
osds_busy=0
split(pg_osds[pgs[pg]], osds)
for(o in osds) if(osd[osds[o]]=="busy") osds_busy=1
if(osds_busy) printf(" %s*", pgs[pg])
else printf(" %s", pgs[pg])
}
}
printf("\n")
}
print ""
print "PGs marked with a * are on busy OSDs and not eligible for scrubbing."
}
'
Don't forget the last "'" when copy-pasting.
Thanks for any pointers.
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io