Hi folks,
I am fighting a bit with odd deep-scrub behavior on HDDs and discovered a likely reason why the distribution of last_deep_scrub_stamps is so weird. I wrote a small script to extract a histogram of scrubs by "days not scrubbed" (more precisely, intervals not scrubbed; see the code) to find out how (deep-)scrub times are distributed. Output below.
What I expected was something along these lines: HDD OSDs try to scrub every 1-3 days, while they try to deep-scrub every 7-14 days. In other words, OSDs that have been deep-scrubbed within the last 7 days would *never* be in the scrubbing+deep state. However, what I see is completely different. There seems to be no distinction between scrub and deep-scrub start times. This is really unexpected, as nobody would try to deep-scrub HDDs every day; weekly to bi-weekly is normal, especially for large drives.
Is there a way to configure something like osd_deep_scrub_min_interval (no, I don't want to run cron jobs for scrubbing yet)? In the output below, I would like to be able to configure a minimum period of 1-2 weeks before the next deep-scrub happens. How can I do that?
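For completeness, these are the scrub-interval options I have found so far (checked with "ceph config get"; the defaults in brackets are what I believe they are, please correct me if I'm wrong). None of them looks like a minimum interval between deep-scrubs:
ceph config get osd osd_scrub_min_interval              (1 day: earliest start of the next regular scrub)
ceph config get osd osd_scrub_max_interval              (1 week: a regular scrub is forced after this)
ceph config get osd osd_deep_scrub_interval             (1 week: target period for deep-scrubs)
ceph config get osd osd_scrub_interval_randomize_ratio  (0.5: randomizes scrub scheduling)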
The observed behavior is very unusual for RAID systems (if it's not a bug in the report script). With this behavior it's not surprising that people complain about "not deep-scrubbed in time" messages and excessive deep-scrub IO load, when such a large percentage of OSDs is needlessly deep-scrubbed again after only 1-6 days.
Sample output:
# scrub-report
dumped pgs
Scrub report:
4121 PGs not scrubbed since 1 intervals (6h)
3831 PGs not scrubbed since 2 intervals (6h)
4012 PGs not scrubbed since 3 intervals (6h)
3986 PGs not scrubbed since 4 intervals (6h)
2998 PGs not scrubbed since 5 intervals (6h)
1488 PGs not scrubbed since 6 intervals (6h)
909 PGs not scrubbed since 7 intervals (6h)
771 PGs not scrubbed since 8 intervals (6h)
582 PGs not scrubbed since 9 intervals (6h) 2 scrubbing
431 PGs not scrubbed since 10 intervals (6h)
333 PGs not scrubbed since 11 intervals (6h) 1 scrubbing
265 PGs not scrubbed since 12 intervals (6h)
195 PGs not scrubbed since 13 intervals (6h)
116 PGs not scrubbed since 14 intervals (6h)
78 PGs not scrubbed since 15 intervals (6h) 1 scrubbing
72 PGs not scrubbed since 16 intervals (6h)
37 PGs not scrubbed since 17 intervals (6h)
5 PGs not scrubbed since 18 intervals (6h) 14.237* 19.5cd* 19.12cc* 19.1233* 14.40e*
33 PGs not scrubbed since 20 intervals (6h)
23 PGs not scrubbed since 21 intervals (6h)
16 PGs not scrubbed since 22 intervals (6h)
12 PGs not scrubbed since 23 intervals (6h)
8 PGs not scrubbed since 24 intervals (6h)
2 PGs not scrubbed since 25 intervals (6h) 19.eef* 19.bb3*
4 PGs not scrubbed since 26 intervals (6h) 19.b4c* 19.10b8* 19.f13* 14.1ed*
5 PGs not scrubbed since 27 intervals (6h) 19.43f* 19.231* 19.1dbe* 19.1788* 19.16c0*
6 PGs not scrubbed since 28 intervals (6h)
2 PGs not scrubbed since 30 intervals (6h) 19.10f6* 14.9d*
3 PGs not scrubbed since 31 intervals (6h) 19.1322* 19.1318* 8.a*
1 PGs not scrubbed since 32 intervals (6h) 19.133f*
1 PGs not scrubbed since 33 intervals (6h) 19.1103*
3 PGs not scrubbed since 36 intervals (6h) 19.19cc* 19.12f4* 19.248*
1 PGs not scrubbed since 39 intervals (6h) 19.1984*
1 PGs not scrubbed since 41 intervals (6h) 14.449*
1 PGs not scrubbed since 44 intervals (6h) 19.179f*
Deep-scrub report:
3723 PGs not deep-scrubbed since 1 intervals (24h)
4621 PGs not deep-scrubbed since 2 intervals (24h) 8 scrubbing+deep
3588 PGs not deep-scrubbed since 3 intervals (24h) 8 scrubbing+deep
2929 PGs not deep-scrubbed since 4 intervals (24h) 3 scrubbing+deep
1705 PGs not deep-scrubbed since 5 intervals (24h) 4 scrubbing+deep
1904 PGs not deep-scrubbed since 6 intervals (24h) 5 scrubbing+deep
1540 PGs not deep-scrubbed since 7 intervals (24h) 7 scrubbing+deep
1304 PGs not deep-scrubbed since 8 intervals (24h) 7 scrubbing+deep
923 PGs not deep-scrubbed since 9 intervals (24h) 5 scrubbing+deep
557 PGs not deep-scrubbed since 10 intervals (24h) 7 scrubbing+deep
501 PGs not deep-scrubbed since 11 intervals (24h) 2 scrubbing+deep
363 PGs not deep-scrubbed since 12 intervals (24h) 2 scrubbing+deep
377 PGs not deep-scrubbed since 13 intervals (24h) 1 scrubbing+deep
383 PGs not deep-scrubbed since 14 intervals (24h) 2 scrubbing+deep
252 PGs not deep-scrubbed since 15 intervals (24h) 2 scrubbing+deep
116 PGs not deep-scrubbed since 16 intervals (24h) 5 scrubbing+deep
47 PGs not deep-scrubbed since 17 intervals (24h) 2 scrubbing+deep
10 PGs not deep-scrubbed since 18 intervals (24h)
2 PGs not deep-scrubbed since 19 intervals (24h) 19.1c6c* 19.a01*
1 PGs not deep-scrubbed since 20 intervals (24h) 14.1ed*
2 PGs not deep-scrubbed since 21 intervals (24h) 19.1322* 19.10f6*
1 PGs not deep-scrubbed since 23 intervals (24h) 19.19cc*
1 PGs not deep-scrubbed since 24 intervals (24h) 19.179f*
PGs marked with a * are on busy OSDs and not eligible for scrubbing.
The script (pasted here because attaching doesn't work):
# cat bin/scrub-report
#!/bin/bash
# Compute last scrub interval count. Scrub interval 6h, deep-scrub interval 24h.
# Print how many PGs have not been (deep-)scrubbed since #intervals.
mkdir -p /root/.cache/ceph   # make sure the cache dir exists when copy-pasting
ceph -f json pg dump pgs 2>&1 > /root/.cache/ceph/pgs_dump.json
echo ""
T0="$(date +%s)"
scrub_info="$(jq --arg T0 "$T0" -rc '.pg_stats[] | [
.pgid,
(.last_scrub_stamp[:19]+"Z" | (($T0|tonumber) - fromdateiso8601)/(60*60*6)|ceil),
(.last_deep_scrub_stamp[:19]+"Z" | (($T0|tonumber) - fromdateiso8601)/(60*60*24)|ceil),
.state,
(.acting | join(" "))
] | @tsv
' /root/.cache/ceph/pgs_dump.json)"
# less <<<"$scrub_info"
# 1 2 3 4 5..NF
# pg_id scrub-ints deep-scrub-ints status acting[]
awk <<<"$scrub_info" '{
for(i=5; i<=NF; ++i) pg_osds[$1]=pg_osds[$1] " " $i
if($4 == "active+clean") {
si_mx=si_mx<$2 ? $2 : si_mx
dsi_mx=dsi_mx<$3 ? $3 : dsi_mx
pg_sn[$2]++
pg_sn_ids[$2]=pg_sn_ids[$2] " " $1
pg_dsn[$3]++
pg_dsn_ids[$3]=pg_dsn_ids[$3] " " $1
} else if($4 ~ /scrubbing\+deep/) {
deep_scrubbing[$3]++
for(i=5; i<=NF; ++i) osd[$i]="busy"
} else if($4 ~ /scrubbing/) {
scrubbing[$2]++
for(i=5; i<=NF; ++i) osd[$i]="busy"
} else {
unclean[$2]++
unclean_d[$3]++
si_mx=si_mx<$2 ? $2 : si_mx
dsi_mx=dsi_mx<$3 ? $3 : dsi_mx
pg_sn[$2]++
pg_sn_ids[$2]=pg_sn_ids[$2] " " $1
pg_dsn[$3]++
pg_dsn_ids[$3]=pg_dsn_ids[$3] " " $1
for(i=5; i<=NF; ++i) osd[$i]="busy"
}
}
END {
print "Scrub report:"
for(si=1; si<=si_mx; ++si) {
if(pg_sn[si]==0 && scrubbing[si]==0 && unclean[si]==0) continue;
printf("%7d PGs not scrubbed since %2d intervals (6h)", pg_sn[si], si)
if(scrubbing[si]) printf(" %d scrubbing", scrubbing[si])
if(unclean[si]) printf(" %d unclean", unclean[si])
if(pg_sn[si]<=5) {
split(pg_sn_ids[si], pgs)
for(pg in pgs) {
osds_busy=0   # reset per PG; otherwise one busy PG marks all following PGs with *
split(pg_osds[pgs[pg]], osds)
for(o in osds) if(osd[osds[o]]=="busy") osds_busy=1
if(osds_busy) printf(" %s*", pgs[pg])
else printf(" %s", pgs[pg])
}
}
printf("\n")
}
print ""
print "Deep-scrub report:"
for(dsi=1; dsi<=dsi_mx; ++dsi) {
if(pg_dsn[dsi]==0 && deep_scrubbing[dsi]==0 && unclean_d[dsi]==0) continue;
printf("%7d PGs not deep-scrubbed since %2d intervals (24h)", pg_dsn[dsi], dsi)
if(deep_scrubbing[dsi]) printf(" %d scrubbing+deep", deep_scrubbing[dsi])
if(unclean_d[dsi]) printf(" %d unclean", unclean_d[dsi])
if(pg_dsn[dsi]<=5) {
split(pg_dsn_ids[dsi], pgs)
for(pg in pgs) {
osds_busy=0   # reset per PG; see note in the scrub report above
split(pg_osds[pgs[pg]], osds)
for(o in osds) if(osd[osds[o]]=="busy") osds_busy=1
if(osds_busy) printf(" %s*", pgs[pg])
else printf(" %s", pgs[pg])
}
}
printf("\n")
}
print ""
print "PGs marked with a * are on busy OSDs and not eligible for scrubbing."
}
'
Don't forget the last "'" when copy-pasting.
Thanks for any pointers.
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hello,
I have a problem with my Ceph cluster (3x MON nodes, 6x OSD nodes; every
OSD node has 12 rotational disks and one NVMe device for the BlueStore DB).
Ceph was installed by the ceph orchestrator, and the OSDs use BlueFS storage.
I started the upgrade from version 17.2.6 to 18.2.1 by invoking:
ceph orch upgrade start --ceph-version 18.2.1
After the MON and MGR processes were upgraded, the orchestrator tried to
upgrade the first OSD node, but its OSDs keep falling down.
I stopped the upgrade process, but I now have one OSD node
completely down.
After the upgrade I got some error messages and found the
/var/lib/ceph/crashxxxx directories; I attach to this message the
files which I found there.
Please, can you advise what I can do now? It seems that RocksDB
is either incompatible or corrupted :-(
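For completeness, this is roughly what I ran to stop the upgrade and to collect the crash information (crash IDs elided):
ceph orch upgrade status
ceph orch upgrade stop
ceph crash ls
ceph crash info <crash-id>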
Thanks in advance.
Sincerely
Jan Marek
--
Ing. Jan Marek
University of South Bohemia
Academic Computer Centre
Phone: +420389032080
http://www.gnu.org/philosophy/no-word-attachments.cs.html
Hello,
We've been using Ceph for managing our storage infrastructure, and we recently upgraded to the latest version (Ceph v18.2.1 "reef"). However, we've noticed that the "refresh interval" option seems to be missing from the dashboard, and we are facing challenges with monitoring our cluster in real time.
In the earlier version of the Ceph dashboard, there was a useful "refresh interval" option that allowed us to customize the update frequency of the dashboard. This was particularly handy for monitoring changes and responding promptly. However, after the upgrade to Ceph v18.2.1 "reef", we can't seem to find this option anywhere in the dashboard.
Additionally, we observed an automatic refresh occurring every 25 seconds. We are seeking guidance on locating and tuning the refresh-interval setting in the latest version of Ceph, to potentially reduce this interval.
We've explored the dashboard settings thoroughly and reviewed the release notes for Ceph v18.2.1 "reef", but we couldn't find any mention of the removal of the "refresh interval" option.
Any guidance or insights would be greatly appreciated!
Thanks,
Mohammad Saif
Ceph Enthusiast
>>
>> You can do that for a PoC, but that's a bad idea for any production workload. You'd want at least three nodes with OSDs to use the default RF=3 replication. You can do RF=2, but at the peril of your mortal data.
>
> I'm not sure I agree - I think size=2, min_size=2 is no worse than
> RAID1 for data security.
size=2, min_size=2 *is* RAID1. Except that you become unavailable if a single drive is unavailable.
> That isn't even the main risk as I understand it. Of course a double
> failure is going to be a problem with size=2, or traditional RAID1,
> and I think anybody choosing this configuration accepts this risk.
We see people often enough who don’t know that. I’ve seen double failures. ymmv.
> As I understand it, the reason min_size=1 is a trap has nothing to do
> with double failures per se.
It’s one of the concerns.
>
> The issue is that Ceph OSDs are somewhat prone to flapping during
> recovery (OOM, etc). So even if the disk is fine, an OSD can go down
> for a short time. If you have size=2, min=1 configured, then when
> this happens the PG will become degraded and will continue operating
> on the other OSD, and the flapping OSD becomes stale. Then when it
> comes back up it recovers. The problem is that if the other OSD has a
> permanent failure (disk crash/etc) while the first OSD is flapping,
> now you have no good OSDs, because when the flapping OSD comes back up
> it is stale, and its PGs have no peer.
Indeed, arguably that’s an overlapping failure. I’ve seen this too, and have a pg query to demonstrate it.
> I suspect there are ways to re-activate it, though this will result in potential data
> inconsistency since writes were allowed to the cluster and will then
> get rolled back.
Yep.
> With only two OSDs I'm guessing that would be the
> main impact (well, depending on journaling behavior/etc), but if you
> have more OSDs than that then you could have situations where one file
> is getting rolled back, and some other file isn't, and so on.
But you’d have a voting majority.
>
> With min_size=2 you're fairly safe from flapping because there will
> always be two replicas that have the most recent version of every PG,
> and so you can still tolerate a permanent failure of one of them.
Exactly.
>
> size=2, min=2 doesn't suffer this failure mode, because anytime there
> is flapping the PG goes inactive and no writes can be made, so when
> the other OSD comes back up there is nothing to recover. Of course
> this results in IO blocks and downtime, which is obviously
> undesirable, but it is likely a more recoverable state than
> inconsistent writes.
Agreed, the difference between availability and durability. Depends what’s important to you.
>
> Apologies if I've gotten any of that wrong, but my understanding is
> that it is these sorts of failure modes that cause min_size=1 to be a
> trap. This isn't the sort of thing that typically happens in a RAID1
> config, or at least that admins don't think about.
It’s both.
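For anyone reading along later: size and min_size are ordinary per-pool settings, e.g. (pool name is just an example):
ceph osd pool get rbd size
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2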
Hi,
When trying to log in to RGW via the dashboard, an error appears in the logs:
ValueError: invalid literal for int() with base 10: '443 ssl_certificate=config://rgw/cert/rgw.test'
This is RGW with SSL enabled.
If RGW runs without SSL, everything works fine.
ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
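In case it helps with the diagnosis: the value the dashboard is trying to parse comes from rgw_frontends, which can be checked with
ceph config dump | grep rgw_frontends
In our setup the port and the ssl_certificate sit in that one string, and the dashboard seems to take everything after 'port=' as the port number, which matches the ValueError above.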
Please tell me how to solve this?
Hi All,
Just successfully(?) completed a "live" update of the first node of a
Ceph Quincy cluster from RL8 to RL9. Everything "seems" to be working -
EXCEPT the iSCSI Gateway on that box.
During the update the ceph-iscsi package was removed (ie
`ceph-iscsi-3.6-2.g97f5b02.el8.noarch.rpm` - this is the latest package
available from the Ceph Repos). So, obviously, I reinstalled the package.
However, `dnf` is throwing errors (unsurprisingly, as that package is an
el8 package and this box is now running el9): that package requires
Python 3.6, while el9 ships with Python 3.9.
So my question(s) is: Can I simply "downgrade" python to 3.6, or is
there an el9-compatible version of `ceph-iscsi` somewhere, and/or is
there some process I need to follow to get the iSCSI Gateway back up and
running?
Some further info: The next step in my
"happy-happy-fun-time-holiday-ICT-maintenance" was to upgrade the
current Ceph Cluster to use `cephadm` and to go from Ceph-Quincy to
Ceph-Reef - is this my ultimate upgrade path to get the iSCSI G/W back?
BTW the Ceph Cluster is used *only* to provide iSCSI LUNs to an oVirt
(KVM) Cluster front-end. Because it is the holidays I can take the
entire network down (ie shut down all the VMs) to facilitate this update
process, which also means that I could use some other (ie non-iSCSI,
I think) way to connect the Ceph SAN Cluster to the oVirt VM-Hosting
Cluster. If *this* is the solution (ie no iSCSI), does someone have any
experience in running oVirt off of Ceph in a non-iSCSI way, and could
you be so kind as to provide some pointers/documentation/help?
And before anyone says it, let me: "I broke, now I own it" :-)
Thanks in advance, and everyone have a Merry Christmas, Heavenly
Hanukkah, Quality Kwanzaa, Really-good (upcoming) Ramadan, and/or a
Happy Holidays.
Cheers
Dulux-Oz
Hello.
We are using Ceph storage to test whether we can run our service by uploading and storing more than 40 billion files.
So I'd like to check the points below.
1) Maximum number of RADOS gateway objects that can be stored in one cluster using the bucket index
2) Maximum number of RADOS gateway objects that can be stored in one bucket
We have looked at the limitations on the number of RADOS gateway objects mentioned in the existing documents, but the number seems to be theoretically unlimited.
If you have operated at this scale of objects in actual services or products, we would appreciate it if you could share your experience.
Below are the related documents and configuration values we are using.
> Related documents
- https://documentation.suse.com/ses/5.5/html/ses-all/cha-ceph-gw.html
- https://www.ibm.com/docs/en/storage-ceph/6?topic=resharding-limitations-buc…
- https://docs.ceph.com/en/latest/dev/radosgw/bucket_index/
> Related config
- rgw_dynamic_resharding: true
- rgw_max_objs_per_shard: 100000
- rgw_max_dynamic_shards: 65521
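For context, multiplying the last two values gives the ceiling implied by dynamic resharding alone with our settings:
rgw_max_objs_per_shard x rgw_max_dynamic_shards = 100,000 x 65,521 ≈ 6.55 billion objects per bucket
This is only the resharding-derived ceiling, not a tested operational limit, and it says nothing about the total number of objects across the cluster.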
Good morning everybody!
Guys, are there any differences or limitations when using Docker instead of
Podman?
Context: I have a cluster on Debian 11 running Podman (3.0.1). When the
iSCSI service is restarted, the "tcmu-runner" binary ends up in "Z state" and
the "rbd-target-api" script enters "D state" and never dies, which prevents
the service from starting until I perform a reboot. On machines that use
distributions based on Red Hat with Podman 4+ this behavior does not happen.
I don't want to use a repository that I don't know about just to update
podman.
I haven't tested it with Debian 12 yet, as we experienced some problems
with bootstrap, so we decided to use Debian 11.
I'm thinking about testing with Docker, but I don't know what the differences
are between the two solutions in the Ceph context.
Hi community,
When I list RBD images in the Ceph dashboard (Block -> Images), the image
list is too slow to view. How can I make it load faster?
I am using ceph reef version 18.2.1
Thanks to the community.
*Tran Thanh Phong*
Email: tranphong079(a)gmail.com
Skype: tranphong079