Hi folks,
I am fighting a bit with odd deep-scrub behavior on HDDs and discovered a likely reason why the distribution of last_deep_scrub_stamps is so weird. I wrote a small script to extract a histogram of scrubs by "days not scrubbed" (more precisely, intervals not scrubbed; see the code) to find out how (deep-)scrub times are distributed. Output below.
What I expected was something along these lines: HDD OSDs try to scrub every 1-3 days, while they try to deep-scrub every 7-14 days. In other words, OSDs that have been deep-scrubbed within the last 7 days would *never* be in the scrubbing+deep state. However, what I see is completely different. There seems to be no distinction between scrub and deep-scrub start times. This is really unexpected, as nobody would try to deep-scrub HDDs every day; weekly to bi-weekly is normal, especially for large drives.
Is there a way to configure something like osd_deep_scrub_min_interval (no, I don't want to run cron jobs for scrubbing yet)? In the output below, I would like to be able to configure a minimum period of 1-2 weeks before the next deep-scrub happens. How can I do that?
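For completeness, these are the scrub-interval options I have found so far (checked with "ceph config get"; the defaults in brackets are what I believe they are, please correct me if I'm wrong). None of them looks like a minimum interval between deep-scrubs:
ceph config get osd osd_scrub_min_interval              (1 day: earliest start of the next regular scrub)
ceph config get osd osd_scrub_max_interval              (1 week: a regular scrub is forced after this)
ceph config get osd osd_deep_scrub_interval             (1 week: target period for deep-scrubs)
ceph config get osd osd_scrub_interval_randomize_ratio  (0.5: randomizes scrub scheduling)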
The observed behavior is very unusual for RAID systems (if it's not a bug in the report script). With this behavior it's not surprising that people complain about "not deep-scrubbed in time" messages and excessive deep-scrub IO load, when such a large percentage of OSDs is needlessly deep-scrubbed again after only 1-6 days.
Sample output:
# scrub-report
dumped pgs
Scrub report:
4121 PGs not scrubbed since 1 intervals (6h)
3831 PGs not scrubbed since 2 intervals (6h)
4012 PGs not scrubbed since 3 intervals (6h)
3986 PGs not scrubbed since 4 intervals (6h)
2998 PGs not scrubbed since 5 intervals (6h)
1488 PGs not scrubbed since 6 intervals (6h)
909 PGs not scrubbed since 7 intervals (6h)
771 PGs not scrubbed since 8 intervals (6h)
582 PGs not scrubbed since 9 intervals (6h) 2 scrubbing
431 PGs not scrubbed since 10 intervals (6h)
333 PGs not scrubbed since 11 intervals (6h) 1 scrubbing
265 PGs not scrubbed since 12 intervals (6h)
195 PGs not scrubbed since 13 intervals (6h)
116 PGs not scrubbed since 14 intervals (6h)
78 PGs not scrubbed since 15 intervals (6h) 1 scrubbing
72 PGs not scrubbed since 16 intervals (6h)
37 PGs not scrubbed since 17 intervals (6h)
5 PGs not scrubbed since 18 intervals (6h) 14.237* 19.5cd* 19.12cc* 19.1233* 14.40e*
33 PGs not scrubbed since 20 intervals (6h)
23 PGs not scrubbed since 21 intervals (6h)
16 PGs not scrubbed since 22 intervals (6h)
12 PGs not scrubbed since 23 intervals (6h)
8 PGs not scrubbed since 24 intervals (6h)
2 PGs not scrubbed since 25 intervals (6h) 19.eef* 19.bb3*
4 PGs not scrubbed since 26 intervals (6h) 19.b4c* 19.10b8* 19.f13* 14.1ed*
5 PGs not scrubbed since 27 intervals (6h) 19.43f* 19.231* 19.1dbe* 19.1788* 19.16c0*
6 PGs not scrubbed since 28 intervals (6h)
2 PGs not scrubbed since 30 intervals (6h) 19.10f6* 14.9d*
3 PGs not scrubbed since 31 intervals (6h) 19.1322* 19.1318* 8.a*
1 PGs not scrubbed since 32 intervals (6h) 19.133f*
1 PGs not scrubbed since 33 intervals (6h) 19.1103*
3 PGs not scrubbed since 36 intervals (6h) 19.19cc* 19.12f4* 19.248*
1 PGs not scrubbed since 39 intervals (6h) 19.1984*
1 PGs not scrubbed since 41 intervals (6h) 14.449*
1 PGs not scrubbed since 44 intervals (6h) 19.179f*
Deep-scrub report:
3723 PGs not deep-scrubbed since 1 intervals (24h)
4621 PGs not deep-scrubbed since 2 intervals (24h) 8 scrubbing+deep
3588 PGs not deep-scrubbed since 3 intervals (24h) 8 scrubbing+deep
2929 PGs not deep-scrubbed since 4 intervals (24h) 3 scrubbing+deep
1705 PGs not deep-scrubbed since 5 intervals (24h) 4 scrubbing+deep
1904 PGs not deep-scrubbed since 6 intervals (24h) 5 scrubbing+deep
1540 PGs not deep-scrubbed since 7 intervals (24h) 7 scrubbing+deep
1304 PGs not deep-scrubbed since 8 intervals (24h) 7 scrubbing+deep
923 PGs not deep-scrubbed since 9 intervals (24h) 5 scrubbing+deep
557 PGs not deep-scrubbed since 10 intervals (24h) 7 scrubbing+deep
501 PGs not deep-scrubbed since 11 intervals (24h) 2 scrubbing+deep
363 PGs not deep-scrubbed since 12 intervals (24h) 2 scrubbing+deep
377 PGs not deep-scrubbed since 13 intervals (24h) 1 scrubbing+deep
383 PGs not deep-scrubbed since 14 intervals (24h) 2 scrubbing+deep
252 PGs not deep-scrubbed since 15 intervals (24h) 2 scrubbing+deep
116 PGs not deep-scrubbed since 16 intervals (24h) 5 scrubbing+deep
47 PGs not deep-scrubbed since 17 intervals (24h) 2 scrubbing+deep
10 PGs not deep-scrubbed since 18 intervals (24h)
2 PGs not deep-scrubbed since 19 intervals (24h) 19.1c6c* 19.a01*
1 PGs not deep-scrubbed since 20 intervals (24h) 14.1ed*
2 PGs not deep-scrubbed since 21 intervals (24h) 19.1322* 19.10f6*
1 PGs not deep-scrubbed since 23 intervals (24h) 19.19cc*
1 PGs not deep-scrubbed since 24 intervals (24h) 19.179f*
PGs marked with a * are on busy OSDs and not eligible for scrubbing.
The script (pasted here because attaching doesn't work):
# cat bin/scrub-report
#!/bin/bash
# Compute last scrub interval count. Scrub interval 6h, deep-scrub interval 24h.
# Print how many PGs have not been (deep-)scrubbed since #intervals.
mkdir -p /root/.cache/ceph   # make sure the cache dir exists when copy-pasting
ceph -f json pg dump pgs 2>&1 > /root/.cache/ceph/pgs_dump.json
echo ""
T0="$(date +%s)"
scrub_info="$(jq --arg T0 "$T0" -rc '.pg_stats[] | [
.pgid,
(.last_scrub_stamp[:19]+"Z" | (($T0|tonumber) - fromdateiso8601)/(60*60*6)|ceil),
(.last_deep_scrub_stamp[:19]+"Z" | (($T0|tonumber) - fromdateiso8601)/(60*60*24)|ceil),
.state,
(.acting | join(" "))
] | @tsv
' /root/.cache/ceph/pgs_dump.json)"
# less <<<"$scrub_info"
# 1 2 3 4 5..NF
# pg_id scrub-ints deep-scrub-ints status acting[]
awk <<<"$scrub_info" '{
for(i=5; i<=NF; ++i) pg_osds[$1]=pg_osds[$1] " " $i
if($4 == "active+clean") {
si_mx=si_mx<$2 ? $2 : si_mx
dsi_mx=dsi_mx<$3 ? $3 : dsi_mx
pg_sn[$2]++
pg_sn_ids[$2]=pg_sn_ids[$2] " " $1
pg_dsn[$3]++
pg_dsn_ids[$3]=pg_dsn_ids[$3] " " $1
} else if($4 ~ /scrubbing\+deep/) {
deep_scrubbing[$3]++
for(i=5; i<=NF; ++i) osd[$i]="busy"
} else if($4 ~ /scrubbing/) {
scrubbing[$2]++
for(i=5; i<=NF; ++i) osd[$i]="busy"
} else {
unclean[$2]++
unclean_d[$3]++
si_mx=si_mx<$2 ? $2 : si_mx
dsi_mx=dsi_mx<$3 ? $3 : dsi_mx
pg_sn[$2]++
pg_sn_ids[$2]=pg_sn_ids[$2] " " $1
pg_dsn[$3]++
pg_dsn_ids[$3]=pg_dsn_ids[$3] " " $1
for(i=5; i<=NF; ++i) osd[$i]="busy"
}
}
END {
print "Scrub report:"
for(si=1; si<=si_mx; ++si) {
if(pg_sn[si]==0 && scrubbing[si]==0 && unclean[si]==0) continue;
printf("%7d PGs not scrubbed since %2d intervals (6h)", pg_sn[si], si)
if(scrubbing[si]) printf(" %d scrubbing", scrubbing[si])
if(unclean[si]) printf(" %d unclean", unclean[si])
if(pg_sn[si]<=5) {
split(pg_sn_ids[si], pgs)
for(pg in pgs) {
osds_busy=0   # reset per PG; otherwise one busy PG marks all following PGs with *
split(pg_osds[pgs[pg]], osds)
for(o in osds) if(osd[osds[o]]=="busy") osds_busy=1
if(osds_busy) printf(" %s*", pgs[pg])
else printf(" %s", pgs[pg])
}
}
printf("\n")
}
print ""
print "Deep-scrub report:"
for(dsi=1; dsi<=dsi_mx; ++dsi) {
if(pg_dsn[dsi]==0 && deep_scrubbing[dsi]==0 && unclean_d[dsi]==0) continue;
printf("%7d PGs not deep-scrubbed since %2d intervals (24h)", pg_dsn[dsi], dsi)
if(deep_scrubbing[dsi]) printf(" %d scrubbing+deep", deep_scrubbing[dsi])
if(unclean_d[dsi]) printf(" %d unclean", unclean_d[dsi])
if(pg_dsn[dsi]<=5) {
split(pg_dsn_ids[dsi], pgs)
for(pg in pgs) {
osds_busy=0   # reset per PG; see note in the scrub report above
split(pg_osds[pgs[pg]], osds)
for(o in osds) if(osd[osds[o]]=="busy") osds_busy=1
if(osds_busy) printf(" %s*", pgs[pg])
else printf(" %s", pgs[pg])
}
}
printf("\n")
}
print ""
print "PGs marked with a * are on busy OSDs and not eligible for scrubbing."
}
'
Don't forget the last "'" when copy-pasting.
Thanks for any pointers.
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hello,
I have a problem with my Ceph cluster (3x MON nodes, 6x OSD nodes; every
OSD node has 12 rotational disks and one NVMe device for the BlueStore DB).
Ceph was installed by the ceph orchestrator, and the OSDs use BlueFS storage.
I started the upgrade from version 17.2.6 to 18.2.1 by invoking:
ceph orch upgrade start --ceph-version 18.2.1
After the MON and MGR processes were upgraded, the orchestrator tried to
upgrade the first OSD node, but its OSDs keep falling down.
I stopped the upgrade process, but I now have one OSD node
completely down.
After the upgrade I got some error messages and found the
/var/lib/ceph/crashxxxx directories; I attach to this message the
files which I found there.
Please, can you advise what I can do now? It seems that RocksDB
is either incompatible or corrupted :-(
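For completeness, this is roughly what I ran to stop the upgrade and to collect the crash information (crash IDs elided):
ceph orch upgrade status
ceph orch upgrade stop
ceph crash ls
ceph crash info <crash-id>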
Thanks in advance.
Sincerely
Jan Marek
--
Ing. Jan Marek
University of South Bohemia
Academic Computer Centre
Phone: +420389032080
http://www.gnu.org/philosophy/no-word-attachments.cs.html
Hello,
We've been using Ceph for managing our storage infrastructure, and we recently upgraded to the latest version (Ceph v18.2.1 "reef"). However, we've noticed that the "refresh interval" option seems to be missing from the dashboard, and we are facing challenges with monitoring our cluster in real time.
In the earlier version of the Ceph dashboard, there was a useful "refresh interval" option that allowed us to customize the update frequency of the dashboard. This was particularly handy for monitoring changes and responding promptly. However, after the upgrade to Ceph v18.2.1 "reef", we can't seem to find this option anywhere in the dashboard.
Additionally, we observed an automatic refresh occurring every 25 seconds. We are seeking guidance on locating and tuning the refresh-interval setting in the latest version of Ceph, to potentially reduce this interval.
We've explored the dashboard settings thoroughly and reviewed the release notes for Ceph v18.2.1 "reef", but we couldn't find any mention of the removal of the "refresh interval" option.
Any guidance or insights would be greatly appreciated!
Thanks,
Mohammad Saif
Ceph Enthusiast
>>
>> You can do that for a PoC, but that's a bad idea for any production workload. You'd want at least three nodes with OSDs to use the default RF=3 replication. You can do RF=2, but at the peril of your mortal data.
>
> I'm not sure I agree - I think size=2, min_size=2 is no worse than
> RAID1 for data security.
size=2, min_size=2 *is* RAID1. Except that you become unavailable if a single drive is unavailable.
> That isn't even the main risk as I understand it. Of course a double
> failure is going to be a problem with size=2, or traditional RAID1,
> and I think anybody choosing this configuration accepts this risk.
We see people often enough who don’t know that. I’ve seen double failures. ymmv.
> As I understand it, the reason min_size=1 is a trap has nothing to do
> with double failures per se.
It’s one of the concerns.
>
> The issue is that Ceph OSDs are somewhat prone to flapping during
> recovery (OOM, etc). So even if the disk is fine, an OSD can go down
> for a short time. If you have size=2, min=1 configured, then when
> this happens the PG will become degraded and will continue operating
> on the other OSD, and the flapping OSD becomes stale. Then when it
> comes back up it recovers. The problem is that if the other OSD has a
> permanent failure (disk crash/etc) while the first OSD is flapping,
> now you have no good OSDs, because when the flapping OSD comes back up
> it is stale, and its PGs have no peer.
Indeed, arguably that’s an overlapping failure. I’ve seen this too, and have a pg query to demonstrate it.
> I suspect there are ways to re-activate it, though this will result in potential data
> inconsistency since writes were allowed to the cluster and will then
> get rolled back.
Yep.
> With only two OSDs I'm guessing that would be the
> main impact (well, depending on journaling behavior/etc), but if you
> have more OSDs than that then you could have situations where one file
> is getting rolled back, and some other file isn't, and so on.
But you’d have a voting majority.
>
> With min_size=2 you're fairly safe from flapping because there will
> always be two replicas that have the most recent version of every PG,
> and so you can still tolerate a permanent failure of one of them.
Exactly.
>
> size=2, min=2 doesn't suffer this failure mode, because anytime there
> is flapping the PG goes inactive and no writes can be made, so when
> the other OSD comes back up there is nothing to recover. Of course
> this results in IO blocks and downtime, which is obviously
> undesirable, but it is likely a more recoverable state than
> inconsistent writes.
Agreed, the difference between availability and durability. Depends what’s important to you.
>
> Apologies if I've gotten any of that wrong, but my understanding is
> that it is these sorts of failure modes that cause min_size=1 to be a
> trap. This isn't the sort of thing that typically happens in a RAID1
> config, or at least that admins don't think about.
It’s both.
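For anyone reading along later: size and min_size are ordinary per-pool settings, e.g. (pool name is just an example):
ceph osd pool get rbd size
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2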
Hi,
When trying to log in to RGW via the dashboard, an error appears in the logs:
ValueError: invalid literal for int() with base 10: '443 ssl_certificate=config://rgw/cert/rgw.test'
This is RGW with SSL enabled.
If RGW runs without SSL, everything works fine.
ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
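In case it helps with the diagnosis: the value the dashboard is trying to parse comes from rgw_frontends, which can be checked with
ceph config dump | grep rgw_frontends
In our setup the port and the ssl_certificate sit in that one string, and the dashboard seems to take everything after 'port=' as the port number, which matches the ValueError above.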
Please tell me how to solve this?
Hi All,
Just successfully(?) completed a "live" update of the first node of a
Ceph Quincy cluster from RL8 to RL9. Everything "seems" to be working -
EXCEPT the iSCSI Gateway on that box.
During the update the ceph-iscsi package was removed (ie
`ceph-iscsi-3.6-2.g97f5b02.el8.noarch.rpm` - this is the latest package
available from the Ceph Repos). So, obviously, I reinstalled the package.
However, `dnf` is throwing errors (unsurprisingly, as that package is an
el8 package and this box is now running el9): that package requires
Python 3.6, while el9 ships with Python 3.9.
So my question(s) is: Can I simply "downgrade" python to 3.6, or is
there an el9-compatible version of `ceph-iscsi` somewhere, and/or is
there some process I need to follow to get the iSCSI Gateway back up and
running?
Some further info: The next step in my
"happy-happy-fun-time-holiday-ICT-maintenance" was to upgrade the
current Ceph Cluster to use `cephadm` and to go from Ceph-Quincy to
Ceph-Reef - is this my ultimate upgrade path to get the iSCSI G/W back?
BTW the Ceph Cluster is used *only* to provide iSCSI LUNs to an oVirt
(KVM) Cluster front-end. Because it is the holidays I can take the
entire network down (ie shut down all the VMs) to facilitate this update
process, which also means that I could use some other (ie non-iSCSI,
I think) way to connect the Ceph SAN Cluster to the oVirt VM-Hosting
Cluster. If *this* is the solution (ie no iSCSI), does someone have any
experience in running oVirt off of Ceph in a non-iSCSI way, and could
you be so kind as to provide some pointers/documentation/help?
And before anyone says it, let me: "I broke, now I own it" :-)
Thanks in advance, and everyone have a Merry Christmas, Heavenly
Hanukkah, Quality Kwanzaa, Really-good (upcoming) Ramadan, and/or a
Happy Holidays.
Cheers
Dulux-Oz
Hello.
We are using Ceph storage to test whether we can run our service by uploading and storing more than 40 billion files.
So I'd like to check the points below.
1) Maximum number of RADOS gateway objects that can be stored in one cluster using the bucket index
2) Maximum number of RADOS gateway objects that can be stored in one bucket
We have looked at the limitations on the number of RADOS gateway objects mentioned in the existing documents, but the number seems to be theoretically unlimited.
If you have operated at this scale of objects in actual services or products, we would appreciate it if you could share your experience.
Below are the related documents and configuration values we are using.
> Related documents
- https://documentation.suse.com/ses/5.5/html/ses-all/cha-ceph-gw.html
- https://www.ibm.com/docs/en/storage-ceph/6?topic=resharding-limitations-buc…
- https://docs.ceph.com/en/latest/dev/radosgw/bucket_index/
> Related config
- rgw_dynamic_resharding: true
- rgw_max_objs_per_shard: 100000
- rgw_max_dynamic_shards: 65521
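For context, multiplying the last two values gives the ceiling implied by dynamic resharding alone with our settings:
rgw_max_objs_per_shard x rgw_max_dynamic_shards = 100,000 x 65,521 ≈ 6.55 billion objects per bucket
This is only the resharding-derived ceiling, not a tested operational limit, and it says nothing about the total number of objects across the cluster.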
Good morning everybody!
Guys, are there any differences or limitations when using Docker instead of
Podman?
Context: I have a cluster on Debian 11 running Podman (3.0.1). When the
iSCSI service is restarted, the "tcmu-runner" binary ends up in "Z state" and
the "rbd-target-api" script enters "D state" and never dies, which prevents
the service from starting until I perform a reboot. On machines that use
distributions based on Red Hat with Podman 4+ this behavior does not happen.
I don't want to use a repository that I don't know about just to update
podman.
I haven't tested it with Debian 12 yet, as we experienced some problems
with bootstrap, so we decided to use Debian 11.
I'm thinking about testing with Docker, but I don't know what the differences
are between the two solutions in the Ceph context.
Hi community,
When I list RBD images in the Ceph dashboard (Block -> Images), the image
list is too slow to view. How can I make it load faster?
I am using ceph reef version 18.2.1
Thanks to the community.
*Tran Thanh Phong*
Email: tranphong079(a)gmail.com
Skype: tranphong079