Hi Dan,
Thanks for your answer. I don't have a problem with increasing osd_max_scrubs (currently 1) as such. I would simply prefer a somewhat finer-grained way of controlling scrubbing than just doubling or tripling it right away.
Some more info. These 2 pools are data pools for a large FS. Unfortunately, we have a large percentage of small files, which is a pain for recovery and seemingly also for deep scrubbing. Our OSDs are about 25% used and I already had to increase the warning interval to 2 weeks. With all the warning grace parameters this means that we manage to deep scrub everything about every month. I need to plan for 75% utilisation, and a 3-month period is a bit too far on the risky side.
Our data is to a large percentage cold data. Client reads will not do the check for us; we need to combat bit-rot pro-actively.
The reasons I'm interested in parameters that initiate more scrubs while also converting more scrubs into deep scrubs are that:
1) scrubs seem to complete very fast. I almost never catch a PG in state "scrubbing", I usually only see "deep scrubbing".
2) I suspect the low deep-scrub count is due to a low number of deep scrubs being scheduled and not due to conflicting per-OSD deep scrub reservations. With the OSD count we have and the distribution over 12 servers I would expect a peak of at least 50% of OSDs being active in scrubbing instead of the 25% peak I'm seeing now. It ought to be possible to schedule more PGs for deep scrub than are currently being scheduled.
3) Every OSD having only 1 deep scrub active seems to have no measurable impact on user IO. If I could just get more PGs scheduled with 1 deep scrub per OSD it would already help a lot. Once this is working, I can eventually increase osd_max_scrubs when the OSDs fill up. For now I would just like (deep) scrub scheduling to look a bit harder and schedule more eligible PGs per time unit.
If we can get deep scrubbing up to an average of 42 PGs completing per hour while keeping osd_max_scrubs=1 to maintain the current IO impact, we should be able to complete a full deep scrub round with 75% full OSDs in about 30 days, which is the current tail-time at 25% utilisation. I believe a deep scrub of a PG in these pools currently takes 2-3 hours. It's just a gut feeling from some repair and deep-scrub commands; I would need to check the logs for more precise info.
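For reference, this is roughly how I arrive at those numbers (a sketch only; the JSON layout of pg dump differs a bit between releases, so the jq path may need adjusting, and jq is assumed to be installed):

ceph pg stat     # how many PGs are scrubbing / deep scrubbing right now
ceph pg dump --format json 2>/dev/null \
  | jq -r '(.pg_map.pg_stats // .pg_stats)[].last_deep_scrub_stamp' \
  | cut -c1-13 | sort | uniq -c | tail -24     # deep scrubs completed, bucketed per hour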
Increasing osd_max_scrubs would then be a further option, not the only one, to push for more deep scrubbing. My expectation is that values of 2-3 are fine due to the increasingly higher percentage of cold data, for which no interference with client IO will happen.
Hope that makes sense and there is a way beyond bumping osd_max_scrubs to increase the number of scheduled and executed deep scrubs.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Dan van der Ster <dvanders(a)gmail.com>
Sent: 05 January 2023 15:36
To: Frank Schilder
Cc: ceph-users(a)ceph.io
Subject: Re: [ceph-users] increasing number of (deep) scrubs
Hi Frank,
What is your current osd_max_scrubs, and why don't you want to increase it?
With 8+2 and 8+3 pools, each scrub occupies a scrub slot on 10 or 11
OSDs, so at a minimum it could take 3-4x as long to scrub
the data as it would for replicated pools.
If you want the scrubs to complete in time, you need to increase the
number of scrub slots accordingly.
On the other hand, IMHO the 1-week deadline for deep scrubs is often
much too ambitious for large clusters -- increasing the scrub
intervals is one solution, or I find it simpler to increase
mon_warn_pg_not_scrubbed_ratio and mon_warn_pg_not_deep_scrubbed_ratio
until you find a ratio that works for your cluster.
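For reference, those can all be adjusted centrally, e.g. (values are examples only):

ceph config set global mon_warn_pg_not_scrubbed_ratio 1.0
ceph config set global mon_warn_pg_not_deep_scrubbed_ratio 1.0
ceph config set osd osd_max_scrubs 2     # only if you do decide to add scrub slots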
Of course, all of this can impact detection of bit-rot, which anyway
can be covered by client reads if most data is accessed periodically.
But if the cluster is mostly idle or objects are generally not read,
then it would be preferable to increase the scrub slots via osd_max_scrubs.
Cheers, Dan
On Tue, Jan 3, 2023 at 2:30 AM Frank Schilder <frans(a)dtu.dk> wrote:
>
> Hi all,
>
> we are using 16T and 18T spinning drives as OSDs and I'm observing that they are not scrubbed as often as I would like. It looks like too few scrubs are scheduled for these large OSDs. My estimate is as follows: we have 852 spinning OSDs backing an 8+2 pool with 2024 and an 8+3 pool with 8192 PGs. On average I see something like 10 PGs of pool 1 and 12 PGs of pool 2 (deep) scrubbing. This amounts to only 232 out of 852 OSDs scrubbing and seems to be due to a conservative rate of (deep) scrubs being scheduled. The PGs (deep) scrub fairly quickly.
>
> I would like to increase gently the number of scrubs scheduled for these drives and *not* the number of scrubs per OSD. I'm looking at parameters like:
>
> osd_scrub_backoff_ratio
> osd_deep_scrub_randomize_ratio
>
> I'm wondering if lowering osd_scrub_backoff_ratio to 0.5 and, maybe, increasing osd_deep_scrub_randomize_ratio to 0.2 would have the desired effect? Are there other parameters to look at that allow gradual changes in the number of scrubs going on?
>
> Thanks a lot for your help!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
Dear Ceph users,
Our CephFS is not releasing/freeing up space after deleting hundreds of
terabytes of data.
By now, this has driven us into a "nearfull" OSD/pool situation and thus
throttles IO.
We are on ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5)
quincy (stable).
Recently, we moved a bunch of data to a new pool with better EC.
This was done by adding a new EC pool to the FS.
Then assigning the FS root to the new EC pool via the directory layout xattr
(so all new data is written to the new pool).
And finally copying old data to new folders.
I swapped the data as follows to retain the old directory structure.
I also made snapshots for validation purposes.
So basically:
cp -r mymount/mydata/ mymount/new/ # this creates copy on new pool
mkdir mymount/mydata/.snap/tovalidate      # snapshot of the original data
mkdir mymount/new/mydata/.snap/tovalidate  # snapshot of the copy, for comparison
mv mymount/mydata/ mymount/old/            # park the original under "old"
mv mymount/new/mydata mymount/             # move the copy into place
I could see the increase of data in the new pool as expected (ceph df).
I compared the snapshots with hashdeep to make sure the new data is alright.
Then I went ahead deleting the old data, basically:
rmdir mymount/old/mydata/.snap/* # this also included a bunch of other
older snapshots
rm -r mymount/old/mydata
At first we had a bunch of PGs with snaptrim/snaptrim_wait.
But they are done for quite some time now.
And now, two weeks later, the size of the old pool still hasn't
really decreased.
I'm still waiting for around 500 TB to be released (and much more is
planned).
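In case it is useful, this is roughly what I have been checking so far (the MDS and pool names are placeholders):

ceph pg ls snaptrim snaptrim_wait    # should return nothing by now
ceph df detail                       # per-pool usage
rados df
# strays still held by the MDS; data of deleted files is only released
# once the corresponding stray entries have been purged
ceph daemon mds.<active-mds> perf dump | grep -E 'num_strays|pq_'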
I honestly have no clue where to go from here.
From my point of view (i.e. the CephFS mount), the data is gone.
I also never hard/soft-linked it anywhere.
This doesn't seem to be a regular issue.
At least I couldn't find anything related or resolved in the docs or
user list, yet.
If anybody has an idea how to resolve this, I would highly appreciate it.
Best Wishes,
Mathias
We're happy to announce the 7th backport release in the Quincy series.
https://ceph.io/en/news/blog/2023/v17-2-7-quincy-released/
Notable Changes
---------------
* `ceph mgr dump` command now displays the name of the Manager module that
registered a RADOS client in the `name` field added to elements of the
`active_clients` array. Previously, only the address of a module's RADOS
client was shown in the `active_clients` array.
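  For example, the module behind each client can now be pulled out directly (the jq filter is just an illustration):

  ceph mgr dump | jq '.active_clients[].name'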
* mClock Scheduler: The mClock scheduler (default scheduler in Quincy) has
undergone significant usability and design improvements to address the slow
backfill issue. Some important changes are:
* The 'balanced' profile is set as the default mClock profile because it
represents a compromise between prioritizing client IO or recovery IO. Users
can then choose either the 'high_client_ops' profile to prioritize client IO
or the 'high_recovery_ops' profile to prioritize recovery IO.
* QoS parameters including reservation and limit are now specified in terms
of a fraction (range: 0.0 to 1.0) of the OSD's IOPS capacity.
* The cost parameters (osd_mclock_cost_per_io_usec_* and
osd_mclock_cost_per_byte_usec_*) have been removed. The cost of an operation
is now determined using the random IOPS and maximum sequential bandwidth
capability of the OSD's underlying device.
* Degraded object recovery is given higher priority when compared to misplaced
object recovery because degraded objects present a data safety issue not
present with objects that are merely misplaced. Therefore, backfilling
operations with the 'balanced' and 'high_client_ops' mClock profiles may
progress slower than what was seen with the 'WeightedPriorityQueue' (WPQ)
scheduler.
* The QoS allocations in all mClock profiles are optimized based on the above
fixes and enhancements.
* For more detailed information see:
https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/
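  As a quick illustration, the active profile and the measured IOPS capacity can be inspected and changed with the usual config commands (option values below are examples only):

  ceph config show osd.0 osd_mclock_profile
  ceph config show osd.0 osd_mclock_max_capacity_iops_hdd
  ceph config set osd osd_mclock_profile high_recovery_ops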
* RGW: S3 multipart uploads using Server-Side Encryption now replicate
correctly in multi-site. Previously, the replicas of such objects were
corrupted on decryption. A new tool, ``radosgw-admin bucket resync encrypted
multipart``, can be used to identify these original multipart uploads. The
``LastModified`` timestamp of any identified object is incremented by 1
nanosecond to cause peer zones to replicate it again. For multi-site
deployments that make any use of Server-Side Encryption, we recommend
running this command against every bucket in every zone after all zones have
upgraded.
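  A per-bucket invocation presumably looks like the following (the bucket name is a placeholder; check `radosgw-admin help` for the exact syntax in your build):

  radosgw-admin bucket resync encrypted multipart --bucket=<bucket-name>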
* CephFS: The MDS now evicts clients which are not advancing their request tids,
since this causes a large buildup of session metadata, resulting in the MDS going
read-only when the RADOS operation exceeds the size threshold. The
`mds_session_metadata_threshold` config controls the maximum size that the
(encoded) session metadata can grow to.
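  If needed, the threshold can be raised through the usual config mechanism (the value below is only an example):

  ceph config set mds mds_session_metadata_threshold 16777216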
* CephFS: After recovering a Ceph File System by following the disaster
recovery procedure, the recovered files under the `lost+found` directory can now
be deleted.
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at https://download.ceph.com/tarballs/ceph-17.2.7.tar.gz
* Containers at https://quay.io/repository/ceph/ceph
* For packages, see https://docs.ceph.com/en/latest/install/get-packages/
* Release git sha1: b12291d110049b2f35e32e0de30d70e9a4c060d2
Hi all,
we had a client with the warning "[WRN] MDS_CLIENT_OLDEST_TID: 1 clients failing to advance oldest client/flush tid". I looked at the client and there was nothing going on, so I rebooted it. After the client was back, the message was still there. To clean this up I failed the MDS. Unfortunately, the MDS that took over remained stuck in rejoin without doing anything. All that happened in the log was:
[root@ceph-10 ceph]# tail -f ceph-mds.ceph-10.log
2023-07-20T15:54:29.147+0200 7fedb9c9f700 1 mds.2.896604 rejoin_start
2023-07-20T15:54:29.161+0200 7fedb9c9f700 1 mds.2.896604 rejoin_joint_start
2023-07-20T15:55:28.005+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896614 from mon.4
2023-07-20T15:56:00.278+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896615 from mon.4
[...]
2023-07-20T16:02:54.935+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896653 from mon.4
2023-07-20T16:03:07.276+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896654 from mon.4
After some time I decided to give another fail a try and, this time, the replacement daemon went to active state really fast.
If I have a message like the above, what is the clean way of getting the client clean again (version: 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable))?
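For context, what I would otherwise have tried instead of failing the MDS is to find and evict only the offending session (names/IDs below are placeholders):

ceph tell mds.<active-mds> session ls                    # look for the client that is not advancing its tid
ceph tell mds.<active-mds> client evict id=<session-id>  # evict just that client instead of failing the MDS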
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
We have a production setup of 36 OSDs (SAS disks) totalling 180 TB allocated to a single Ceph cluster with 3 monitors and 3 managers. There were 830 volumes and VMs created in OpenStack with Ceph as the backend. On Sep 21, users reported slowness in accessing the VMs.
Analysing the logs led us to problems with the SAS disks, network congestion and the Ceph configuration (all default values were used). We upgraded the network from 1 Gbps to 10 Gbps for the public and cluster networks. There was no change.
The Ceph OSD benchmark showed that 28 out of 36 OSDs reported very low IOPS of 30 to 50 while the remaining OSDs showed 300+ IOPS.
We gradually started reducing the load on the Ceph cluster and the volume count is now 650. The slow operations have gradually reduced, but I am aware that this is not the solution.
The Ceph configuration was updated as follows:
osd_journal_size = 10 GB
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1
bluestore_cache_trim_max_skip_pinned=10000
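For reference, the runtime options were applied along these lines (a sketch, assuming the centralized config store is in use):

ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_op_priority 1
ceph config set osd bluestore_cache_trim_max_skip_pinned 10000
# osd_journal_size lives in ceph.conf and only takes effect when a journal is (re)created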
After one month, we now face another issue: the mgr daemon stopped on all 3 quorum nodes and 16 OSDs went down. From the ceph-mon and ceph-mgr logs I could not determine the reason. Please guide me, as this is a production setup.
I have a bucket which got injected with a bucket policy that locks the
bucket even to the bucket owner. The bucket now cannot be accessed (even
getting its info or deleting the bucket policy does not work). I have looked in the
radosgw-admin command for a way to delete a bucket policy but do not see
anything. I presume I will need to somehow remove the bucket policy from
wherever it is stored in the bucket metadata / omap etc. If anyone can point
me in the right direction on that, I would appreciate it. Thanks
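For reference, the furthest I have gotten is locating where the policy is stored, via the bucket instance metadata (bucket name and ID below are placeholders):

radosgw-admin bucket stats --bucket=<bucket>                     # shows the bucket id
radosgw-admin metadata get bucket.instance:<bucket>:<bucket-id>  # dumps the bucket instance metadata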
Respectfully,
*Wes Dillingham*
wes(a)wesdillingham.com
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
Hi,
Since the 6.5 kernel addressed the regression in the readahead handling
code, we went ahead and installed this kernel on a couple of mail / web
clusters (Ubuntu 6.5.1-060501-generic
#202309020842 SMP PREEMPT_DYNAMIC Sat Sep 2 08:48:34 UTC 2023 x86_64
x86_64 x86_64 GNU/Linux). Since then we occasionally see the following
being logged by the kernel:
[Sun Sep 10 07:19:00 2023] workqueue: delayed_work [ceph] hogged CPU for
>10000us 4 times, consider switching to WQ_UNBOUND
[Sun Sep 10 08:41:24 2023] workqueue: ceph_con_workfn [libceph] hogged
CPU for >10000us 4 times, consider switching to WQ_UNBOUND
[Sun Sep 10 11:05:55 2023] workqueue: delayed_work [ceph] hogged CPU for
>10000us 8 times, consider switching to WQ_UNBOUND
[Sun Sep 10 12:54:38 2023] workqueue: ceph_con_workfn [libceph] hogged
CPU for >10000us 8 times, consider switching to WQ_UNBOUND
[Sun Sep 10 19:06:37 2023] workqueue: ceph_con_workfn [libceph] hogged
CPU for >10000us 16 times, consider switching to WQ_UNBOUND
[Mon Sep 11 10:53:33 2023] workqueue: ceph_con_workfn [libceph] hogged
CPU for >10000us 32 times, consider switching to WQ_UNBOUND
[Tue Sep 12 10:14:03 2023] workqueue: ceph_con_workfn [libceph] hogged
CPU for >10000us 64 times, consider switching to WQ_UNBOUND
[Tue Sep 12 11:14:33 2023] workqueue: ceph_cap_reclaim_work [ceph]
hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
We wonder if this is a new phenomenon, or whether it is simply logged by
the new kernel and was not logged before.
However, we have hit a few OOM situations since we switched to the new
kernel because of ceph_cap_reclaim_work events (the OOM happens because Apache
threads keep piling up as they cannot access CephFS). We then also see MDS
slow ops reported. This might be related to a backup job that is running
on a backup server. We did not observe this behavior on the 5.12.19 kernel.
Ceph cluster is on 16.2.11 currently.
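For what it is worth, when the slow ops appear we look at them like this (the MDS name is a placeholder):

ceph health detail
ceph daemon mds.<name> dump_ops_in_flight   # run on the MDS host; shows which ops are stuck and for how long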
Does anyone have some insight on this?
Thanks,
Stefan
Hi all,
I see 17.2.7 Quincy is published as debian-bullseye packages, so I
tried it on a test cluster.
I must say I was not expecting the big dashboard change in a patch
release. Also, all the "cluster utilization" numbers are blank now (any
way to fix that?), so the dashboard is much less usable.
Thoughts?
Bringing up that topic again:
is it possible to log the bucket name in the rgw client logs?
Currently I am only able to see the bucket name when someone accesses the bucket
via https://TLD/bucket/object instead of https://bucket.TLD/object.
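What I have been experimenting with so far is writing the ops log to a file (a sketch; as far as I can tell rgw_ops_log_file_path only exists in newer releases, so please treat that option as an assumption on my side, and adjust the client section name to your setup):

ceph config set client.rgw.<instance> rgw_enable_ops_log true
ceph config set client.rgw.<instance> rgw_ops_log_file_path /var/log/ceph/rgw-ops.log  # assumed option; the ops log records bucket and object per request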
On Tue, 3 Jan 2023 at 10:25, Boris Behrens <bb(a)kervyn.de> wrote:
> Hi,
> I am looking forward to move our logs from
> /var/log/ceph/ceph-client...log to our logaggregator.
>
> Is there a way to have the bucket name in the log file?
>
> Or can I write the rgw_enable_ops_log into a file? Maybe I could work with
> this.
>
> Cheers and happy new year
> Boris
>
--
This time, as an exception, the self-help group "UTF-8 problems" will meet
in the large hall.
Hi
I wrote before about issues I was having with cephadm in 18.2.0. Sorry, I didn't see the helpful replies because my mail service binned the responses.
I still can't get the reef version of cephadm to work properly.
I had updated the system rpm to reef (ceph repo) and also upgraded the containerised ceph daemons to reef before my first email.
Both the system package cephadm and the one found at /var/lib/ceph/${fsid}/cephadm.* return the same error when running "cephadm version":
Traceback (most recent call last):
File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 9468, in <module>
main()
File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 9456, in main
r = ctx.func(ctx)
File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 2108, in _infer_image
ctx.image = infer_local_ceph_image(ctx, ctx.container_engine.path)
File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 2191, in infer_local_ceph_image
container_info = get_container_info(ctx, daemon, daemon_name is not None)
File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 2154, in get_container_info
matching_daemons = [d for d in daemons if daemon_name_or_type(d) == daemon_filter and d['fsid'] == ctx.fsid]
File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 2154, in <listcomp>
matching_daemons = [d for d in daemons if daemon_name_or_type(d) == daemon_filter and d['fsid'] == ctx.fsid]
File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 217, in __getattr__
return super().__getattribute__(name)
AttributeError: 'CephadmContext' object has no attribute 'fsid'
I am running into other issues as well, but I think they may point back to this issue of "'CephadmContext' object has no attribute 'fsid'"
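One thing I have been meaning to try (an assumption on my part, since the traceback dies while inferring the local image) is to give cephadm the image explicitly so it skips the inference step (the image tag below is just an example):

cephadm --image quay.io/ceph/ceph:v18.2.0 version
# or equivalently via the environment
CEPHADM_IMAGE=quay.io/ceph/ceph:v18.2.0 cephadm version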
Any help would be appreciated.
Regards,
Martin Conway
IT and Digital Media Manager
Research School of Physics
Australian National University
Canberra ACT 2601
+61 2 6125 1599
https://physics.anu.edu.au<https://physics.anu.edu.au/>