Hi Dan,
Thanks for your answer. I don't have a problem with increasing osd_max_scrubs (currently 1) as such. I would simply prefer a somewhat finer-grained way of controlling scrubbing than just doubling or tripling it right away.
Some more info. These 2 pools are data pools for a large FS. Unfortunately, we have a large percentage of small files, which is a pain for recovery and seemingly also for deep scrubbing. Our OSDs are about 25% used and I already had to increase the warning interval to 2 weeks. With all the warning grace parameters this means that we manage to deep scrub everything about every month. I need to plan for 75% utilisation, and a 3-month period is a bit too far on the risky side.
Our data is to a large percentage cold data. Client reads will not do the check for us; we need to combat bit-rot pro-actively.
The reasons I'm interested in parameters that initiate more scrubs while also converting more scrubs into deep scrubs are that:
1) scrubs seem to complete very fast. I almost never catch a PG in state "scrubbing", I usually only see "deep scrubbing".
2) I suspect the low deep-scrub count is due to a low number of deep scrubs being scheduled and not due to conflicting per-OSD deep scrub reservations. With the OSD count we have and the distribution over 12 servers I would expect a peak of at least 50% of OSDs being active in scrubbing instead of the 25% peak I'm seeing now. It ought to be possible to schedule more PGs for deep scrub than are currently being scheduled.
3) Every OSD having only 1 deep scrub active seems to have no measurable impact on user IO. If I could just get more PGs scheduled with 1 deep scrub per OSD it would already help a lot. Once this is working, I can eventually increase osd_max_scrubs when the OSDs fill up. For now I would just like (deep) scrub scheduling to look a bit harder and schedule more eligible PGs per time unit.
If we can get deep scrubbing up to an average of 42 PGs completing per hour while keeping osd_max_scrubs=1 to maintain the current IO impact, we should be able to complete a full deep scrub round with 75% full OSDs in about 30 days, which is the current tail-time at 25% utilisation. I believe a deep scrub of a PG in these pools currently takes 2-3 hours. It's just a gut feeling from some repair and deep-scrub commands; I would need to check the logs for more precise info.
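For reference, this is roughly how I arrive at those numbers (a sketch only; the JSON layout of pg dump differs a bit between releases, so the jq path may need adjusting, and jq is assumed to be installed):

ceph pg stat     # how many PGs are scrubbing / deep scrubbing right now
ceph pg dump --format json 2>/dev/null \
  | jq -r '(.pg_map.pg_stats // .pg_stats)[].last_deep_scrub_stamp' \
  | cut -c1-13 | sort | uniq -c | tail -24     # deep scrubs completed, bucketed per hour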
Increasing osd_max_scrubs would then be a further option, not the only one, to push for more deep scrubbing. My expectation is that values of 2-3 are fine due to the increasingly higher percentage of cold data, for which no interference with client IO will happen.
Hope that makes sense and there is a way beyond bumping osd_max_scrubs to increase the number of scheduled and executed deep scrubs.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Dan van der Ster <dvanders(a)gmail.com>
Sent: 05 January 2023 15:36
To: Frank Schilder
Cc: ceph-users(a)ceph.io
Subject: Re: [ceph-users] increasing number of (deep) scrubs
Hi Frank,
What is your current osd_max_scrubs, and why don't you want to increase it?
With 8+2 and 8+3 pools, each scrub occupies a scrub slot on 10 or 11
OSDs, so at a minimum it could take 3-4x as long to scrub
the data as it would for replicated pools.
If you want the scrubs to complete in time, you need to increase the
number of scrub slots accordingly.
On the other hand, IMHO the 1-week deadline for deep scrubs is often
much too ambitious for large clusters -- increasing the scrub
intervals is one solution, or I find it simpler to increase
mon_warn_pg_not_scrubbed_ratio and mon_warn_pg_not_deep_scrubbed_ratio
until you find a ratio that works for your cluster.
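For reference, those can all be adjusted centrally, e.g. (values are examples only):

ceph config set global mon_warn_pg_not_scrubbed_ratio 1.0
ceph config set global mon_warn_pg_not_deep_scrubbed_ratio 1.0
ceph config set osd osd_max_scrubs 2     # only if you do decide to add scrub slots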
Of course, all of this can impact detection of bit-rot, which anyway
can be covered by client reads if most data is accessed periodically.
But if the cluster is mostly idle or objects are generally not read,
then it would be preferable to increase the scrub slots via osd_max_scrubs.
Cheers, Dan
On Tue, Jan 3, 2023 at 2:30 AM Frank Schilder <frans(a)dtu.dk> wrote:
>
> Hi all,
>
> we are using 16T and 18T spinning drives as OSDs and I'm observing that they are not scrubbed as often as I would like. It looks like too few scrubs are scheduled for these large OSDs. My estimate is as follows: we have 852 spinning OSDs backing an 8+2 pool with 2024 and an 8+3 pool with 8192 PGs. On average I see something like 10 PGs of pool 1 and 12 PGs of pool 2 (deep) scrubbing. This amounts to only 232 out of 852 OSDs scrubbing and seems to be due to a conservative rate of (deep) scrubs being scheduled. The PGs (deep) scrub fairly quickly.
>
> I would like to increase gently the number of scrubs scheduled for these drives and *not* the number of scrubs per OSD. I'm looking at parameters like:
>
> osd_scrub_backoff_ratio
> osd_deep_scrub_randomize_ratio
>
> I'm wondering if lowering osd_scrub_backoff_ratio to 0.5 and, maybe, increasing osd_deep_scrub_randomize_ratio to 0.2 would have the desired effect? Are there other parameters to look at that allow gradual changes in the number of scrubs going on?
>
> Thanks a lot for your help!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
Dear Ceph users,
Our CephFS is not releasing/freeing up space after deleting hundreds of
terabytes of data.
By now, this has driven us into a "nearfull" OSD/pool situation and thus
throttles IO.
We are on ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5)
quincy (stable).
Recently, we moved a bunch of data to a new pool with better EC.
This was done by adding a new EC pool to the FS.
Then assigning the FS root to the new EC pool via the directory layout xattr
(so all new data is written to the new pool).
And finally copying old data to new folders.
I swapped the data as follows to retain the old directory structure.
I also made snapshots for validation purposes.
So basically:
cp -r mymount/mydata/ mymount/new/ # this creates copy on new pool
mkdir mymount/mydata/.snap/tovalidate      # snapshot of the original data
mkdir mymount/new/mydata/.snap/tovalidate  # snapshot of the copy, for comparison
mv mymount/mydata/ mymount/old/            # park the original under "old"
mv mymount/new/mydata mymount/             # move the copy into place
I could see the increase of data in the new pool as expected (ceph df).
I compared the snapshots with hashdeep to make sure the new data is alright.
Then I went ahead deleting the old data, basically:
rmdir mymount/old/mydata/.snap/* # this also included a bunch of other
older snapshots
rm -r mymount/old/mydata
At first we had a bunch of PGs with snaptrim/snaptrim_wait.
But they are done for quite some time now.
And now, two weeks later, the size of the old pool still hasn't
really decreased.
I'm still waiting for around 500 TB to be released (and much more is
planned).
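In case it is useful, this is roughly what I have been checking so far (the MDS and pool names are placeholders):

ceph pg ls snaptrim snaptrim_wait    # should return nothing by now
ceph df detail                       # per-pool usage
rados df
# strays still held by the MDS; data of deleted files is only released
# once the corresponding stray entries have been purged
ceph daemon mds.<active-mds> perf dump | grep -E 'num_strays|pq_'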
I honestly have no clue where to go from here.
From my point of view (i.e. the CephFS mount), the data is gone.
I also never hard/soft-linked it anywhere.
This doesn't seem to be a regular issue.
At least I couldn't find anything related or resolved in the docs or
user list, yet.
If anybody has an idea how to resolve this, I would highly appreciate it.
Best Wishes,
Mathias
We're happy to announce the 7th backport release in the Quincy series.
https://ceph.io/en/news/blog/2023/v17-2-7-quincy-released/
Notable Changes
---------------
* `ceph mgr dump` command now displays the name of the Manager module that
registered a RADOS client in the `name` field added to elements of the
`active_clients` array. Previously, only the address of a module's RADOS
client was shown in the `active_clients` array.
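  For example, the module behind each client can now be pulled out directly (the jq filter is just an illustration):

  ceph mgr dump | jq '.active_clients[].name'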
* mClock Scheduler: The mClock scheduler (default scheduler in Quincy) has
undergone significant usability and design improvements to address the slow
backfill issue. Some important changes are:
* The 'balanced' profile is set as the default mClock profile because it
represents a compromise between prioritizing client IO or recovery IO. Users
can then choose either the 'high_client_ops' profile to prioritize client IO
or the 'high_recovery_ops' profile to prioritize recovery IO.
* QoS parameters including reservation and limit are now specified in terms
of a fraction (range: 0.0 to 1.0) of the OSD's IOPS capacity.
* The cost parameters (osd_mclock_cost_per_io_usec_* and
osd_mclock_cost_per_byte_usec_*) have been removed. The cost of an operation
is now determined using the random IOPS and maximum sequential bandwidth
capability of the OSD's underlying device.
* Degraded object recovery is given higher priority when compared to misplaced
object recovery because degraded objects present a data safety issue not
present with objects that are merely misplaced. Therefore, backfilling
operations with the 'balanced' and 'high_client_ops' mClock profiles may
progress slower than what was seen with the 'WeightedPriorityQueue' (WPQ)
scheduler.
* The QoS allocations in all mClock profiles are optimized based on the above
fixes and enhancements.
* For more detailed information see:
https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/
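  As a quick illustration, the active profile and the measured IOPS capacity can be inspected and changed with the usual config commands (option values below are examples only):

  ceph config show osd.0 osd_mclock_profile
  ceph config show osd.0 osd_mclock_max_capacity_iops_hdd
  ceph config set osd osd_mclock_profile high_recovery_ops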
* RGW: S3 multipart uploads using Server-Side Encryption now replicate
correctly in multi-site. Previously, the replicas of such objects were
corrupted on decryption. A new tool, ``radosgw-admin bucket resync encrypted
multipart``, can be used to identify these original multipart uploads. The
``LastModified`` timestamp of any identified object is incremented by 1
nanosecond to cause peer zones to replicate it again. For multi-site
deployments that make any use of Server-Side Encryption, we recommend
running this command against every bucket in every zone after all zones have
upgraded.
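  A per-bucket invocation presumably looks like the following (the bucket name is a placeholder; check `radosgw-admin help` for the exact syntax in your build):

  radosgw-admin bucket resync encrypted multipart --bucket=<bucket-name>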
* CephFS: The MDS now evicts clients which are not advancing their request tids,
since this causes a large buildup of session metadata, resulting in the MDS going
read-only when the RADOS operation exceeds the size threshold. The
`mds_session_metadata_threshold` config controls the maximum size that the
(encoded) session metadata can grow to.
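  If needed, the threshold can be raised through the usual config mechanism (the value below is only an example):

  ceph config set mds mds_session_metadata_threshold 16777216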
* CephFS: After recovering a Ceph File System by following the disaster
recovery procedure, the recovered files under the `lost+found` directory can now
be deleted.
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at https://download.ceph.com/tarballs/ceph-17.2.7.tar.gz
* Containers at https://quay.io/repository/ceph/ceph
* For packages, see https://docs.ceph.com/en/latest/install/get-packages/
* Release git sha1: b12291d110049b2f35e32e0de30d70e9a4c060d2
Hi all,
we had a client with the warning "[WRN] MDS_CLIENT_OLDEST_TID: 1 clients failing to advance oldest client/flush tid". I looked at the client and there was nothing going on, so I rebooted it. After the client was back, the message was still there. To clean this up I failed the MDS. Unfortunately, the MDS that took over remained stuck in rejoin without doing anything. All that happened in the log was:
[root@ceph-10 ceph]# tail -f ceph-mds.ceph-10.log
2023-07-20T15:54:29.147+0200 7fedb9c9f700 1 mds.2.896604 rejoin_start
2023-07-20T15:54:29.161+0200 7fedb9c9f700 1 mds.2.896604 rejoin_joint_start
2023-07-20T15:55:28.005+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896614 from mon.4
2023-07-20T15:56:00.278+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896615 from mon.4
[...]
2023-07-20T16:02:54.935+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896653 from mon.4
2023-07-20T16:03:07.276+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896654 from mon.4
After some time I decided to give another fail a try and, this time, the replacement daemon went to active state really fast.
If I have a message like the above, what is the clean way of getting the client clean again (version: 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable))?
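For context, what I would otherwise have tried instead of failing the MDS is to find and evict only the offending session (names/IDs below are placeholders):

ceph tell mds.<active-mds> session ls                    # look for the client that is not advancing its tid
ceph tell mds.<active-mds> client evict id=<session-id>  # evict just that client instead of failing the MDS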
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
We have a production setup of 36 OSDs (SAS disks) totalling 180 TB allocated to a single Ceph cluster with 3 monitors and 3 managers. There were 830 volumes and VMs created in OpenStack with Ceph as the backend. On Sep 21, users reported slowness in accessing the VMs.
Analysing the logs led us to problems with the SAS disks, network congestion and the Ceph configuration (all default values were used). We upgraded the network from 1 Gbps to 10 Gbps for the public and cluster networks. There was no change.
The Ceph OSD benchmark showed that 28 out of 36 OSDs reported very low IOPS of 30 to 50 while the remaining OSDs showed 300+ IOPS.
We gradually started reducing the load on the Ceph cluster and the volume count is now 650. The slow operations have gradually reduced, but I am aware that this is not the solution.
The Ceph configuration was updated as follows:
osd_journal_size = 10 GB
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1
bluestore_cache_trim_max_skip_pinned=10000
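For reference, the runtime options were applied along these lines (a sketch, assuming the centralized config store is in use):

ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_op_priority 1
ceph config set osd bluestore_cache_trim_max_skip_pinned 10000
# osd_journal_size lives in ceph.conf and only takes effect when a journal is (re)created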
After one month, we now face another issue: the mgr daemon stopped on all 3 quorum nodes and 16 OSDs went down. From the ceph-mon and ceph-mgr logs I could not determine the reason. Please guide me, as this is a production setup.
I have a bucket which got injected with a bucket policy that locks the
bucket even to the bucket owner. The bucket now cannot be accessed (even
getting its info or deleting the bucket policy does not work). I have looked in the
radosgw-admin command for a way to delete a bucket policy but do not see
anything. I presume I will need to somehow remove the bucket policy from
wherever it is stored in the bucket metadata / omap etc. If anyone can point
me in the right direction on that, I would appreciate it. Thanks
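For reference, the furthest I have gotten is locating where the policy is stored, via the bucket instance metadata (bucket name and ID below are placeholders):

radosgw-admin bucket stats --bucket=<bucket>                     # shows the bucket id
radosgw-admin metadata get bucket.instance:<bucket>:<bucket-id>  # dumps the bucket instance metadata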
Respectfully,
*Wes Dillingham*
wes(a)wesdillingham.com
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
Hi,
Since the 6.5 kernel addressed the regression in the readahead handling
code, we went ahead and installed this kernel on a couple of mail / web
clusters (Ubuntu 6.5.1-060501-generic
#202309020842 SMP PREEMPT_DYNAMIC Sat Sep 2 08:48:34 UTC 2023 x86_64
x86_64 x86_64 GNU/Linux). Since then we occasionally see the following
being logged by the kernel:
[Sun Sep 10 07:19:00 2023] workqueue: delayed_work [ceph] hogged CPU for
>10000us 4 times, consider switching to WQ_UNBOUND
[Sun Sep 10 08:41:24 2023] workqueue: ceph_con_workfn [libceph] hogged
CPU for >10000us 4 times, consider switching to WQ_UNBOUND
[Sun Sep 10 11:05:55 2023] workqueue: delayed_work [ceph] hogged CPU for
>10000us 8 times, consider switching to WQ_UNBOUND
[Sun Sep 10 12:54:38 2023] workqueue: ceph_con_workfn [libceph] hogged
CPU for >10000us 8 times, consider switching to WQ_UNBOUND
[Sun Sep 10 19:06:37 2023] workqueue: ceph_con_workfn [libceph] hogged
CPU for >10000us 16 times, consider switching to WQ_UNBOUND
[Mon Sep 11 10:53:33 2023] workqueue: ceph_con_workfn [libceph] hogged
CPU for >10000us 32 times, consider switching to WQ_UNBOUND
[Tue Sep 12 10:14:03 2023] workqueue: ceph_con_workfn [libceph] hogged
CPU for >10000us 64 times, consider switching to WQ_UNBOUND
[Tue Sep 12 11:14:33 2023] workqueue: ceph_cap_reclaim_work [ceph]
hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
We wonder if this is a new phenomenon, or whether it is simply logged by
the new kernel and was not logged before.
However, we have hit a few OOM situations since we switched to the new
kernel because of ceph_cap_reclaim_work events (the OOM happens because Apache
threads keep piling up as they cannot access CephFS). We then also see MDS
slow ops reported. This might be related to a backup job that is running
on a backup server. We did not observe this behavior on the 5.12.19 kernel.
Ceph cluster is on 16.2.11 currently.
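For what it is worth, when the slow ops appear we look at them like this (the MDS name is a placeholder):

ceph health detail
ceph daemon mds.<name> dump_ops_in_flight   # run on the MDS host; shows which ops are stuck and for how long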
Does anyone have some insight on this?
Thanks,
Stefan
Hi all,
I see 17.2.7 Quincy is published as debian-bullseye packages, so I
tried it on a test cluster.
I must say I was not expecting the big dashboard change in a patch
release. Also, all the "cluster utilization" numbers are blank now (any
way to fix that?), so the dashboard is much less usable.
Thoughts?
Bringing up that topic again:
is it possible to log the bucket name in the rgw client logs?
Currently I am only able to see the bucket name when someone accesses the bucket
via https://TLD/bucket/object instead of https://bucket.TLD/object.
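What I have been experimenting with so far is writing the ops log to a file (a sketch; as far as I can tell rgw_ops_log_file_path only exists in newer releases, so please treat that option as an assumption on my side, and adjust the client section name to your setup):

ceph config set client.rgw.<instance> rgw_enable_ops_log true
ceph config set client.rgw.<instance> rgw_ops_log_file_path /var/log/ceph/rgw-ops.log  # assumed option; the ops log records bucket and object per request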
On Tue, 3 Jan 2023 at 10:25, Boris Behrens <bb(a)kervyn.de> wrote:
> Hi,
> I am looking forward to move our logs from
> /var/log/ceph/ceph-client...log to our logaggregator.
>
> Is there a way to have the bucket name in the log file?
>
> Or can I write the rgw_enable_ops_log into a file? Maybe I could work with
> this.
>
> Cheers and happy new year
> Boris
>
--
This time, as an exception, the self-help group "UTF-8 problems" will meet
in the large hall.
Hi
I wrote before about issues I was having with cephadm in 18.2.0. Sorry, I didn't see the helpful replies because my mail service binned the responses.
I still can't get the reef version of cephadm to work properly.
I had updated the system rpm to reef (ceph repo) and also upgraded the containerised ceph daemons to reef before my first email.
Both the system package cephadm and the one found at /var/lib/ceph/${fsid}/cephadm.* return the same error when running "cephadm version":
Traceback (most recent call last):
File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 9468, in <module>
main()
File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 9456, in main
r = ctx.func(ctx)
File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 2108, in _infer_image
ctx.image = infer_local_ceph_image(ctx, ctx.container_engine.path)
File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 2191, in infer_local_ceph_image
container_info = get_container_info(ctx, daemon, daemon_name is not None)
File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 2154, in get_container_info
matching_daemons = [d for d in daemons if daemon_name_or_type(d) == daemon_filter and d['fsid'] == ctx.fsid]
File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 2154, in <listcomp>
matching_daemons = [d for d in daemons if daemon_name_or_type(d) == daemon_filter and d['fsid'] == ctx.fsid]
File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 217, in __getattr__
return super().__getattribute__(name)
AttributeError: 'CephadmContext' object has no attribute 'fsid'
I am running into other issues as well, but I think they may point back to this issue of "'CephadmContext' object has no attribute 'fsid'"
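One thing I have been meaning to try (an assumption on my part, since the traceback dies while inferring the local image) is to give cephadm the image explicitly so it skips the inference step (the image tag below is just an example):

cephadm --image quay.io/ceph/ceph:v18.2.0 version
# or equivalently via the environment
CEPHADM_IMAGE=quay.io/ceph/ceph:v18.2.0 cephadm version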
Any help would be appreciated.
Regards,
Martin Conway
IT and Digital Media Manager
Research School of Physics
Australian National University
Canberra ACT 2601
+61 2 6125 1599
https://physics.anu.edu.au<https://physics.anu.edu.au/>