On 29/05/2023 20.55, Anthony D'Atri wrote:
> Check the uptime for the OSDs in question
I restarted all my OSDs within the past 10 days or so. Maybe OSD
restarts are somehow breaking these stats?
>
>> On May 29, 2023, at 6:44 AM, Hector Martin <marcan(a)marcan.st> wrote:
>>
>> Hi,
>>
>> I'm watching a cluster finish a bunch of backfilling, and I noticed that
>> quite often PGs end up with zero misplaced objects, even though they are
>> still backfilling.
>>
>> Right now the cluster is down to 6 backfilling PGs:
>>
>> data:
>> volumes: 1/1 healthy
>> pools: 6 pools, 268 pgs
>> objects: 18.79M objects, 29 TiB
>> usage: 49 TiB used, 25 TiB / 75 TiB avail
>> pgs: 262 active+clean
>> 6 active+remapped+backfilling
>>
>> But there are no misplaced objects, and the misplaced column in `ceph pg
>> dump` is zero for all PGs.
>>
>> If I do a `ceph pg dump_json`, I can see `num_objects_recovered`
>> increasing for these PGs... but the misplaced count is still 0.
>>
>> Is there something else that would cause recoveries/backfills other than
>> misplaced objects? Or perhaps there is a bug somewhere causing the
>> misplaced object count to be misreported as 0 sometimes?
>>
>> # ceph -v
>> ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy
>> (stable)
>>
>> - Hector
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io
>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
>
- Hector
Hi all,
I just had created a ceph cluster to use cephfs. When i create the a ceph
fs pool i get the filesystem below error.
# ceph osd pool create cephfs_data 128
pool 'cephfs_data' created
# ceph osd pool create cephfs_metadata 128
pool 'cephfs_metadata' created
# ceph fs new cephfs cephfs_metadata cephfs_data
new fs with metadata pool 6 and data pool 5
# ceph -s
cluster:
id: 1c27def45-f0f9-494d-sfke-eb4323432fd
health: HEALTH_ERR
1 filesystem is offline
1 filesystem is online with fewer MDS than max_mds
services:
mon: 2 daemons, quorum ceph-mon01,ceph-mon02
mgr: ceph-adm01(active)
mds: cephfs-0/0/1 up
osd: 12 osds: 12 up, 12 in
data:
pools: 2 pools, 256 pgs
objects: 0 objects, 0 B
usage: 12 GiB used, 588 GiB / 600 GiB avail
pgs: 256 active+clean
but when i check the max_mds for the ceph fs it says 1
# ceph fs get cephfs | grep max_mds
max_mds 1
Let anyone know what am i missing here? Any inputs is much appreciated.
Regards,
Ram
Ceph-explorer..
I have some questions for those who’ve experienced this issue.
1. It seems like those reporting this issue are seeing it strictly after upgrading to Octopus. From what version did each of these sites upgrade to Octopus? From Nautilus? Mimic? Luminous?
2. Does anyone have any lifecycle rules on a bucket experiencing this issue? If so, please describe.
3. Is anyone making copies of the affected objects (to same or to a different bucket) prior to seeing the issue? And if they are making copies, does the destination bucket have lifecycle rules? And if they are making copies, are those copies ever being removed?
4. Is anyone experiencing this issue willing to run their RGWs with 'debug_ms=1'? That would allow us to see a request from an RGW to either remove a tail object or decrement its reference counter (and when its counter reaches 0 it will be deleted).
Thanks,
Eric
> On Nov 12, 2020, at 4:54 PM, huxiaoyu(a)horebdata.cn wrote:
>
> Looks like this is a very dangerous bug for data safety. Hope the bug would be quickly identified and fixed.
>
> best regards,
>
> Samuel
>
>
>
> huxiaoyu(a)horebdata.cn <mailto:huxiaoyu@horebdata.cn>
>
> From: Janek Bevendorff
> Date: 2020-11-12 18:17
> To: huxiaoyu(a)horebdata.cn <mailto:huxiaoyu@horebdata.cn>; EDH - Manuel Rios; Rafael Lopez
> CC: Robin H. Johnson; ceph-users
> Subject: Re: [ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk
> I have never seen this on Luminous. I recently upgraded to Octopus and the issue started occurring only few weeks later.
>
> On 12/11/2020 16:37, huxiaoyu(a)horebdata.cn wrote:
> which Ceph versions are affected by this RGW bug/issues? Luminous, Mimic, Octupos, or the latest?
>
> any idea?
>
> samuel
>
>
>
> huxiaoyu(a)horebdata.cn
>
> From: EDH - Manuel Rios
> Date: 2020-11-12 14:27
> To: Janek Bevendorff; Rafael Lopez
> CC: Robin H. Johnson; ceph-users
> Subject: [ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk
> This same error caused us to wipe a full cluster of 300TB... will be related to some rados index/database bug not to s3.
>
> As Janek exposed is a mayor issue, because the error silent happend and you can only detect it with S3, when you're going to delete/purge a S3 bucket. Dropping NoSuchKey. Error is not related to S3 logic ..
>
> Hope this time dev's can take enought time to find and resolve the issue. Error happens with low ec profiles, even with replica x3 in some cases.
>
> Regards
>
>
>
> -----Mensaje original-----
> De: Janek Bevendorff <janek.bevendorff(a)uni-weimar.de <mailto:janek.bevendorff@uni-weimar.de>>
> Enviado el: jueves, 12 de noviembre de 2020 14:06
> Para: Rafael Lopez <rafael.lopez(a)monash.edu <mailto:rafael.lopez@monash.edu>>
> CC: Robin H. Johnson <robbat2(a)gentoo.org <mailto:robbat2@gentoo.org>>; ceph-users <ceph-users(a)ceph.io <mailto:ceph-users@ceph.io>>
> Asunto: [ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk
>
> Here is a bug report concerning (probably) this exact issue:
> https://tracker.ceph.com/issues/47866 <https://tracker.ceph.com/issues/47866>
>
> I left a comment describing the situation and my (limited) experiences with it.
>
>
> On 11/11/2020 10:04, Janek Bevendorff wrote:
>>
>> Yeah, that seems to be it. There are 239 objects prefixed
>> .8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh in my dump. However, there are none
>> of the multiparts from the other file to be found and the head object
>> is 0 bytes.
>>
>> I checked another multipart object with an end pointer of 11.
>> Surprisingly, it had way more than 11 parts (39 to be precise) named
>> .1, .1_1 .1_2, .1_3, etc. Not sure how Ceph identifies those, but I
>> could find them in the dump at least.
>>
>> I have no idea why the objects disappeared. I ran a Spark job over all
>> buckets, read 1 byte of every object and recorded errors. Of the 78
>> buckets, two are missing objects. One bucket is missing one object,
>> the other 15. So, luckily, the incidence is still quite low, but the
>> problem seems to be expanding slowly.
>>
>>
>> On 10/11/2020 23:46, Rafael Lopez wrote:
>>> Hi Janek,
>>>
>>> What you said sounds right - an S3 single part obj won't have an S3
>>> multipart string as part of the prefix. S3 multipart string looks
>>> like "2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme".
>>>
>>> From memory, single part S3 objects that don't fit in a single rados
>>> object are assigned a random prefix that has nothing to do with
>>> the object name, and the rados tail/data objects (not the head
>>> object) have that prefix.
>>> As per your working example, the prefix for that would be
>>> '.8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh'. So there would be (239) "shadow"
>>> objects with names containing that prefix, and if you add up the
>>> sizes it should be the size of your S3 object.
>>>
>>> You should look at working and non working examples of both single
>>> and multipart S3 objects, as they are probably all a bit different
>>> when you look in rados.
>>>
>>> I agree it is a serious issue, because once objects are no longer in
>>> rados, they cannot be recovered. If it was a case that there was a
>>> link broken or rados objects renamed, then we could work to
>>> recover...but as far as I can tell, it looks like stuff is just
>>> vanishing from rados. The only explanation I can think of is some
>>> (rgw or rados) background process is incorrectly doing something with
>>> these objects (eg. renaming/deleting). I had thought perhaps it was a
>>> bug with the rgw garbage collector..but that is pure speculation.
>>>
>>> Once you can articulate the problem, I'd recommend logging a bug
>>> tracker upstream.
>>>
>>>
>>> On Wed, 11 Nov 2020 at 06:33, Janek Bevendorff
>>> <janek.bevendorff(a)uni-weimar.de <mailto:janek.bevendorff@uni-weimar.de>
>>> <mailto:janek.bevendorff@uni-weimar.de <mailto:janek.bevendorff@uni-weimar.de>>> wrote:
>>>
>>> Here's something else I noticed: when I stat objects that work
>>> via radosgw-admin, the stat info contains a "begin_iter" JSON
>>> object with RADOS key info like this
>>>
>>>
>>> "key": {
>>> "name":
>>> "29/items/WIDE-20110924034843-crawl420/WIDE-20110924065228-02544.warc.gz",
>>> "instance": "",
>>> "ns": ""
>>> }
>>>
>>>
>>> and then "end_iter" with key info like this:
>>>
>>>
>>> "key": {
>>> "name":
>>> ".8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh_239",
>>> "instance": "",
>>> "ns": "shadow"
>>> }
>>>
>>> However, when I check the broken 0-byte object, the "begin_iter"
>>> and "end_iter" keys look like this:
>>>
>>>
>>> "key": {
>>> "name":
>>> "29/items/WIDE-20110903143858-crawl428/WIDE-20110903143858-01166.warc.gz.2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme.1",
>>> "instance": "",
>>> "ns": "multipart"
>>> }
>>>
>>> [...]
>>>
>>>
>>> "key": {
>>> "name":
>>> "29/items/WIDE-20110903143858-crawl428/WIDE-20110903143858-01166.warc.gz.2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme.19",
>>> "instance": "",
>>> "ns": "multipart"
>>> }
>>>
>>> So, it's the full name plus a suffix and the namespace is
>>> multipart, not shadow (or empty). This in itself may just be an
>>> artefact of whether the object was uploaded in one go or as a
>>> multipart object, but the second difference is that I cannot find
>>> any of the multipart objects in my pool's object name dump. I
>>> can, however, find the shadow RADOS object of the intact S3 object.
>>>
>>>
>>>
>>>
>>> --
>>> *Rafael Lopez*
>>> Devops Systems Engineer
>>> Monash University eResearch Centre
>>>
>>> T: +61 3 9905 9118 <tel:%2B61%203%209905%209118>
>>> E: rafael.lopez(a)monash.edu <mailto:rafael.lopez@monash.edu>
>>>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
Hi,
In the discussion after the Ceph Month talks yesterday, there was a bit
of chat about cephadm / containers / packages. IIRC, Sage observed that
a common reason in the recent user survey for not using cephadm was that
it only worked on containerised deployments. I think he then went on to
say that he hadn't heard any compelling reasons why not to use
containers, and suggested that resistance was essentially a user
education question[0].
I'd like to suggest, briefly, that:
* containerised deployments are more complex to manage, and this is not
simply a matter of familiarity
* reducing the complexity of systems makes admins' lives easier
* the trade-off of the pros and cons of containers vs packages is not
obvious, and will depend on deployment needs
* Ceph users will benefit from both approaches being supported into the
future
We make extensive use of containers at Sanger, particularly for
scientific workflows, and also for bundling some web apps (e.g.
Grafana). We've also looked at a number of container runtimes (Docker,
singularity, charliecloud). They do have advantages - it's easy to
distribute a complex userland in a way that will run on (almost) any
target distribution; rapid "cloud" deployment; some separation (via
namespaces) of network/users/processes.
For what I think of as a 'boring' Ceph deploy (i.e. install on a set of
dedicated hardware and then run for a long time), I'm not sure any of
these benefits are particularly relevant and/or compelling - Ceph
upstream produce Ubuntu .debs and Canonical (via their Ubuntu Cloud
Archive) provide .debs of a couple of different Ceph releases per Ubuntu
LTS - meaning we can easily separate out OS upgrade from Ceph upgrade.
And upgrading the Ceph packages _doesn't_ restart the daemons[1],
meaning that we maintain control over restart order during an upgrade.
And while we might briefly install packages from a PPA or similar to
test a bugfix, we roll those (test-)cluster-wide, rather than trying to
run a mixed set of versions on a single cluster - and I understand this
single-version approach is best practice.
Deployment via containers does bring complexity; some examples we've
found at Sanger (not all Ceph-related, which we run from packages):
* you now have 2 process supervision points - dockerd and systemd
* docker updates (via distribution unattended-upgrades) have an
unfortunate habit of rudely restarting everything
* docker squats on a chunk of RFC 1918 space (and telling it not to can
be a bore), which coincides with our internal network...
* there is more friction if you need to look inside containers
(particularly if you have a lot running on a host and are trying to find
out what's going on)
* you typically need to be root to build docker containers (unlike packages)
* we already have package deployment infrastructure (which we'll need
regardless of deployment choice)
We also currently use systemd overrides to tweak some of the Ceph units
(e.g. to do some network sanity checks before bringing up an OSD), and
have some tools to pair OSD / journal / LVM / disk device up; I think
these would be more fiddly in a containerised deployment. I'd accept
that fixing these might just be a SMOP[2] on our part.
Now none of this is show-stopping, and I am most definitely not saying
"don't ship containers". But I think there is added complexity to your
deployment from going the containers route, and that is not simply a
"learn how to use containers" learning curve. I do think it is
reasonable for an admin to want to reduce the complexity of what they're
dealing with - after all, much of my job is trying to automate or
simplify the management of complex systems!
I can see from a software maintainer's point of view that just building
one container and shipping it everywhere is easier than building
packages for a number of different distributions (one of my other hats
is a Debian developer, and I have a bunch of machinery for doing this
sort of thing). But it would be a bit unfortunate if the general thrust
of "let's make Ceph easier to set up and manage" was somewhat derailed
with "you must use containers, even if they make your life harder".
I'm not going to criticise anyone who decides to use a container-based
deployment (and I'm sure there are plenty of setups where it's an
obvious win), but if I were advising someone who wanted to set up and
use a 'boring' Ceph cluster for the medium term, I'd still advise on
using packages. I don't think this makes me a luddite :)
Regards, and apologies for the wall of text,
Matthew
[0] I think that's a fair summary!
[1] This hasn't always been true...
[2] Simple (sic.) Matter of Programming
--
The Wellcome Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
Hi all,
I have a problem regarding upgrading Ceph cluster from Pacific to Quincy
version with cephadm. I have successfully upgraded the cluster to the
latest Pacific (16.2.11). But when I run the following command to upgrade
the cluster to 17.2.5, After upgrading 3/4 mgrs, the upgrade process stops
with "Unexpected error". (everything is on a private network)
ceph orch upgrade start my-private-repo/quay-io/ceph/ceph:v17.2.5
I also tried the 17.2.4 version.
cephadm fails to check the hosts' status and marks them as offline:
cephadm 2023-04-06T10:19:59.998510+0000 mgr.host9.arhpnd (mgr.4516356) 5782
: cephadm [DBG] host host4 (x.x.x.x) failed check
cephadm 2023-04-06T10:19:59.998553+0000 mgr.host9.arhpnd (mgr.4516356) 5783
: cephadm [DBG] Host "host4" marked as offline. Skipping daemon refresh
cephadm 2023-04-06T10:19:59.998581+0000 mgr.host9.arhpnd (mgr.4516356) 5784
: cephadm [DBG] Host "host4" marked as offline. Skipping gather facts
refresh
cephadm 2023-04-06T10:19:59.998609+0000 mgr.host9.arhpnd (mgr.4516356) 5785
: cephadm [DBG] Host "host4" marked as offline. Skipping network refresh
cephadm 2023-04-06T10:19:59.998633+0000 mgr.host9.arhpnd (mgr.4516356) 5786
: cephadm [DBG] Host "host4" marked as offline. Skipping device refresh
cephadm 2023-04-06T10:19:59.998659+0000 mgr.host9.arhpnd (mgr.4516356) 5787
: cephadm [DBG] Host "host4" marked as offline. Skipping osdspec preview
refresh
cephadm 2023-04-06T10:19:59.998682+0000 mgr.host9.arhpnd (mgr.4516356) 5788
: cephadm [DBG] Host "host4" marked as offline. Skipping autotune
cluster 2023-04-06T10:20:00.000151+0000 mon.host8 (mon.0) 158587 : cluster
[ERR] Health detail: HEALTH_ERR 9 hosts fail cephadm check; Upgrade: failed
due to an unexpected exception
cluster 2023-04-06T10:20:00.000191+0000 mon.host8 (mon.0) 158588 : cluster
[ERR] [WRN] CEPHADM_HOST_CHECK_FAILED: 9 hosts fail cephadm check
cluster 2023-04-06T10:20:00.000202+0000 mon.host8 (mon.0) 158589 : cluster
[ERR] host host7 (x.x.x.x) failed check: Unable to reach remote host
host7. Process exited with non-zero exit status 3
cluster 2023-04-06T10:20:00.000213+0000 mon.host8 (mon.0) 158590 : cluster
[ERR] host host2 (x.x.x.x) failed check: Unable to reach remote host
host2. Process exited with non-zero exit status 3
cluster 2023-04-06T10:20:00.000220+0000 mon.host8 (mon.0) 158591 : cluster
[ERR] host host8 (x.x.x.x) failed check: Unable to reach remote host
host8. Process exited with non-zero exit status 3
cluster 2023-04-06T10:20:00.000228+0000 mon.host8 (mon.0) 158592 : cluster
[ERR] host host4 (x.x.x.x) failed check: Unable to reach remote host
host4. Process exited with non-zero exit status 3
cluster 2023-04-06T10:20:00.000240+0000 mon.host8 (mon.0) 158593 : cluster
[ERR] host host3 (x.x.x.x) failed check: Unable to reach remote host
host3. Process exited with non-zero exit status 3
and here are some outputs of the commands:
[root@host8 ~]# ceph -s
cluster:
id: xxx
health: HEALTH_ERR
9 hosts fail cephadm check
Upgrade: failed due to an unexpected exception
services:
mon: 5 daemons, quorum host8,host1,host7,host2,host9 (age 2w)
mgr: host9.arhpnd(active, since 105m), standbys: host8.jowfih,
host1.warjsr, host2.qyavjj
mds: 1/1 daemons up, 3 standby
osd: 37 osds: 37 up (since 8h), 37 in (since 3w)
data:
io:
client:
progress:
Upgrade to 17.2.5 (0s)
[............................]
[root@host8 ~]# ceph orch upgrade status
{
"target_image": "my-private-repo/quay-io/ceph/ceph@sha256
:34c763383e3323c6bb35f3f2229af9f466518d9db926111277f5e27ed543c427",
"in_progress": true,
"which": "Upgrading all daemon types on all hosts",
"services_complete": [],
"progress": "3/59 daemons upgraded",
"message": "Error: UPGRADE_EXCEPTION: Upgrade: failed due to an
unexpected exception",
"is_paused": true
}
[root@host8 ~]# ceph cephadm check-host host7
check-host failed:
Host 'host7' not found. Use 'ceph orch host ls' to see all managed hosts.
[root@host8 ~]# ceph versions
{
"mon": {
"ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894)
pacific (stable)": 5
},
"mgr": {
"ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894)
pacific (stable)": 1,
"ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757)
quincy (stable)": 3
},
"osd": {
"ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894)
pacific (stable)": 37
},
"mds": {
"ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894)
pacific (stable)": 4
},
"overall": {
"ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894)
pacific (stable)": 47,
"ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757)
quincy (stable)": 3
}
}
The strange thing is I can rollback the cluster status by failing to
not-upgraded mgr like this:
ceph mgr fail
ceph orch upgrade start my-private-repo/quay-io/ceph/ceph:v16.2.11
Would you happen to have any idea about this?
Best regards,
Reza
Hi Dan,
thanks for your answer. I don't have a problem with increasing osd_max_scrubs (=1 at the moment) as such. I would simply prefer a somewhat finer grained way of controlling scrubbing than just doubling or tripling it right away.
Some more info. These 2 pools are data pools for a large FS. Unfortunately, we have a large percentage of small files, which is a pain for recovery and seemingly also for deep scrubbing. Our OSDs are about 25% used and I had to increase the warning interval already to 2 weeks. With all the warning grace parameters this means that we manage to deep scrub everything about every month. I need to plan for 75% utilisation and a 3 months period is a bit far on the risky side.
Our data is to a large percentage cold data. Client reads will not do the check for us, we need to combat bit-rot pro-actively.
The reasons I'm interested in parameters initiating more scrubs while also converting more scrubs into deep scrubs are, that
1) scrubs seem to complete very fast. I almost never catch a PG in state "scrubbing", I usually only see "deep scrubbing".
2) I suspect the low deep-scrub count is due to a low number of deep-scrubs scheduled and not due to conflicting per-OSD deep scrub reservations. With the OSD count we have and the distribution over 12 servers I would expect at least a peak of 50% OSDs being active in scrubbing instead of the 25% peak I'm seeing now. It ought to be possible to schedule more PGs for deep scrub than actually are.
3) Every OSD having only 1 deep scrub active seems to have no measurable impact on user IO. If I could just get more PGs scheduled with 1 deep scrub per OSD it would already help a lot. Once this is working, I can eventually increase osd_max_scrubs when the OSDs fill up. For now I would just like that (deep) scrub scheduling looks a bit harder and schedules more eligible PGs per time unit.
If we can get deep scrubbing up to an average of 42PGs completing per hour with keeping osd_max_scrubs=1 to maintain current IO impact, we should be able to complete a deep scrub with 75% full OSDs in about 30 days. This is the current tail-time with 25% utilisation. I believe currently a deep scrub of a PG in these pools takes 2-3 hours. Its just a gut feeling from some repair and deep-scrub commands, I would need to check logs for more precise info.
Increasing osd_max_scrubs would then be a further and not the only option to push for more deep scrubbing. My expectation would be that values of 2-3 are fine due to the increasingly higher percentage of cold data for which no interference with client IO will happen.
Hope that makes sense and there is a way beyond bumping osd_max_scrubs to increase the number of scheduled and executed deep scrubs.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Dan van der Ster <dvanders(a)gmail.com>
Sent: 05 January 2023 15:36
To: Frank Schilder
Cc: ceph-users(a)ceph.io
Subject: Re: [ceph-users] increasing number of (deep) scrubs
Hi Frank,
What is your current osd_max_scrubs, and why don't you want to increase it?
With 8+2, 8+3 pools each scrub is occupying the scrub slot on 10 or 11
OSDs, so at a minimum it could take 3-4x the amount of time to scrub
the data than if those were replicated pools.
If you want the scrub to complete in time, you need to increase the
amount of scrub slots accordingly.
On the other hand, IMHO the 1-week deadline for deep scrubs is often
much too ambitious for large clusters -- increasing the scrub
intervals is one solution, or I find it simpler to increase
mon_warn_pg_not_scrubbed_ratio and mon_warn_pg_not_deep_scrubbed_ratio
until you find a ratio that works for your cluster.
Of course, all of this can impact detection of bit-rot, which anyway
can be covered by client reads if most data is accessed periodically.
But if the cluster is mostly idle or objects are generally not read,
then it would be preferable to increase slots osd_max_scrubs.
Cheers, Dan
On Tue, Jan 3, 2023 at 2:30 AM Frank Schilder <frans(a)dtu.dk> wrote:
>
> Hi all,
>
> we are using 16T and 18T spinning drives as OSDs and I'm observing that they are not scrubbed as often as I would like. It looks like too few scrubs are scheduled for these large OSDs. My estimate is as follows: we have 852 spinning OSDs backing a 8+2 pool with 2024 and an 8+3 pool with 8192 PGs. On average I see something like 10PGs of pool 1 and 12 PGs of pool 2 (deep) scrubbing. This amounts to only 232 out of 852 OSDs scrubbing and seems to be due to a conservative rate of (deep) scrubs being scheduled. The PGs (dep) scrub fairly quickly.
>
> I would like to increase gently the number of scrubs scheduled for these drives and *not* the number of scrubs per OSD. I'm looking at parameters like:
>
> osd_scrub_backoff_ratio
> osd_deep_scrub_randomize_ratio
>
> I'm wondering if lowering osd_scrub_backoff_ratio to 0.5 and, maybe, increasing osd_deep_scrub_randomize_ratio to 0.2 would have the desired effect? Are there other parameters to look at that allow gradual changes in the number of scrubs going on?
>
> Thanks a lot for your help!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
Bringing up that topic again:
is it possible to log the bucket name in the rgw client logs?
currently I am only to know the bucket name when someone access the bucket
via https://TLD/bucket/object instead of https://bucket.TLD/object.
Am Di., 3. Jan. 2023 um 10:25 Uhr schrieb Boris Behrens <bb(a)kervyn.de>:
> Hi,
> I am looking forward to move our logs from
> /var/log/ceph/ceph-client...log to our logaggregator.
>
> Is there a way to have the bucket name in the log file?
>
> Or can I write the rgw_enable_ops_log into a file? Maybe I could work with
> this.
>
> Cheers and happy new year
> Boris
>
--
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.
Hey everyone,
On 20/10/2022 10:12, Christian Rohmann wrote:
> 1) May I bring up again my remarks about the timing:
>
> On 19/10/2022 11:46, Christian Rohmann wrote:
>
>> I believe the upload of a new release to the repo prior to the
>> announcement happens quite regularly - it might just be due to the
>> technical process of releasing.
>> But I agree it would be nice to have a more "bit flip" approach to
>> new releases in the repo and not have the packages appear as updates
>> prior to the announcement and final release and update notes.
> By my observations sometimes there are packages available on the
> download servers via the "last stable" folders such as
> https://download.ceph.com/debian-quincy/ quite some time before the
> announcement of a release is out.
> I know it's hard to time this right with mirrors requiring some time
> to sync files, but would be nice to not see the packages or have
> people install them before there are the release notes and potential
> pointers to changes out.
Todays 16.2.11 release shows the exact issue I described above ....
1) 16.2.11 packages are already available via e.g.
https://download.ceph.com/debian-pacific
2) release notes not yet merged:
(https://github.com/ceph/ceph/pull/49839), thus
https://ceph.io/en/news/blog/2022/v16-2-11-pacific-released/ show a 404 :-)
3) No announcement like
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/QOCU563UD3…
to the ML yet.
Regards
Christian
Hi everyone
I'm new to CEPH, just a french 4 days training session with Octopus on
VMs that convince me to build my first cluster.
At this time I have 4 old identical nodes for testing with 3 HDDs each,
2 network interfaces and running Alma Linux8 (el8). I try to replay the
training session but it fails, breaking the web interface because of
some problems with podman 4.2 not compatible with Octopus.
So I try to deploy Pacific with cephadm tool on my first node (mostha1)
(to enable testing also an upgrade later).
dnf -y install
https://download.ceph.com/rpm-16.2.13/el8/noarch/cephadm-16.2.13-0.el8.noar…
monip=$(getent ahostsv4 mostha1 |head -n 1| awk '{ print $1 }')
cephadm bootstrap --mon-ip $monip --initial-dashboard-password xxxxx \
--initial-dashboard-user admceph \
--allow-fqdn-hostname --cluster-network 10.1.0.0/16
This was sucessfull.
But running "*c**eph orch device ls*" do not show any HDD even if I have
/dev/sda (used by the OS), /dev/sdb and /dev/sdc
The web interface shows a row capacity which is an aggregate of the
sizes of the 3 HDDs for the node.
I've also tried to reset /dev/sdb but cephadm do not see it:
[ceph: root@mostha1 /]# ceph orch device zap
mostha1.legi.grenoble-inp.fr /dev/sdb --force
Error EINVAL: Device path '/dev/sdb' not found on host
'mostha1.legi.grenoble-inp.fr'
On my first attempt with octopus, I was able to list the available HDD
with this command line. Before moving to Pacific, the OS on this node
has been reinstalled from scratch.
Any advices for a CEPH beginner ?
Thanks
Patrick
Hi all,
we seem to have hit a bug in the ceph fs kernel client and I just want to confirm what action to take. We get the error "wrong peer at address" in dmesg and some jobs on that server seem to get stuck in fs access; log extract below. I found these 2 tracker items https://tracker.ceph.com/issues/23883 and https://tracker.ceph.com/issues/41519, which don't seem to have fixes.
My questions:
- Is this harmless or does it indicate invalid/corrupted client cache entries?
- How to resolve, ignore, umount+mount or reboot?
Here an extract from the dmesg log, the error has survived a couple of MDS restarts already:
[Mon Mar 6 12:56:46 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:05:18 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
[Mon Mar 6 13:05:18 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:13:50 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
[Mon Mar 6 13:13:50 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:16:41 2023] libceph: mds1 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Mar 6 13:16:41 2023] libceph: mds1 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Mar 6 13:16:45 2023] ceph: mds1 reconnect start
[Mon Mar 6 13:16:45 2023] ceph: mds1 reconnect start
[Mon Mar 6 13:16:48 2023] ceph: mds1 reconnect success
[Mon Mar 6 13:16:48 2023] ceph: mds1 reconnect success
[Mon Mar 6 13:18:13 2023] ceph: update_snap_trace error -22
[Mon Mar 6 13:18:17 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Mar 6 13:18:17 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Mar 6 13:18:23 2023] ceph: mds1 recovery completed
[Mon Mar 6 13:18:23 2023] ceph: mds1 recovery completed
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect start
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect start
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect success
[Mon Mar 6 13:18:29 2023] ceph: mds7 reconnect success
[Mon Mar 6 13:18:35 2023] ceph: update_snap_trace error -22
[Mon Mar 6 13:18:35 2023] ceph: mds7 recovery completed
[Mon Mar 6 13:18:35 2023] ceph: mds7 recovery completed
[Mon Mar 6 13:22:22 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Mon Mar 6 13:22:22 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:30:54 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[...]
[Thu Mar 9 09:37:24 2023] slurm.epilog.cl (31457): drop_caches: 3
[Thu Mar 9 09:38:26 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:38:26 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 09:46:58 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:46:58 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 09:55:30 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:55:30 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 10:04:02 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 10:04:02 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14