Hi!
I have a problem after starting an upgrade from 15.2.13 to 16.2.4. I started the upgrade and it successfully redeployed 2 of the 3 mgr daemon containers. The third failed to upgrade, and cephadm started retrying the upgrade forever. The only way I could stop this was to disable the cephadm module.
I found out I had an old version of podman installed and upgraded it to one of the suitable versions according to the requirements docs; I now have 3.0.1 installed.
This solved an issue where containers could not be started because a 'get podman version' command failed. (The Go template did not match the output of the older podman version.)
OK, so now it gets a little further in the process, but enabling the cephadm module still makes it retry the above action indefinitely. It now fails with this log:
https://pastebin.com/p3T1fbjs
At first I thought it had something to do with rate limits on docker.io, but it seems I can pull other images without problems. I also set up an account and played around with cephadm registry-login, but did not get much further.
Looking at the pull command in the logs, I see it uses some ID for the container image, which needs to be resolved, I suppose. Could it be making an error here, resulting in a bad URL that hits a resource it is not supposed to hit, and hence in access errors?
Any other thoughts on how to fix this error, or on how to make cephadm stop retrying this action so that I can fix it?
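For reference, the orchestrator can also be told to stop the upgrade, and the pull can be tested by hand (a sketch; the image name assumes the default docker.io/ceph/ceph:v16.2.4 upgrade target):

# Stop the in-progress upgrade so cephadm no longer retries the mgr redeploy
ceph orch upgrade stop

# Check which image cephadm is trying to pull
ceph orch upgrade status

# Try the pull by hand on the affected host to see the raw podman error
podman pull docker.io/ceph/ceph:v16.2.4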
Thanks very much and kind regards,
Samy
Hello Ceph-users,
I've upgraded my Ubuntu server from 18.04.5 LTS to Ubuntu 20.04.2 LTS via 'do-release-upgrade'.
During that process the Ceph packages were upgraded from Luminous to Octopus, and now the ceph-mon daemon (I have only one) won't start. The log error is:
"2021-06-15T20:23:41.843+0000 7fbb55e9b540 -1 mon.target@-1(probing) e2 current monmap has recorded min_mon_release 12 (luminous) is >2 releases older than installed 15 (octopus);
you can only upgrade 2 releases at a time you should first upgrade to 13 (mimic) or 14 (nautilus) stopping."
Is there any way to get the cluster running, or at least to get the data from the OSDs?
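For what it's worth, the recorded min_mon_release can be confirmed offline (a sketch; 'target' is the mon ID from the log above, and the monmap path is an example):

# Stop the mon, then extract and print its monmap
systemctl stop ceph-mon@target
ceph-mon -i target --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap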
I will appreciate any help.
Thank you
--
Best regards,
Petr
Hi
Our first upgrade (non-cephadm) from Octopus to Pacific 16.2.4 went very
smoothly. Thanks for all the effort.
The only thing that has bitten us is
https://tracker.ceph.com/issues/50556
which prevents a multipart
upload to an RGW bucket that has a bucket policy. While I've been able
to rewrite the most urgent scripts to use s3api put-object (which
doesn't do multipart), that only works for objects up to a certain size.
Removing the bucket policies isn't an option.
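For context, the rewritten calls look roughly like this (a sketch with the AWS CLI; endpoint and names are placeholders); a single PUT is limited to 5 GB in S3, hence the size cap:

# Single-request upload: no multipart, so the bucket-policy bug is not hit,
# but the object can be at most 5 GB
aws --endpoint-url https://rgw.example.com s3api put-object \
    --bucket mybucket --key backup.tar --body backup.tar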
I can see that it has been fixed, and is now pending backport
(https://tracker.ceph.com/issues/51001). Will this be included in
16.2.5? And do we have an estimated date for that?
We can wait a little longer, but otherwise I will have to make some more
drastic changes to an application. Having an indication of the date would
help me choose which way to go.
Many thanks, Chris
This is a great start, thank you! Basically I can look through the code to
get the keys I need.
But maybe I'm approaching this task the wrong way? Maybe there's already a
better solution for monitoring cluster health details?
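For example, the missing-key cases described in the quoted message below can be papered over with defaults (a sketch using jq; .health.overall_status as the Jewel-era fallback is my assumption, the other keys are from the quote):

# A missing key (e.g. no read activity) falls back to 0 via jq's // operator
ceph status -f json | jq '.pgmap.read_bytes_sec // 0'

# Tolerate the health key change between releases the same way
ceph status -f json | jq '.health.status // .health.overall_status'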
On Wed, 16 Jun 2021 at 02:47, Anthony D'Atri <anthony.datri(a)gmail.com> wrote:
> Before Luminous, mon clock skew was part of the health status JSON. With
> Luminous and later releases, one has to invoke a separate command to get
> the info.
>
> This is a royal PITA for monitoring / metrics infrastructure and I’ve
> never seen a reason why it was done.
>
> You might find the code here
> https://github.com/digitalocean/ceph_exporter
> useful. Note that there are multiple branches, which can be confusing.
>
> > On Jun 15, 2021, at 4:21 PM, Vladimir Prokofev <v(a)prokofev.me> wrote:
> >
> > Good day.
> >
> > I'm writing some code for parsing output data for monitoring purposes.
> > The data is that of "ceph status -f json", "ceph df -f json", "ceph osd
> > perf -f json" and "ceph osd pool stats -f json".
> > I also need to support all major Ceph releases, from Jewel through
> > Pacific.
> >
> > What I've stumbled upon is that:
> > - keys in JSON output are not present if there's no appropriate data.
> > For example the key ['pgmap', 'read_bytes_sec'] will not be present in
> > "ceph status" output if there's no read activity in the cluster;
> > - some keys changed between versions. For example, the ['health']['status']
> > key is not present in Jewel, but is available in all following versions;
> > vice versa, the ['osdmap', 'osdmap'] key is not present in Pacific, but is in
> > all previous versions.
> >
> > So I need to get a list of all possible keys for all Ceph releases. Any
> > ideas how this can be achieved? My only thought at the moment is to build a
> > "failing" cluster with all the possible states and get reference data out of
> > it. Not only is this tedious work, since it requires each possible cluster
> > version, but it is also prone to error.
> > Is there any publicly available JSON schema for output?
Hello,
I would like to ask about osd_scrub_max_preemptions in 14.2.20 for large
OSDs (mine are 12 TB) and/or large k+m EC pools (mine are 8+2). I searched
the archives of this list, but did not see any reference.
Symptoms:
I have been seeing a behavior in my cluster over the past 2 or 3 weeks
where, for no apparent reason, there are suddenly slow ops, followed by a
brief OSD down, massive but brief degradation/activating/peering, and then
back to normal.
I had thought this might have to do with some backfill activity due to a
recently failed OSD (as in down and out, and the process wouldn't start), but now
all of that is over and the cluster is mostly back to HEALTH_OK.
Thinking this might be something that was introduced between 14.2.9 and
14.2.16, I upgraded to 14.2.20 this morning. However, I just saw the same
kind of event happen twice again. At the time, the only non-client
activity was a single deep-scrub.
Question:
The description for osd_scrub_max_preemptions indicates that a deep scrub
process will allow itself to be preempted a fixed number of times by client
I/O and will then block client I/O until it finishes. Although I don't
fully understand the deep scrub process, it seems that either the size of
the HDD or the k+m count of the EC Pool could affect the time needed to
complete a deep scrub and thus increase the likelihood that more than the
default 5 preemptions will occur.
Please tell me if my understanding is correct. If so, is there any
guideline for increasing osd_scrub_max_preemptions just enough to balance
scrub progress against client responsiveness?
Or perhaps there are other scrub attributes that should be tuned instead?
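If tuning turns out to be the answer, this is how I plan to inspect and change the value at runtime (a sketch; the value 10 is an arbitrary example, not a recommendation):

# Show the value currently applied to OSDs (default is 5)
ceph config get osd osd_scrub_max_preemptions

# Raise it for all OSDs via the central config
ceph config set osd osd_scrub_max_preemptions 10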
Thanks.
-Dave
--
Dave Hall
Binghamton University
kdhall(a)binghamton.edu
I've been working on some improvements to our large cluster's space
balancing, when I noticed that sometimes the OSD maps have strange upmap
entries. Here is an example on a clean cluster (PGs are active+clean):
{
  "pgid": "1.1cb7",
  ...
  "up": [
    891,
    170,
    1338
  ],
  "acting": [
    891,
    170,
    1338
  ],
  ...
},
with an upmap entry:
pg_upmap_items 1.1cb7 [170,891]
This would make the "up" list [ 170, 170, 1338 ], which isn't allowed.
So the cluster just seems to ignore this upmap. When I remove the
upmap, nothing changes in the PG state, and I can even re-insert it
(without any effect). Any ideas why this upmap doesn't simply get
rejected/removed?
However, if I were to insert an upmap [170, 892], it gets rejected
correctly (since 891 and 892 are on the same host, violating CRUSH rules).
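For reference, these are the commands I use to inspect, remove, and re-insert such entries (standard CLI; the PG ID is the one from the example above):

# List all upmap exceptions in the osdmap
ceph osd dump | grep pg_upmap_items

# Remove the entry for this PG
ceph osd rm-pg-upmap-items 1.1cb7

# Re-insert the same mapping pair
ceph osd pg-upmap-items 1.1cb7 170 891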
Any insights would be helpful,
Andras
Hello Ceph-Users,
I've been wondering about the state of OpenStack Keystone Auth in RADOSGW.
1) Even though the general documentation on RADOSGW S3 bucket policies
is a little "misleading"
https://docs.ceph.com/en/latest/radosgw/bucketpolicy/#creation-and-removal
in showing users being referred to as Principal,
the documentation about Keystone integration at
https://docs.ceph.com/en/latest/radosgw/keystone/#integrating-with-openstac…
clearly states that "A Ceph Object Gateway user is mapped into a
Keystone <tenant>".
The Keystone authentication code strictly takes only the project
from the authenticating user:
*
https://github.com/ceph/ceph/blob/6ce6874bae8fbac8921f0bdfc3931371fc61d4ff/…
*
https://github.com/ceph/ceph/blob/6ce6874bae8fbac8921f0bdfc3931371fc61d4ff/…
This is rather unfortunate, as it reduces the usually powerful S3
bucket policies to rather basic ones: access can only be granted to all users
(with a certain role) of a project or, more importantly, to all users of
another project / tenant, as in using
arn:aws:iam::$OS_REMOTE_PROJECT_ID:root
as Principal.
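That is, the broadest grant currently possible looks roughly like this (a sketch; 'mybucket' and the s3:GetObject action are placeholders, and s3cmd is just one way to apply it):

# Grants read access to ALL users of the remote project; substitute
# $OS_REMOTE_PROJECT_ID with the actual project ID before applying
cat > policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": ["arn:aws:iam::$OS_REMOTE_PROJECT_ID:root"]},
    "Action": ["s3:GetObject"],
    "Resource": ["arn:aws:s3:::mybucket/*"]
  }]
}
EOF
s3cmd setpolicy policy.json s3://mybucket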
Or am I misreading something here, or is this really all that can be
done when using native Keystone auth?
2) There is a PR open implementing generic external authentication
https://github.com/ceph/ceph/pull/34093
Apparently this also addresses the lack of support for subusers
with Keystone; if I understand it correctly, I could then grant access
to individual users via
arn:aws:iam::$OS_REMOTE_PROJECT_ID:$user
Are there any plans on the roadmap to extend the functionality with
regard to Keystone as an authentication backend?
I know a similar question has been asked before
(https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/GY7VUKCQ5QU…)
but unfortunately there was no discussion / response then.
Regards
Christian
Hi all,
I know this came up before but I couldn't find a resolution.
We get the error
libceph: monX session lost, hunting for new mon
a lot on our Samba servers that re-export CephFS. "A lot" means more than
once a minute. On other machines that are less busy we get it about
every 10-30 minutes. We use only a single network for both client and
backend traffic, on bonded 10 GbE links.
So, my questions are: is this expected and normal behaviour? And how can I
track this problem down?
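In case it helps with the diagnosis, the data I can easily collect is from the client log and the mon admin socket (a sketch; replace 'a' with an actual mon ID):

# Client side: timestamps and frequency of the session losses
dmesg -T | grep libceph

# Mon side: list the sessions this monitor currently holds
ceph daemon mon.a sessions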
Regards
magnus