I'm currently trying to set up the Ceph dashboard, following the official documentation.
I've managed to log in both by visiting the URL and port directly, and by visiting it through haproxy. However, using haproxy to visit the site results in odd behavior.
On my first login, nothing loads on the page, and after ~5 seconds it times me out, sending me back to the login screen.
After logging back in to the dashboard, everything loads and functions as expected. I can refresh my browser as many times as I want and it keeps working.
After some time, usually ~30 minutes of inactivity, the problem arises again.
Haproxy reports the server as down for about ~10 seconds, and running a simple HTTP check gives the same result as well: CRITICAL - Socket timeout after 10 seconds.
In the ceph-mgr logs there isn't any special error other than: [dashboard ERROR frontend.error] (https://*redacted*/#/login): Http failure response for https://*redacted*/ui-api/orchestrator/get_name: 401 OK None
It seems as if the Ceph dashboard is "overloaded": changing the haproxy config (which follows the official Ceph documentation on how to set it up) to do health checks less often makes the problem happen less often.
Is there anything I might have overlooked that could sort out the issue?
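Not a definitive fix, but for comparison, a minimal haproxy backend sketch, assuming the dashboard listens on port 8443 over SSL; the server names and 192.0.2.x addresses are placeholders. A longer `inter` keeps the health check from probing the mgr too aggressively:

```
backend dashboard_back
    mode http
    option httpchk GET /
    http-check expect status 200
    # probe less aggressively; allow a few failures before marking down
    server mgr1 192.0.2.10:8443 ssl verify none check inter 10s fall 3 rise 2
    # standby mgrs redirect to the active one, so keep them as backup
    server mgr2 192.0.2.11:8443 ssl verify none check inter 10s fall 3 rise 2 backup
```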
Hello Ceph users,
We've been having an issue with RGW for a couple days and we would
appreciate some help, ideas, or guidance to figure out the issue.
We run a multi-site setup which has been working well so far. We
don't actually have data replication enabled yet, only metadata
replication. On the master region we've started to see requests piling up
in the rgw process, leading to very slow operations and failures all over
the place (clients time out before getting responses from rgw). The
workaround for now is to restart the rgw containers regularly.
We made a mistake and forcefully deleted a bucket on a secondary zone;
this might be the trigger, but we are not sure.
Other symptoms include:
* Increased memory usage of the RGW processes (we bumped the container
limits from 4G to 48G to cater for that)
* Lots of read IOPS on the index pool (4 or 5 times more compared to what
we were seeing before)
* The prometheus ceph_rgw_qlen and ceph_rgw_qactive metrics (number of
active requests) seem to show that the number of concurrent requests
increases with time, although we don't see more requests coming in on the
load-balancer side.
The current thought is that the RGW process doesn't close the requests
properly, or that some requests just hang. After a restart of the process
things look OK but the situation turns bad fairly quickly (after 1 hour we
start to see many timeouts).
The rados cluster seems completely healthy, it is also used for rbd
volumes, and we haven't seen any degradation there.
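In case it helps the investigation, a sketch of admin-socket queries that might show whether requests are hanging inside RGW; the daemon name below is a placeholder for your actual rgw instance:

```shell
# Current queue-length / active-request counters on the RGW admin socket
ceph daemon client.rgw.myhost.rgw0 perf dump | grep -E 'qlen|qactive'
# Outstanding RADOS operations issued by this RGW
# (long-stuck index-pool ops would show up here)
ceph daemon client.rgw.myhost.rgw0 objecter_requests
```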
Has anyone experienced that kind of issue? Anything we should be looking at?
Thanks for your help!
Gauvain
Hi,
Our cluster is currently running quincy, and I want to set the minimal
client version to luminous, to enable upmap balancer, but when I tried to,
I got this:
# ceph osd set-require-min-compat-client luminous
Error EPERM: cannot set require_min_compat_client to luminous: 2 connected
client(s) look like jewel (missing 0x800000000000000); add
--yes-i-really-mean-it to do it anyway
I think I know the most likely candidate (and I've asked them), but is
there a way to find out, the way ceph seems to know?
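For what it's worth, Ceph can report the feature bits of connected clients; a sketch using commands available in recent releases:

```shell
# Summarize feature bits of all connected clients, grouped by release
ceph features
# List mon sessions, including client addresses and feature bits,
# to map the jewel-looking clients to IPs
ceph tell mon.\* sessions
```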
tnx
/Simon
--
I'm using my gmail.com address because the gmail.com DMARC policy is
"none". Some mail servers will reject this (Microsoft?); others will instead
allow it when I send mail to a mailing list which has not yet been
configured to send mail "on behalf of" the sender, but rather does a kind of
"forward". The latter situation causes DKIM/DMARC failures and the DMARC
policy will be applied. See https://wiki.list.org/DEV/DMARC for more details.
Is it possible to configure Ceph so that STS AssumeRoleWithWebIdentity
works with a Kubernetes serviceaccount token?
My goal is that a pod running in a Kubernetes cluster can call
AssumeRoleWithWebIdentity, specifying an IAM role (previously created in
Ceph) and the Kubernetes OIDC service account token, and get back a valid
access key and secret. This would then be used to access objects in
buckets hosted by Ceph object storage. This would allow our code to run
unchanged between the cloud (S3) and on premise (Ceph providing object
storage).
Original AWS document is here -
https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-…
Minio implementation is here -
https://min.io/docs/minio/kubernetes/upstream/developers/sts-for-operator.h…
Kubernetes OIDC endpoints (Service account issuer discovery) discussed
here -
https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-…
I have set up a Ceph role that specifies an OIDC URL pointing to the
Kubernetes API server and passes the service account token. But I believe I
still need to enable STS in Ceph and have Ceph talk to the Kubernetes OIDC
endpoint. Before continuing, though, I am wondering whether this setup is supported?
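I believe STS also has to be enabled explicitly on the RGW side; a hedged sketch using the documented RGW STS options, where the key value and the `client.rgw` target are placeholders:

```shell
# Enable the STS auth engine on RGW
ceph config set client.rgw rgw_s3_auth_use_sts true
# Set the STS encryption key (a 16-character hex string of your own)
ceph config set client.rgw rgw_sts_key abcdefghijklmnop
# Restart the RGW daemons afterwards so the options take effect
```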
Thanks,
Charlie
We're happy to announce the first backport release in the Reef series,
and the first with Debian packages, for Debian Bookworm.
We recommend all users update to this release.
https://ceph.io/en/news/blog/2023/v18-2-1-reef-released/
Notable Changes
---------------
* RGW: S3 multipart uploads using Server-Side Encryption now replicate
  correctly in multi-site. Previously, the replicas of such objects were
  corrupted on decryption. A new tool, ``radosgw-admin bucket resync
  encrypted multipart``, can be used to identify these original multipart
  uploads. The ``LastModified`` timestamp of any identified object is
  incremented by 1ns to cause peer zones to replicate it again. For
  multi-site deployments that make any use of Server-Side Encryption, we
  recommend running this command against every bucket in every zone after
  all zones have upgraded.
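For example, one way to sweep every bucket in a zone with the new tool might be the following untested helper loop (assumes `jq` is available):

```shell
# Iterate over all bucket names in the zone and run the resync check on each
for b in $(radosgw-admin bucket list | jq -r '.[]'); do
  radosgw-admin bucket resync encrypted multipart --bucket="$b"
done
```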
* CEPHFS: The MDS now evicts clients which are not advancing their request
  tids, which would otherwise cause a large buildup of session metadata,
  resulting in the MDS going read-only due to the RADOS operation exceeding
  the size threshold. The `mds_session_metadata_threshold` config controls
  the maximum size that (encoded) session metadata can grow to.
* RGW: New tools have been added to radosgw-admin for identifying and
correcting issues with versioned bucket indexes. Historical bugs with the
versioned bucket index transaction workflow made it possible for the index
to accumulate extraneous "book-keeping" olh entries and plain placeholder
entries. In some specific scenarios where clients made concurrent requests
referencing the same object key, it was likely that a lot of extra index
entries would accumulate. When a significant number of these entries are
present in a single bucket index shard, they can cause high bucket listing
latencies and lifecycle processing failures. To check whether a versioned
bucket has unnecessary olh entries, users can now run ``radosgw-admin
bucket check olh``. If the ``--fix`` flag is used, the extra entries will
be safely removed. A distinct issue from the one described thus far, it is
also possible that some versioned buckets are maintaining extra unlinked
objects that are not listable from the S3/Swift APIs. These extra objects
are typically a result of PUT requests that exited abnormally, in the middle
of a bucket index transaction - so the client would not have received a
successful response. Bugs in prior releases made these unlinked objects easy
to reproduce with any PUT request that was made on a bucket that was actively
resharding. Besides the extra space that these hidden, unlinked objects
consume, there can be another side effect in certain scenarios, caused by
the nature of the failure mode that produced them, where a client of a bucket
that was a victim of this bug may find the object associated with the key to
be in an inconsistent state. To check whether a versioned bucket has unlinked
entries, users can now run ``radosgw-admin bucket check unlinked``. If the
``--fix`` flag is used, the unlinked objects will be safely removed. Finally,
a third issue made it possible for versioned bucket index stats to be
accounted inaccurately. The tooling for recalculating versioned bucket stats
also had a bug, and was not previously capable of fixing these inaccuracies.
This release resolves those issues and users can now expect that the existing
``radosgw-admin bucket check`` command will produce correct results. We
recommend that users with versioned buckets, especially those that existed
on prior releases, use these new tools to check whether their buckets are
affected and to clean them up accordingly.
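For example, checking and then fixing a single bucket with the tools named above might look like this (the bucket name is a placeholder):

```shell
# Dry-run check for extraneous olh entries, then remove them
radosgw-admin bucket check olh --bucket=mybucket
radosgw-admin bucket check olh --bucket=mybucket --fix
# Same workflow for unlinked objects
radosgw-admin bucket check unlinked --bucket=mybucket
radosgw-admin bucket check unlinked --bucket=mybucket --fix
```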
* mgr/snap-schedule: For clusters with multiple CephFS file systems, all the
snap-schedule commands now expect the '--fs' argument.
* RADOS: A POOL_APP_NOT_ENABLED health warning will now be reported if
  an application is not enabled for a pool, irrespective of whether
  the pool is in use or not. Always add an ``application`` label to a pool
  to avoid the POOL_APP_NOT_ENABLED health warning for that pool.
  The user can temporarily mute this warning using
  ``ceph health mute POOL_APP_NOT_ENABLED``.
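For example (the pool and application names are placeholders):

```shell
# Tag a pool with the application that uses it to clear the warning
ceph osd pool application enable mypool rbd
# Or mute the warning temporarily
ceph health mute POOL_APP_NOT_ENABLED
```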
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at https://download.ceph.com/tarballs/ceph-18.2.1.tar.gz
* Containers at https://quay.io/repository/ceph/ceph
* For packages, see https://docs.ceph.com/en/latest/install/get-packages/
* Release git sha1: 7fe91d5d5842e04be3b4f514d6dd990c54b29c76
Hi, thank you for the guidance.
There is no ability to change the global image before launching; I need to download the images from a private registry during the initial setup.
I used the --image option, but it did not work:
# cephadm bootstrap --image rgistry.test/ceph/ceph:v18 --mon-ip 192.168.0.160 --initial-dashboard-password P@ssw0rd --dashboard-password-noupdate --allow-fqdn-hostname --ssh-user cephadmin
usage: cephadm [-h] [--image IMAGE] [--docker] [--data-dir DATA_DIR]
[--log-dir LOG_DIR] [--logrotate-dir LOGROTATE_DIR]
[--sysctl-dir SYSCTL_DIR] [--unit-dir UNIT_DIR] [--verbose]
[--timeout TIMEOUT] [--retry RETRY] [--env ENV]
[--no-container-init] [--no-cgroups-split]
{version,pull,inspect-image,ls,list-networks,adopt,rm-daemon,rm-cluster,run,shell,enter,ceph-volume,zap-osds,unit,logs,bootstrap,deploy,_orch,check-host,prepare-host,add-repo,rm-repo,install,registry-login,gather-facts,host-maintenance,agent,disk-rescan}
...
cephadm: error: unrecognized arguments: --image rgistry.test/ceph/ceph:v18
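As the usage line above shows, `--image` is a global cephadm option, so it has to come before the `bootstrap` subcommand; a sketch with the same flags (registry name kept exactly as typed):

```shell
cephadm --image rgistry.test/ceph/ceph:v18 bootstrap \
    --mon-ip 192.168.0.160 \
    --initial-dashboard-password P@ssw0rd \
    --dashboard-password-noupdate \
    --allow-fqdn-hostname \
    --ssh-user cephadmin
```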
I also used `cephadm registry-login`, and it showed I was logged in, but when I bootstrapped the first node it still tried to download the image from the Quay registry:
cephadm bootstrap --mon-ip 192.168.0.160 --registry-json /root/mylogin.json --initial-dashboard-password P@ssw0rd --dashboard-password-noupdate --allow-fqdn-hostname --ssh-user cephadmin
Creating directory /etc/ceph for ceph.conf
Verifying ssh connectivity using standard pubkey authentication ...
Adding key to cephadmin@localhost authorized_keys...
Verifying podman|docker is present...
Verifying lvm2 is present...
Verifying time synchronization is in place...
Unit chronyd.service is enabled and running
Repeating the final host check...
docker (/usr/bin/docker) is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
Cluster fsid: 3c00e38c-9e2e-11ee-95cd-000c29e9f44e
Verifying IP 192.168.0.160 port 3300 ...
Verifying IP 192.168.0.160 port 6789 ...
Mon IP `192.168.0.160` is in CIDR network `192.168.0.0/24`
Mon IP `192.168.0.160` is in CIDR network `192.168.0.0/24`
Internal network (--cluster-network) has not been provided, OSD replication will default to the public_network
Pulling custom registry login info from /root/mylogin.json.
Logging into custom registry.
Pulling container image quay.io/ceph/ceph:v18...
Previously I edited the cephadm script directly, but now the file is compiled and can't be edited, and I don't know how to fix this :(((
(Zac Dover) "make check" fails now even for some docs builds. For example:
https://github.com/ceph/ceph/pull/54970, which is a simple edit of
reStructuredText in doc/radosgw/compression.rst. Greg Farnum and Dan Mick
have already done preliminary investigation of this matter here:
https://ceph-storage.slack.com/archives/C1HFJ4VTN/p1703048785756359.
- Follow Slack thread for updates; we'll continue looking into it
Still 38 PRs to scrub for 16.2.15:
https://github.com/ceph/ceph/pulls?q=is%3Aopen+is%3Apr+milestone%3Apacific
- Looking for PRs that are necessary in the release, as well as
non-trivial PRs that have been open for a while
--
Laura Flores
She/Her/Hers
Software Engineer, Ceph Storage <https://ceph.io>
Chicago, IL
lflores(a)ibm.com | lflores(a)redhat.com <lflores(a)redhat.com>
M: +17087388804
Ceph version is Pacific (16.2.14), upgraded from a sloppy Octopus.
I ran afoul of all the best bugs in Octopus, and in the process
switched on a lot of stuff better left alone, including some detailed
debug logging. Now I can't turn it off.
I am confidently informed by the documentation that the first step
would be the command:
ceph daemon osd.1 config show | less
But instead of config information I get back:
Can't get admin socket path: unable to get conf option admin_socket for
osd: b"error parsing 'osd': expected string of the form TYPE.ID, valid
types are: auth, mon, osd, mds, mgr, client\n"
Which seems to be kind of insane.
Attempting to get daemon config info on a monitor on that machine
gives:
admin_socket: exception getting command descriptions: [Errno 2] No such
file or directory
Which doesn't help either.
Anyone got an idea?
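Not a definitive answer, but a couple of hedged alternatives, assuming the debug settings were applied via `ceph config set` or injectargs (the daemon names are examples):

```shell
# The admin socket is local: run this on the host where osd.1 actually runs
ceph daemon osd.1 config show | less
# Or query the daemon remotely, without the admin socket
ceph tell osd.1 config show
# Revert detailed debug logging to defaults via the centralized config
ceph config rm osd debug_osd
ceph config rm osd debug_ms
```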