I'm currently trying to set up the Ceph dashboard, following the official documentation.
I've managed to log in both by visiting the URL and port directly, and by visiting it through haproxy. However, using haproxy to visit the site results in odd behavior.
On my first login, nothing loads on the page, and after ~5 seconds it times me out, sending me back to the login screen.
After logging back in to the dashboard, everything loads and functions as expected. I can refresh my browser as many times as I want and it keeps working.
After some time, usually ~30 minutes of inactivity, the problem arises again.
Haproxy reports the server as down for about ~10 seconds, and running a simple HTTP check gives the same result as well: CRITICAL - Socket timeout after 10 seconds.
In the ceph-mgr logs there isn't any special error other than: [dashboard ERROR frontend.error] (https://*redacted*/#/login): Http failure response for https://*redacted*/ui-api/orchestrator/get_name: 401 OK None
It seems as if the Ceph dashboard is "overloaded": changing the haproxy config (which follows the official Ceph documentation on how to set it up) to do health checks less often makes the problem happen less often.
Is there anything I might have overlooked that could sort out the issue?
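Not a definitive fix, but for comparison, a minimal haproxy backend sketch, assuming the dashboard listens on port 8443 over SSL; the server names and 192.0.2.x addresses are placeholders. A longer `inter` keeps the health check from probing the mgr too aggressively:

```
backend dashboard_back
    mode http
    option httpchk GET /
    http-check expect status 200
    # probe less aggressively; allow a few failures before marking down
    server mgr1 192.0.2.10:8443 ssl verify none check inter 10s fall 3 rise 2
    # standby mgrs redirect to the active one, so keep them as backup
    server mgr2 192.0.2.11:8443 ssl verify none check inter 10s fall 3 rise 2 backup
```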
Hello Ceph users,
We've been having an issue with RGW for a couple days and we would
appreciate some help, ideas, or guidance to figure out the issue.
We run a multi-site setup which has been working well so far. We
don't actually have data replication enabled yet, only metadata
replication. On the master region we've started to see requests piling up
in the rgw process, leading to very slow operations and failures all over
the place (clients time out before getting responses from rgw). The
workaround for now is to restart the rgw containers regularly.
We made a mistake and forcefully deleted a bucket on a secondary zone;
this might be the trigger, but we are not sure.
Other symptoms include:
* Increased memory usage of the RGW processes (we bumped the container
limits from 4G to 48G to cater for that)
* Lots of read IOPS on the index pool (4 or 5 times more compared to what
we were seeing before)
* The prometheus ceph_rgw_qlen and ceph_rgw_qactive metrics (number of
active requests) seem to show that the number of concurrent requests
increases with time, although we don't see more requests coming in on the
load-balancer side.
The current thought is that the RGW process doesn't close the requests
properly, or that some requests just hang. After a restart of the process
things look OK but the situation turns bad fairly quickly (after 1 hour we
start to see many timeouts).
The rados cluster seems completely healthy, it is also used for rbd
volumes, and we haven't seen any degradation there.
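In case it helps the investigation, a sketch of admin-socket queries that might show whether requests are hanging inside RGW; the daemon name below is a placeholder for your actual rgw instance:

```shell
# Current queue-length / active-request counters on the RGW admin socket
ceph daemon client.rgw.myhost.rgw0 perf dump | grep -E 'qlen|qactive'
# Outstanding RADOS operations issued by this RGW
# (long-stuck index-pool ops would show up here)
ceph daemon client.rgw.myhost.rgw0 objecter_requests
```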
Has anyone experienced that kind of issue? Anything we should be looking at?
Thanks for your help!
Gauvain
Hi,
Our cluster is currently running quincy, and I want to set the minimal
client version to luminous, to enable upmap balancer, but when I tried to,
I got this:
# ceph osd set-require-min-compat-client luminous
Error EPERM: cannot set require_min_compat_client to luminous: 2 connected
client(s) look like jewel (missing 0x800000000000000); add
--yes-i-really-mean-it to do it anyway
I think I know the most likely candidate (and I've asked them), but is
there a way to find out, the way ceph seems to know?
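For what it's worth, Ceph can report the feature bits of connected clients; a sketch using commands available in recent releases:

```shell
# Summarize feature bits of all connected clients, grouped by release
ceph features
# List mon sessions, including client addresses and feature bits,
# to map the jewel-looking clients to IPs
ceph tell mon.\* sessions
```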
tnx
/Simon
--
I'm using my gmail.com address because the gmail.com DMARC policy is
"none". Some mail servers will reject this (Microsoft?); others will instead
allow it when I send mail to a mailing list which has not yet been
configured to send mail "on behalf of" the sender, but rather does a kind of
"forward". The latter situation causes DKIM/DMARC failures and the DMARC
policy will be applied. See https://wiki.list.org/DEV/DMARC for more details.
Is it possible to configure Ceph so that STS AssumeRoleWithWebIdentity
works with a Kubernetes serviceaccount token?
My goal is that a pod running in a Kubernetes cluster can call
AssumeRoleWithWebIdentity, specifying an IAM role (previously created in
Ceph) and the Kubernetes OIDC service account token, and get back a valid
access key and secret. This would then be used to access objects in
buckets hosted by Ceph object storage. This would allow our code to run
unchanged between the cloud (S3) and on premise (Ceph providing object
storage).
Original AWS document is here -
https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-…
Minio implementation is here -
https://min.io/docs/minio/kubernetes/upstream/developers/sts-for-operator.h…
Kubernetes OIDC endpoints (Service account issuer discovery) discussed
here -
https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-…
I have set up a Ceph role that specifies an OIDC URL pointing to the
Kubernetes API server and passes the service account token. But I believe I
still need to enable STS in Ceph and have Ceph talk to the Kubernetes OIDC
endpoint. Before continuing, though, I am wondering whether this setup is supported?
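I believe STS also has to be enabled explicitly on the RGW side; a hedged sketch using the documented RGW STS options, where the key value and the `client.rgw` target are placeholders:

```shell
# Enable the STS auth engine on RGW
ceph config set client.rgw rgw_s3_auth_use_sts true
# Set the STS encryption key (a 16-character hex string of your own)
ceph config set client.rgw rgw_sts_key abcdefghijklmnop
# Restart the RGW daemons afterwards so the options take effect
```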
Thanks,
Charlie
We're happy to announce the first backport release in the Reef series,
and the first with Debian packages, for Debian Bookworm.
We recommend all users update to this release.
https://ceph.io/en/news/blog/2023/v18-2-1-reef-released/
Notable Changes
---------------
* RGW: S3 multipart uploads using Server-Side Encryption now replicate
  correctly in multi-site. Previously, the replicas of such objects were
  corrupted on decryption. A new tool, ``radosgw-admin bucket resync
  encrypted multipart``, can be used to identify these original multipart
  uploads. The ``LastModified`` timestamp of any identified object is
  incremented by 1ns to cause peer zones to replicate it again. For
  multi-site deployments that make any use of Server-Side Encryption, we
  recommend running this command against every bucket in every zone after
  all zones have upgraded.
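For example, one way to sweep every bucket in a zone with the new tool might be the following untested helper loop (assumes `jq` is available):

```shell
# Iterate over all bucket names in the zone and run the resync check on each
for b in $(radosgw-admin bucket list | jq -r '.[]'); do
  radosgw-admin bucket resync encrypted multipart --bucket="$b"
done
```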
* CEPHFS: The MDS now evicts clients which are not advancing their request
  tids, which would otherwise cause a large buildup of session metadata,
  resulting in the MDS going read-only due to the RADOS operation exceeding
  the size threshold. The `mds_session_metadata_threshold` config controls
  the maximum size that (encoded) session metadata can grow to.
* RGW: New tools have been added to radosgw-admin for identifying and
correcting issues with versioned bucket indexes. Historical bugs with the
versioned bucket index transaction workflow made it possible for the index
to accumulate extraneous "book-keeping" olh entries and plain placeholder
entries. In some specific scenarios where clients made concurrent requests
referencing the same object key, it was likely that a lot of extra index
entries would accumulate. When a significant number of these entries are
present in a single bucket index shard, they can cause high bucket listing
latencies and lifecycle processing failures. To check whether a versioned
bucket has unnecessary olh entries, users can now run ``radosgw-admin
bucket check olh``. If the ``--fix`` flag is used, the extra entries will
be safely removed. A distinct issue from the one described thus far, it is
also possible that some versioned buckets are maintaining extra unlinked
objects that are not listable from the S3/Swift APIs. These extra objects
are typically a result of PUT requests that exited abnormally, in the middle
of a bucket index transaction - so the client would not have received a
successful response. Bugs in prior releases made these unlinked objects easy
to reproduce with any PUT request that was made on a bucket that was actively
resharding. Besides the extra space that these hidden, unlinked objects
consume, there can be another side effect in certain scenarios, caused by
the nature of the failure mode that produced them, where a client of a bucket
that was a victim of this bug may find the object associated with the key to
be in an inconsistent state. To check whether a versioned bucket has unlinked
entries, users can now run ``radosgw-admin bucket check unlinked``. If the
``--fix`` flag is used, the unlinked objects will be safely removed. Finally,
a third issue made it possible for versioned bucket index stats to be
accounted inaccurately. The tooling for recalculating versioned bucket stats
also had a bug, and was not previously capable of fixing these inaccuracies.
This release resolves those issues and users can now expect that the existing
``radosgw-admin bucket check`` command will produce correct results. We
recommend that users with versioned buckets, especially those that existed
on prior releases, use these new tools to check whether their buckets are
affected and to clean them up accordingly.
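For example, checking and then fixing a single bucket with the tools named above might look like this (the bucket name is a placeholder):

```shell
# Dry-run check for extraneous olh entries, then remove them
radosgw-admin bucket check olh --bucket=mybucket
radosgw-admin bucket check olh --bucket=mybucket --fix
# Same workflow for unlinked objects
radosgw-admin bucket check unlinked --bucket=mybucket
radosgw-admin bucket check unlinked --bucket=mybucket --fix
```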
* mgr/snap-schedule: For clusters with multiple CephFS file systems, all the
snap-schedule commands now expect the '--fs' argument.
* RADOS: A POOL_APP_NOT_ENABLED health warning will now be reported if
  an application is not enabled for a pool, irrespective of whether
  the pool is in use or not. Always add an ``application`` label to a pool
  to avoid the POOL_APP_NOT_ENABLED health warning for that pool.
  The user can temporarily mute this warning using
  ``ceph health mute POOL_APP_NOT_ENABLED``.
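For example (the pool and application names are placeholders):

```shell
# Tag a pool with the application that uses it to clear the warning
ceph osd pool application enable mypool rbd
# Or mute the warning temporarily
ceph health mute POOL_APP_NOT_ENABLED
```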
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at https://download.ceph.com/tarballs/ceph-18.2.1.tar.gz
* Containers at https://quay.io/repository/ceph/ceph
* For packages, see https://docs.ceph.com/en/latest/install/get-packages/
* Release git sha1: 7fe91d5d5842e04be3b4f514d6dd990c54b29c76
Hi, thank you for the guidance.
There is no ability to change the global image before launching; I need to download the images from a private registry during the initial setup.
I used the --image option, but it did not work:
# cephadm bootstrap --image rgistry.test/ceph/ceph:v18 --mon-ip 192.168.0.160 --initial-dashboard-password P@ssw0rd --dashboard-password-noupdate --allow-fqdn-hostname --ssh-user cephadmin
usage: cephadm [-h] [--image IMAGE] [--docker] [--data-dir DATA_DIR]
[--log-dir LOG_DIR] [--logrotate-dir LOGROTATE_DIR]
[--sysctl-dir SYSCTL_DIR] [--unit-dir UNIT_DIR] [--verbose]
[--timeout TIMEOUT] [--retry RETRY] [--env ENV]
[--no-container-init] [--no-cgroups-split]
{version,pull,inspect-image,ls,list-networks,adopt,rm-daemon,rm-cluster,run,shell,enter,ceph-volume,zap-osds,unit,logs,bootstrap,deploy,_orch,check-host,prepare-host,add-repo,rm-repo,install,registry-login,gather-facts,host-maintenance,agent,disk-rescan}
...
cephadm: error: unrecognized arguments: --image rgistry.test/ceph/ceph:v18
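As the usage line above shows, `--image` is a global cephadm option, so it has to come before the `bootstrap` subcommand; a sketch with the same flags (registry name kept exactly as typed):

```shell
cephadm --image rgistry.test/ceph/ceph:v18 bootstrap \
    --mon-ip 192.168.0.160 \
    --initial-dashboard-password P@ssw0rd \
    --dashboard-password-noupdate \
    --allow-fqdn-hostname \
    --ssh-user cephadmin
```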
I also used `cephadm registry-login`, and it showed I was logged in, but when I bootstrapped the first node it still tried to download the image from the Quay registry:
cephadm bootstrap --mon-ip 192.168.0.160 --registry-json /root/mylogin.json --initial-dashboard-password P@ssw0rd --dashboard-password-noupdate --allow-fqdn-hostname --ssh-user cephadmin
Creating directory /etc/ceph for ceph.conf
Verifying ssh connectivity using standard pubkey authentication ...
Adding key to cephadmin@localhost authorized_keys...
Verifying podman|docker is present...
Verifying lvm2 is present...
Verifying time synchronization is in place...
Unit chronyd.service is enabled and running
Repeating the final host check...
docker (/usr/bin/docker) is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
Cluster fsid: 3c00e38c-9e2e-11ee-95cd-000c29e9f44e
Verifying IP 192.168.0.160 port 3300 ...
Verifying IP 192.168.0.160 port 6789 ...
Mon IP `192.168.0.160` is in CIDR network `192.168.0.0/24`
Mon IP `192.168.0.160` is in CIDR network `192.168.0.0/24`
Internal network (--cluster-network) has not been provided, OSD replication will default to the public_network
Pulling custom registry login info from /root/mylogin.json.
Logging into custom registry.
Pulling container image quay.io/ceph/ceph:v18...
Previously I edited the cephadm script directly, but now the file is compiled and can't be edited, and I don't know how to fix this :(((
(Zac Dover) "make check" fails now even for some docs builds. For example:
https://github.com/ceph/ceph/pull/54970, which is a simple edit of
reStructuredText in doc/radosgw/compression.rst. Greg Farnum and Dan Mick
have already done preliminary investigation of this matter here:
https://ceph-storage.slack.com/archives/C1HFJ4VTN/p1703048785756359.
- Follow Slack thread for updates; we'll continue looking into it
Still 38 PRs to scrub for 16.2.15:
https://github.com/ceph/ceph/pulls?q=is%3Aopen+is%3Apr+milestone%3Apacific
- Looking for PRs that are necessary in the release, as well as
non-trivial PRs that have been open for a while
--
Laura Flores
She/Her/Hers
Software Engineer, Ceph Storage <https://ceph.io>
Chicago, IL
lflores(a)ibm.com | lflores(a)redhat.com <lflores(a)redhat.com>
M: +17087388804
Ceph version is Pacific (16.2.14), upgraded from a sloppy Octopus.
I ran afoul of all the best bugs in Octopus, and in the process
switched on a lot of stuff better left alone, including some detailed
debug logging. Now I can't turn it off.
I am confidently informed by the documentation that the first step
would be the command:
ceph daemon osd.1 config show | less
But instead of config information I get back:
Can't get admin socket path: unable to get conf option admin_socket for
osd: b"error parsing 'osd': expected string of the form TYPE.ID, valid
types are: auth, mon, osd, mds, mgr, client\n"
Which seems to be kind of insane.
Attempting to get daemon config info on a monitor on that machine
gives:
admin_socket: exception getting command descriptions: [Errno 2] No such
file or directory
Which doesn't help either.
Anyone got an idea?
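Not a definitive answer, but a couple of hedged alternatives, assuming the debug settings were applied via `ceph config set` or injectargs (the daemon names are examples):

```shell
# The admin socket is local: run this on the host where osd.1 actually runs
ceph daemon osd.1 config show | less
# Or query the daemon remotely, without the admin socket
ceph tell osd.1 config show
# Revert detailed debug logging to defaults via the centralized config
ceph config rm osd debug_osd
ceph config rm osd debug_ms
```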