Hi all,
Here are this week's notes from the CLT:
* Collective review of the Reef/Squid "State of Cephalopod" slides.
* The smoke test suite had been unscheduled, but it's back on now.
* Releases:
* 17.2.7: the build was about to start last week but was delayed by a few
issues (https://tracker.ceph.com/issues/63257,
https://tracker.ceph.com/issues/63305,
https://github.com/ceph/ceph/pull/54169). ceph_exporter test coverage
will be prioritized.
* 18.2.1: all PRs in testing or merged.
* The Ceph Board approved a new Foundation membership tier model: Silver,
Gold, Platinum, Diamond. Implementation is being worked on with the LF.
-- dan
Hi all,
I have a case where I want to set options for a set of HDDs under a common sub-tree with root A. I also have HDDs in another, disjoint sub-tree with root B. I would therefore like to do something like
ceph config set osd/class:hdd,datacenter:A option value
The above does not give a syntax error, but I'm not sure it does the right thing either. Does it mean "device class hdd AND datacenter A", or does it mean "for OSDs with device class 'hdd,datacenter:A'"?
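I could probably check empirically with something like the following (osd_max_backfills is just an example option; osd.12 and osd.40 are placeholders for an hdd OSD under A and one under B), but I'd like to know the intended semantics:
ceph config set osd/class:hdd,datacenter:A osd_max_backfills 2
ceph config get osd.12 osd_max_backfills    # hdd OSD under A
ceph config get osd.40 osd_max_backfills    # hdd OSD under B, for comparison
ceph config dump | grep osd_max_backfills   # shows the mask the mons stored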
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hi,
I'm trying to use the rgw mgr module to configure RGWs. Unfortunately it
is not present in the 'ceph mgr module ls' output, and any attempt to
enable it says that one mgr doesn't support it and that --force should
be added. Adding --force does enable it.
This is strange, as it is a brand-new cluster created on Quincy with
cephadm. Why is --force needed? And even though the module is now listed
as enabled, the 'ceph rgw' command is not recognized and no help is
available for the rgw subcommand. What are we doing wrong?
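For reference, this is roughly what we ran (reconstructed from memory):
ceph mgr module ls | grep -i rgw      # nothing listed
ceph mgr module enable rgw            # refuses, suggests adding --force
ceph mgr module enable rgw --force    # reports success
ceph rgw -h                           # rgw subcommand still not recognized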
Cheers,
Michel
I'm fairly new to the community so I figured I'd ask about this here before creating an issue - I'm not sure how supported this config is.
I am running Rook v1.12.6 and Ceph 18.2.0. I've enabled the dashboard in the CRD and it has been working for a while. However, the charts are empty.
I do have Prometheus + Grafana running on my cluster, and I can access many of the Ceph metrics from there. With the upgrade to Reef I noticed that many of the Quincy dashboard elements have been replaced by charts, so I wanted to get those working.
I discovered that if I run 'ceph dashboard set-prometheus-api-host <url>', the charts are immediately populated (including historical data). However, when I do this I rapidly start getting Ceph health alerts due to a crashing mgr module. If I set the Prometheus API host URL back to '', the crashes stop accumulating, though this disables the charts.
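For reference, the commands boil down to this (the URL here is a placeholder for our in-cluster Prometheus service and will differ in other setups):
ceph dashboard set-prometheus-api-host 'http://prometheus-server.monitoring.svc:80'   # charts populate, mgr module starts crashing
ceph dashboard set-prometheus-api-host ''                                             # crashes stop, charts go empty again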
I am running the prometheus-community/prometheus-25.2.0 chart. Various published Ceph Grafana dashboards that I've found work fine against it.
The following are relevant dumps. Please let me know if you have any ideas, or if I should go ahead and create an issue for this...
mgr console output during crash:
debug 2023-10-24T15:11:23.498+0000 7fc81fa5d700 -1 mgr.server reply reply (2) No such file or directory This Orchestrator does not support `orch prometheus access info`
debug 2023-10-24T15:11:23.502+0000 7fc7ea3f3700 0 [dashboard INFO request] [::ffff:10.1.0.106:49760] [GET] [200] [0.012s] [admin] [101.0B] /api/health/get_cluster_capacity
debug 2023-10-24T15:11:23.502+0000 7fc813985700 0 [stats WARNING root] cmdtag not found in client metadata
debug 2023-10-24T15:11:23.502+0000 7fc813985700 0 [stats WARNING root] cmdtag not found in client metadata
debug 2023-10-24T15:11:23.502+0000 7fc7e83ef700 0 [dashboard INFO request] [::ffff:10.1.0.106:5580] [GET] [200] [0.011s] [admin] [73.0B] /api/osd/settings
debug 2023-10-24T15:11:23.506+0000 7fc85411a700 0 log_channel(audit) log [DBG] : from='mon.2 -' entity='mon.' cmd=[{"prefix": "balancer status", "format": "json"}]: dispatch
debug 2023-10-24T15:11:23.506+0000 7fc813985700 0 [stats WARNING root] cmdtag not found in client metadata
debug 2023-10-24T15:11:23.506+0000 7fc7e9bf2700 0 [dashboard INFO request] [::ffff:10.1.0.106:20241] [GET] [200] [0.014s] [admin] [34.0B] /api/prometheus/rules
debug 2023-10-24T15:11:23.630+0000 7fc7ebbf6700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-10-24T15:11:23.734+0000 7fc7ebbf6700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-10-24T15:11:23.802+0000 7fc86511c700 0 log_channel(cluster) log [DBG] : pgmap v126: 617 pgs: 53 active+remapped+backfill_wait, 2 active+remapped+backfilling, 562 active+clean; 34 TiB data, 68 TiB used, 64 TiB / 132 TiB avail; 2.4 MiB/s rd, 93 KiB/s wr, 21 op/s; 1213586/22505781 objects misplaced (5.392%)
debug 2023-10-24T15:11:23.862+0000 7fc7ebbf6700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-10-24T15:11:23.962+0000 7fc7ebbf6700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-10-24T15:11:24.058+0000 7fc7ebbf6700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-10-24T15:11:24.158+0000 7fc7ebbf6700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-10-24T15:11:24.270+0000 7fc7ebbf6700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-10-24T15:11:24.546+0000 7fc7ebbf6700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-10-24T15:11:24.654+0000 7fc7ebbf6700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-10-24T15:11:24.654+0000 7fc7ebbf6700 0 [dashboard INFO request] [::ffff:10.1.0.106:13711] [GET] [200] [1.170s] [admin] [3.2K] /api/health/minimal
debug 2023-10-24T15:11:25.802+0000 7fc86511c700 0 log_channel(cluster) log [DBG] : pgmap v127: 617 pgs: 53 active+remapped+backfill_wait, 2 active+remapped+backfilling, 562 active+clean; 34 TiB data, 68 TiB used, 64 TiB / 132 TiB avail; 1.1 MiB/s rd, 53 KiB/s wr, 17 op/s; 1213586/22505781 objects misplaced (5.392%)
debug 2023-10-24T15:11:27.802+0000 7fc86511c700 0 log_channel(cluster) log [DBG] : pgmap v128: 617 pgs: 53 active+remapped+backfill_wait, 2 active+remapped+backfilling, 562 active+clean; 34 TiB data, 68 TiB used, 64 TiB / 132 TiB avail; 1.8 MiB/s rd, 58 KiB/s wr, 18 op/s; 1213586/22505781 objects misplaced (5.392%)
debug 2023-10-24T15:11:28.494+0000 7fc813985700 0 [stats WARNING root] cmdtag not found in client metadata
debug 2023-10-24T15:11:28.498+0000 7fc7eb3f5700 0 [dashboard INFO request] [::ffff:10.1.0.106:20241] [GET] [200] [0.011s] [admin] [73.0B] /api/osd/settings
debug 2023-10-24T15:11:28.498+0000 7fc85411a700 0 log_channel(audit) log [DBG] : from='mon.2 -' entity='mon.' cmd=[{"prefix": "orch prometheus access info"}]: dispatch
debug 2023-10-24T15:11:28.502+0000 7fc7ec3f7700 0 [dashboard INFO request] [::ffff:10.1.0.106:5580] [GET] [200] [0.006s] [admin] [102.0B] /api/health/get_cluster_capacity
debug 2023-10-24T15:11:28.502+0000 7fc7e93f1700 0 [dashboard INFO request] [::ffff:10.1.0.106:44009] [GET] [200] [0.005s] [admin] [22.0B] /api/prometheus/notifications
debug 2023-10-24T15:11:28.502+0000 7fc81fa5d700 -1 Remote method threw exception: Traceback (most recent call last):
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 675, in get_prometheus_access_info
raise NotImplementedError()
NotImplementedError
ceph crash info dump:
{
"backtrace": [
" File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 675, in get_prometheus_access_info\n raise NotImplementedError()",
"NotImplementedError"
],
"ceph_version": "18.2.0",
"crash_id": "2023-10-24T14:59:52.921427Z_08e06575-0431-47fe-afc5-be8e4a7d1144",
"entity_name": "mgr.a",
"mgr_module": "rook",
"mgr_module_caller": "ActivePyModule::dispatch_remote get_prometheus_access_info",
"mgr_python_exception": "NotImplementedError",
"os_id": "centos",
"os_name": "CentOS Stream",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-mgr",
"stack_sig": "bbf52dcdbbe54d67edf59ebdb5d201fffd921db5a9dd4431964c2aaac2250c7e",
"timestamp": "2023-10-24T14:59:52.921427Z",
"utsname_hostname": "k8s4",
"utsname_machine": "x86_64",
"utsname_release": "5.15.0-87-generic",
"utsname_sysname": "Linux",
"utsname_version": "#97-Ubuntu SMP Mon Oct 2 21:09:21 UTC 2023"
}
--
Rich
Hi all,
I couldn't work out from the docs what status -125 means. I'm getting a
500 response status code when I call the RGW admin APIs, and the only
log in the RGW log files is as follows.
s3:get_obj recalculating target
initializing for trans_id =
tx00000aa90f570fb8281cf-006537bf9e-84395fa-default
s3:get_obj reading permissions
getting op 1
s3:put_obj verifying requester
s3:put_obj normalizing buckets and tenants
s3:put_obj init permissions
s3:put_obj recalculating target
s3:put_obj reading permissions
s3:put_obj init op
s3:put_obj verifying op mask
s3:put_obj verifying op permissions
s3:put_obj verifying op params
s3:put_obj pre-executing
s3:put_obj executing
:modify_user completing
WARNING: set_req_state_err err_no=125 resorting to 500
:modify_user op status=-125
:modify_user http status=500
====== req done req=0x7f3f85a78620 op status=-125 http_status=500
latency=0.076000459s ======
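If I read it correctly, the op status is a negated Linux errno; looking up 125 (just for reference):
python3 -c 'import errno, os; print(errno.errorcode[125], os.strerror(125))'   # ECANCELED Operation canceled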
Can anyone explain what this error means and why it's happening?
Best Regards,
Mahnoosh
Hey all,
My Ceph cluster is managed mostly by cephadm / ceph orch to avoid
circular dependencies in our infrastructure deployment. Our RadosGW
endpoints, however, are managed by Kubernetes, since it provides proper
load balancing and service health checks.
This leaves me in the unsatisfactory situation that Ceph complains about
'stray' RGW daemons in the cluster. The only two solutions I have found
are a) to turn off the warning, which applies to all daemons and not
just the RGWs (not pretty!), or b) to move the deployment out of
Kubernetes. For the latter, I could define external Endpoints in
Kubernetes, so that I still have load balancing, but then I don't have
proper health checks any more. Meaning, if one of the RGW endpoints goes
down, requests to our S3 endpoint will intermittently time out in
round-robin fashion (not pretty at all!).
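(The global toggle, if I recall the option name correctly, is
ceph config set mgr mgr/cephadm/warn_on_stray_daemons false
but that silences the warning for every stray daemon type, not just RGWs.)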
Can you think of a better option to solve this? I would already be
satisfied with turning off the warning for RGW daemons only, but there
doesn't seem to be a config option for that.
Thanks
Janek
Hi,
I recently moved from a manual Ceph deployment using Saltstack to a
hybrid of Saltstack and cephadm / ceph orch. We are provisioning our
Ceph hosts using a stateless PXE RAM root, so I definitely need
Saltstack to bootstrap at least the Ceph APT repository and the MON/MGR
deployment. After that, ceph orch can take over and deploy the remaining
daemons.
The MONs/MGRs are deployed after each reboot with
cephadm deploy --name mon.{{ ceph.node_id }} --fsid {{ ceph.conf.global.fsid }} --config /etc/ceph/ceph.conf
cephadm deploy --name mgr.{{ ceph.node_id }} --fsid {{ ceph.conf.global.fsid }} --config /etc/ceph/ceph.conf
(the MON store is provided in /var/lib/ceph/{{ ceph.conf.global.fsid }}/mon.{{ ceph.node_id }}).
Since cephadm ceph-volume lvm activate --all is broken (see
https://tracker.ceph.com/issues/55395), I am activating each OSD
individually like this:
cephadm deploy --name osd.{{ osd_id }} --fsid {{ ceph.conf.global.fsid }} --osd-fsid {{ osd_fsid }} --config /etc/ceph/ceph.conf
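For context, the osd_id / osd_fsid pairs come from ceph-volume, roughly like this (a sketch; FSID stands in for ceph.conf.global.fsid and the jq paths may need adjusting for other setups):
cephadm ceph-volume -- lvm list --format json |
  jq -r 'to_entries[] | [.key, .value[0].tags["ceph.osd_fsid"]] | @tsv' |
  while read -r osd_id osd_fsid; do
    cephadm deploy --name "osd.${osd_id}" --fsid "${FSID}" --osd-fsid "${osd_fsid}" --config /etc/ceph/ceph.conf
  done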
Now my question: Is there a better way to do this and can ceph orch take
care of this in the same way it deploys my MDS?
All OSDs are listed as <unmanaged> in ceph orch ls (I think this is by
design?) and I cannot find a way to activate them automatically via ceph
orch when the host boots up. I tried
ceph cephadm osd activate HOSTNAME,
but all I get is "Created no osd(s) on host HOSTNAME; already created?"
The docs only talk about how to create new OSDs, not about how to
automatically redeploy existing OSDs after a fresh boot. It seems to be
generally assumed that OSD deployments are persistent, so that the next
time the host boots, systemd simply activates the existing units.
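(On a persistent host those would be the cephadm-generated units, something like ceph-<fsid>@osd.<id>.service if I remember the naming right; on a RAM root they are gone after every reboot, hence the manual redeploy.)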
I'd be glad about any hints!
Janek
Hi,
The latest documentation
<https://docs.ceph.com/en/latest/radosgw/s3-notification-compatibility/#supp…>
says that
"pulling and acking of events stored in Ceph (as an internal destination)"
is listed as supported for S3 notifications.
If I got it right, pull-based subscriptions were only possible with the
radosgw PubSub sync module, which was removed in the Reef release
<https://docs.ceph.com/en/latest/releases/reef/#:~:text=The%20pubsub%20funct…>.
Suggested change: the documentation currently reads "Currently, we
support: HTTP/S, Kafka and AMQP. And also support pulling and acking of
events stored in Ceph (as an internal destination)." The part about
pulling and acking should probably be removed, since that functionality
no longer exists.
Best regards,
Artem