Hello team,
I have a Ceph cluster deployed using ceph-ansible, running on Ubuntu 20.04, with 6 hosts: 3 hosts for OSDs and 3 hosts used as monitors and managers. I have deployed RGW on all of those hosts, with the rgwloadbalancer role on top of them. For testing purposes I switched off one OSD to check whether the rest could keep working. The test went well, as expected; unfortunately, after bringing the OSD back, RGW failed to connect through the dashboard. Below is the message:
The Object Gateway Service is not configured. Error connecting to Object Gateway. Please consult the documentation
<https://docs.ceph.com/en/latest/mgr/dashboard/#enabling-the-object-gateway-…>
on how to configure and enable the Object Gateway management functionality.
I would like to ask how to solve this issue, or alternatively how I can completely remove RGW and redeploy it afterwards.
root@ceph-mon1:~# ceph -s
cluster:
id: cb0caedc-eb5b-42d1-a34f-96facfda8c27
health: HEALTH_OK
services:
mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3 (age 72m)
mgr: ceph-mon2(active, since 71m), standbys: ceph-mon3, ceph-mon1
osd: 48 osds: 48 up (since 79m), 48 in (since 3d)
rgw: 6 daemons active (6 hosts, 1 zones)
data:
pools: 9 pools, 257 pgs
objects: 59.49k objects, 314 GiB
usage: 85 TiB used, 348 TiB / 433 TiB avail
pgs: 257 active+clean
io:
client: 2.0 KiB/s wr, 0 op/s rd, 0 op/s wr
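From the docs, I believe the way to (re)point the dashboard at RGW would be something like the following, but I'm not sure it's the right fix for my case (the "dashboard" uid below is just my guess at the user name ceph-ansible created, please correct me):
# check that the admin user the dashboard talks to still exists and has keys
radosgw-admin user info --uid=dashboard
# re-feed its keys to the dashboard (Octopus/Pacific file-based syntax)
echo -n "<access_key>" > access.key
echo -n "<secret_key>" > secret.key
ceph dashboard set-rgw-api-access-key -i access.key
ceph dashboard set-rgw-api-secret-key -i secret.key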
Kindly help
Best Regards
Michel
Hi,
I've increased the placement group count in my Octopus cluster, starting with the index pool, and it caused almost 2.5 hours of bad performance for the users. I'm planning to increase the data pool as well, but first I'd like to know whether there is any way to make it smoother.
At the moment I have these values:
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1
But it seems like this still generates slow ops.
Should I turn off scrubbing, or is there any other way to make it even smoother? I've sketched below what I'm considering.
Some information about the setup:
* I have 9 nodes; each has 2x NVMe drives with 4 OSDs on them, and this is where the index pool lives.
* The index pool currently has 2048 PGs.
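For the data pool this is what I'm currently considering; treat it as an untested sketch, and the pool name is a placeholder:
# pause scrubbing while the PGs split
ceph osd set noscrub
ceph osd set nodeep-scrub
# make the mgr step pgp_num up more gently (default ratio is 0.05)
ceph config set mgr target_max_misplaced_ratio 0.03
# raise pg_num; the actual data movement is then done gradually by the mgr
ceph osd pool set <data-pool> pg_num 4096
# once backfill has settled:
ceph osd unset noscrub
ceph osd unset nodeep-scrub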
Thank you
Question:
What does the future hold with regard to cephadm vs rpm/deb packages? If it is now suggested to use cephadm, and thus containers, to deploy new clusters, is there an intent, at some time in the future, to no longer support rpm/deb packages for Linux systems and only support the cephadm container method?
I am not asking to argue containers vs traditional bare metal installs. I am just trying to plan for the future. Thanks
-Chris
Hi everyone,
(sorry for the spam, apparently I was not subscribed to the ml)
I have a Ceph test cluster and a Proxmox test cluster (to try upgrades in test before prod).
My Ceph cluster is made up of three servers running Debian 11, with two separate networks (cluster_network and public_network, in VLANs).
It is on Ceph version 16.2.10 (cephadm with Docker).
Each server has one MGR, one MON and 8 OSDs.
cluster:
id: xxx
health: HEALTH_OK
services:
mon: 3 daemons, quorum ceph01,ceph03,ceph02 (age 2h)
mgr: ceph03(active, since 77m), standbys: ceph01, ceph02
osd: 24 osds: 24 up (since 7w), 24 in (since 6M)
data:
pools: 3 pools, 65 pgs
objects: 29.13k objects, 113 GiB
usage: 344 GiB used, 52 TiB / 52 TiB avail
pgs: 65 active+clean
io:
client: 1.3 KiB/s wr, 0 op/s rd, 0 op/s wr
The Proxmox cluster is also made up of 3 servers running Proxmox 7.2-7 (with Proxmox Ceph Pacific, which is on version 16.2.9). The Ceph storage used is RBD (on the Ceph public_network). I added the RBD datastores simply via the GUI.
So far so good. I have several VMs on each of the Proxmox nodes.
When I update Ceph to 16.2.11, that's where things go wrong.
I don't like it when the update does everything for me without control, so I did a "staggered upgrade", following the official procedure (https://docs.ceph.com/en/pacific/cephadm/upgrade/#staggered-upgrade). As the version I'm starting from doesn't support staggered upgrades, I followed the procedure at (https://docs.ceph.com/en/pacific/cephadm/upgrade/#upgrading-to-a-version-th…).
When I do the "ceph orch redeploy" of the two standby MGRs, everything is fine.
I then do "sudo ceph mgr fail", and everything is fine (it fails over correctly to an MGR that was standby, so I get a 16.2.11 MGR).
However, when I do "sudo ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.11 --daemon-types mgr", it upgrades the last MGR that had not been updated yet (so far everything is still fine), but it does a final restart of all the MGRs to finish, and at that point Proxmox visibly loses the RBD and shuts off all my VMs.
Here is the message in the proxmox syslog:
Feb 2 16:20:52 pmox01 QEMU[436706]: terminate called after throwing an instance of 'std::system_error'
Feb 2 16:20:52 pmox01 QEMU[436706]: what(): Resource deadlock avoided
Feb 2 16:20:52 pmox01 kernel: [17038607.686686] vmbr0: port 2(tap102i0) entered disabled state
Feb 2 16:20:52 pmox01 kernel: [17038607.779049] vmbr0: port 2(tap102i0) entered disabled state
Feb 2 16:20:52 pmox01 systemd[1]: 102.scope: Succeeded.
Feb 2 16:20:52 pmox01 systemd[1]: 102.scope: Consumed 43.136s CPU time.
Feb 2 16:20:53 pmox01 qmeventd[446872]: Starting cleanup for 102
Feb 2 16:20:53 pmox01 qmeventd[446872]: Finished cleanup for 102
For Ceph, everything is fine: it does the update and tells me everything is OK at the end.
Ceph is now on 16.2.11 and the health is OK.
When I downgrade the MGRs and start the procedure again, I hit the same problem. It's very reproducible.
According to my tests, the "sudo ceph orch upgrade" command always gives me trouble, even when trying a real staggered upgrade from and to version 16.2.11 with the command:
sudo ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.11 --daemon-types mgr --hosts ceph01 --limit 1
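(Between each step I check the state with the usual commands:
sudo ceph versions                     # per-daemon version breakdown
sudo ceph orch ps --daemon-type mgr    # which mgr runs which image
sudo ceph orch upgrade status          # progress of the orchestrated upgrade
and on the Ceph side everything always reports as OK.)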
Does anyone have an idea?
Thank you everyone !
Pierre.
Hi to all!
We are running a Ceph cluster (Octopus) on (99%) CentOS 7 (deployed at the time with ceph-deploy) and we would like to upgrade it. As far as I know, for Pacific (and later releases) there are no packages for the CentOS 7 distribution (at least not on download.ceph.com), so we need to upgrade (change) not only Ceph but also the distribution.
What is the recommended path to do so?
We could upgrade (reinstall) all the nodes to Rocky 8 and then upgrade Ceph to Quincy, but then we would be "stuck" with "not the latest" distribution and would probably have to upgrade (reinstall) again in the near future.
Our second idea is to leverage cephadm (which we would like to implement anyway) and switch from rpms to containers, but I don't have a clear vision of how to do it. I was thinking of the following:
1. install a new monitor/manager node on Rocky 9.
2. prepare the node for cephadm.
3. start the manager/monitor containers on that node.
4. repeat for the other monitors.
5. repeat for the OSD servers.
I'm not sure how to execute points 2 and 3. The documentation says how to bootstrap a NEW cluster and how to ADOPT an existing one, but our situation is a hybrid (or at least in my mind it is).
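To make points 2 and 3 concrete, my current (untested) reading of the docs is roughly the following, where ceph-new is a hypothetical name for the Rocky 9 node:
# on the existing rpm-based cluster: enable the orchestrator
ceph mgr module enable cephadm
ceph orch set backend cephadm
# adopt the legacy mon/mgr daemons in place (per the adoption docs;
# the Filestore OSDs would stay legacy for now, which I expect will
# raise stray-daemon warnings)
cephadm adopt --style legacy --name mon.<hostname>
cephadm adopt --style legacy --name mgr.<hostname>
# distribute the cluster ssh key, then add the new node and its daemons
ceph cephadm get-pub-key > ceph.pub
ssh-copy-id -f -i ceph.pub root@ceph-new
ceph orch host add ceph-new
ceph orch daemon add mon ceph-new
Is that the right direction, or am I missing a step?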
I also cannot simply adopt my current cluster with cephadm, because 30% of our OSDs are still on Filestore. My intention was to drain those, reinstall them and then adopt them, but I would like to avoid multiple reinstallations if not necessary. In my mind all the OSD servers will be drained before being reinstalled, just to be sure to have a "fresh" start.
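For the drain itself I have the standard loop in mind (OSD id 12 is just an example):
ceph osd out 12
# wait for ceph -s to return to active+clean, then:
systemctl stop ceph-osd@12
ceph osd purge 12 --yes-i-really-mean-it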
Have you any ideas and/or advice to give us?
Thanks a lot!
Iztok
P.S. I saw that the cephadm script doesn't support Rocky. I can modify it to do so and it should work, but is there a plan to support it officially?
--
Iztok Gregori
ICT Systems and Services
Elettra - Sincrotrone Trieste S.C.p.A.
Telephone: +39 040 3758948
http://www.elettra.eu
Hey all,
We will be having a Ceph science/research/big cluster call on Tuesday
January 31st. If anyone wants to discuss something specific they can add
it to the pad linked below. If you have questions or comments you can
contact me.
This is an informal open call of community members mostly from
hpc/htc/research environments where we discuss whatever is on our minds
regarding ceph: updates, outages, features, maintenance, etc. There is
no set presenter, but I do attempt to keep the conversation lively.
Pad URL:
https://pad.ceph.com/p/Ceph_Science_User_Group_20230131
Ceph calendar event details:
January 31, 2023
15:00 UTC
4pm Central European
9am Central US
Description: Main pad for discussions:
https://pad.ceph.com/p/Ceph_Science_User_Group_Index
Meetings will be recorded and posted to the Ceph Youtube channel.
To join the meeting on a computer or mobile phone:
https://bluejeans.com/908675367?src=calendarLink
To join from a Red Hat Deskphone or Softphone, dial: 84336.
Connecting directly from a room system?
1.) Dial: 199.48.152.152 or bjn.vc
2.) Enter Meeting ID: 908675367
Just want to dial in on your phone?
1.) Dial one of the following numbers: 408-915-6466 (US)
See all numbers: https://www.redhat.com/en/conference-numbers
2.) Enter Meeting ID: 908675367
3.) Press #
Want to test your video connection? https://bluejeans.com/111
Kevin
--
Kevin Hrpcek
NASA VIIRS Atmosphere SIPS/TROPICS
Space Science & Engineering Center
University of Wisconsin-Madison
Greetings to the enthusiastic official Ceph team!
Recently, while our company was using Ceph Quincy (stable), we found that the initial value of osd_recovery_max_active could not be changed.
When I try to set osd_mclock_override_recovery_settings to true, I get an error; it seems there is no such option. How should I modify the initial value of osd_recovery_max_active?
root@pve-ceph01:~# ceph config set osd osd_mclock_override_recovery_settings true
Error EINVAL: unrecognized config option 'osd_mclock_override_recovery_settings'
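In case it is version-related: my assumption is that our Quincy build simply predates this option (I believe it only appeared in a later point release). The two alternatives below do exist in Quincy and should have a similar effect:
# option 1: pick the built-in mclock profile that favours recovery
ceph config set osd osd_mclock_profile high_recovery_ops
# option 2: switch back to the wpq scheduler, which honours
# osd_recovery_max_active again (requires restarting the OSDs)
ceph config set osd osd_op_queue wpq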
Hi All,
I'm getting this error while setting up a ceph cluster. I'm relatively new to ceph, so there is no telling what kind of mistakes I've been making. I'm using cephadm, ceph v16 and I apparently have a stray daemon. But it also doesn't seem to exist and I can't get ceph to forget about it.
$ ceph health detail
[WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
stray daemon mon.cmon01 on host cmgmt01 not managed by cephadm
mon.cmon01 also shows up in dashboard->hosts as running on cmgmt01. It does not show up in the monitors section though.
But, there isn't a monitor daemon running on that machine at all (no podman container, not in process list, not listening on a port).
On that host in cephadm shell,
# ceph orch daemon rm mon.cmon01 --force
Error EINVAL: Unable to find daemon(s) ['mon.cmon01']
I don't currently have any real data on the cluster, so I've also tried deleting the existing pools (except device_health_metrics) in case ceph was connecting that monitor to one of the pools.
I'm not sure what to try next in order to get ceph to forget about that daemon.
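My next guess (unverified) is that the name lingers in the monmap rather than in cephadm's inventory, so I plan to check:
# see whether cmon01 is still listed in the monmap
ceph mon dump
# if it is, remove it from the monmap (not the same as 'orch daemon rm')
ceph mon remove cmon01
Does that sound right, or is there a better way?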