February 2023 - ceph-users

by Michel Niyoyita

Hello team, I have a ceph cluster deployed using ceph-ansible , running on ubuntu 20.04 OS which have 6 hosts , 3 hosts for OSD and 3 hosts used as monitors and managers , I have deployed RGW on all those hosts and RGWLOADBALENCER on top of them , for testing purpose , I have switched off one OSD , to check if the rest can work properly , The test went well as expected, unfortunately while coming back an OSD , the RGW failed to connect through the dashboard. below is the message : The Object Gateway Service is not configuredError connecting to Object GatewayPlease consult the documentation <https://docs.ceph.com/en/latest/mgr/dashboard/#enabling-the-object-gateway-…> on how to configure and enable the Object Gateway management functionality. would like to ask how to solve that issue or how can I proceed to remove completely RGW and redeploy it after . root@ceph-mon1:~# ceph -s cluster: id: cb0caedc-eb5b-42d1-a34f-96facfda8c27 health: HEALTH_OK services: mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3 (age 72m) mgr: ceph-mon2(active, since 71m), standbys: ceph-mon3, ceph-mon1 osd: 48 osds: 48 up (since 79m), 48 in (since 3d) rgw: 6 daemons active (6 hosts, 1 zones) data: pools: 9 pools, 257 pgs objects: 59.49k objects, 314 GiB usage: 85 TiB used, 348 TiB / 433 TiB avail pgs: 257 active+clean io: client: 2.0 KiB/s wr, 0 op/s rd, 0 op/s wr Kindly help Best Regards Michel

1 year, 2 months

2
1
0 0

PG increase / data movement fine tuning

by Szabo, Istvan (Agoda)

Hi, I've increased the placement group in my octopus cluster firstly in the index pool and I gave almost 2.5 hours bad performance for the user. I'm planning to increase the data pool also, but first I'd like to know is there any way to make it smoother or not. At the moment I have these values: osd_max_backfills = 1 osd_recovery_max_active = 1 osd_recovery_op_priority = 1 But seems like this still generates slow ops. Should I turn off scrubbing or any other way to make it even smoother? Some information about the setup: * I have 9 nodes, each of them has 2x nvme drives with 4osd on those and this is where the index pool lives. * Currently has 2048 pg-s for the index pool Thank you ________________________________ This message is confidential and is for the sole use of the intended recipient(s). It may also be privileged or otherwise protected by copyright or other legal rules. If you have received it by mistake please let us know by reply email and delete it from your system. It is prohibited to copy this message or disclose its content to anyone. Any confidentiality or privilege is not waived or lost by any mistaken delivery or unauthorized disclosure of the message. All messages sent to and from Agoda may be monitored to ensure compliance with company policies, to protect the company's interests and to remove potential malware. Electronic messages may be intercepted, amended, lost or deleted, or contain viruses.

1 year, 2 months

1
0
0 0

cephadm and the future

by Christopher Durham

Question: What does the future hold with regard to cephadm vs rpm/deb packages? If it is now suggested to use cephadm and thus containers to deploy new clusters, what does the future hold? Is there an intent, at sometime in the future, to no longer support rpm/deb packages for Linux systems, and only support the cephadm container method? I am not asking to argue containers vs traditional bare metal installs. I am just trying to plan for the future. Thanks -Chris

1 year, 2 months

2
1
0 0

'ceph orch upgrade...' causes an rbd outage on a proxmox cluster

by Pierre BELLEMAIN

Hi everyone, (sorry for the spam, apparently I was not subscribed to the ml) I have a ceph test cluster and a proxmox test cluster (for try upgrade in test before the prod). My ceph cluster is made up of three servers running debian 11, with two separate networks (cluster_network and public_network, in VLANs). In ceph version 16.2.10 (cephadm with docker). Each server has one MGR, one MON and 8 OSDs. cluster: id: xxx health: HEALTH_OK services: mon: 3 daemons, quorum ceph01,ceph03,ceph02 (age 2h) mgr: ceph03(active, since 77m), standbys: ceph01, ceph02 osd: 24 osds: 24 up (since 7w), 24 in (since 6M) data: pools: 3 pools, 65 pgs objects: 29.13k objects, 113 GiB usage: 344 GiB used, 52 TiB / 52 TiB avail pgs: 65 active+clean io: client: 1.3 KiB/s wr, 0 op/s rd, 0 op/s wr The proxmox cluster is also made up of 3 servers running proxmox 7.2-7 (with proxmox ceph pacific which is on 16.2.9 version). The ceph storage used is RBD (on the ceph public_network). I added the RBD datastores simply via the GUI. So far so good. I have several VMs, on each of the proxmox. When I update ceph to 16.2.11, that's where things go wrong. I don't like when the update does everything for me without control, so I did a "staggered upgrade", following the official procedure (https://docs.ceph.com/en/pacific/cephadm/upgrade/#staggered-upgrade). As the version I'm starting from doesn't support staggered upgrade, I follow the procedure (https://docs.ceph.com/en/pacific/cephadm/upgrade/#upgrading-to-a-version-th…). When I do the "ceph orch redeploy" of the two standby MGRs, everything is fine. I do the "sudo ceph mgr fail", everything is fine (it switches well to an mgr which was standby, so I get an MGR 16.2.11). However, when I do the "sudo ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.11 --daemon-types mgr", it updates me the last MGR which was not updated (so far everything is still fine), but it does a last restart of all the MGRs to finish, and there, the proxmox visibly loses the RBD and turns off all my VMs. Here is the message in the proxmox syslog: Feb 2 16:20:52 pmox01 QEMU[436706]: terminate called after throwing an instance of 'std::system_error' Feb 2 16:20:52 pmox01 QEMU[436706]: what(): Resource deadlock avoided Feb 2 16:20:52 pmox01 kernel: [17038607.686686] vmbr0: port 2(tap102i0) entered disabled state Feb 2 16:20:52 pmox01 kernel: [17038607.779049] vmbr0: port 2(tap102i0) entered disabled state Feb 2 16:20:52 pmox01 systemd[1]: 102.scope: Succeeded. Feb 2 16:20:52 pmox01 systemd[1]: 102.scope: Consumed 43.136s CPU time. Feb 2 16:20:53 pmox01 qmeventd[446872]: Starting cleanup for 102 Feb 2 16:20:53 pmox01 qmeventd[446872]: Finished cleanup for 102 For ceph, everything is fine, it does the update, and tells me everything is OK in the end. Ceph is now on 16.2.11 and the health is OK. When I redo a downgrade of the MGRs, I have the problem again and when I start the procedure again, I still have the problem. It's very reproducible. According to my tests, the "sudo ceph orch upgrade" command always gives me trouble, even when trying a real staggered upgrade from and to version 16.2.11 with the command: sudo ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.11 --daemon-types mgr --hosts ceph01 --limit 1 Does anyone have an idea? Thank you everyone ! Pierre.

1 year, 2 months

1
1
0 0

Ceph Upgrade path

by Iztok Gregori

Hi to all! We are running a Ceph cluster (Octopus) on (99%) CentOS 7 (deployed at the time with ceph-deploy) and we would like to upgrade it. As far as I know for Pacific (and later releases) there aren't packages for CentOS 7 distribution (at least not on download.ceph.com), so we need to upgrade (change) not only Ceph but also the distribution. What is the raccomended path to do so? We could upgrade (reinstall) all the nodes to Rocky 8 and then upgrade Ceph to Quincy, but we will "stuck" with "not the latest" distribution and probably we will have to upgrade (reinstall) again in the near future. Our second idea is to leverage cephadm (which we would like to implement) and switch from rpms to containers, but I don't have a clear vision of how to do it. I was thinking to: 1. install a new monitor/manager with Rocky 9. 2. prepare the node for cephadm. 3. start the manager/monitor containers on that node. 4. repeat for the other monitors. 5. repeat for the OSD servers. I'm not sure how to execute the point 2 and 3. The documentation says how to bootstrap a NEW cluster and how to ADOPT an existing one, but our situation is a hybrid (or in my mind it is...). I cannot also adopt my current cluster to cephadm because we have 30% of our OSD still on filestore. My intention was to drain them, reinstall them and then adopt them. But I would like to avoid (if not necessary) multiple reinstallations. In my mind all the OSD servers will be drained before been reinstalled, just to be sure to have a "fresh" start). Have you any ideas and/or advice to give us? Thanks a lot! Iztok P.S. I saw that the script cephadm doesn't support Rocky, I can modify it to do so and it should work, but is there a plan to officially support it? -- Iztok Gregori ICT Systems and Services Elettra - Sincrotrone Trieste S.C.p.A. Telephone: +39 040 3758948 http://www.elettra.eu

1 year, 2 months

3
2
0 0

'ceph orch upgrade...' causes an rbd outage on a proxmox cluster

by Pierre BELLEMAIN

Hi everyone, I have a ceph test cluster and a proxmox test cluster (for try upgrade in test before the prod). My ceph cluster is made up of three servers running debian 11, with two separate networks (cluster_network and public_network, in VLANs). In ceph version 16.2.10 (cephadm with docker). Each server has one MGR, one MON and 8 OSDs. cluster: id: xxx health: HEALTH_OK services: mon: 3 daemons, quorum ceph01,ceph03,ceph02 (age 2h) mgr: ceph03(active, since 77m), standbys: ceph01, ceph02 osd: 24 osds: 24 up (since 7w), 24 in (since 6M) data: pools: 3 pools, 65 pgs objects: 29.13k objects, 113 GiB usage: 344 GiB used, 52 TiB / 52 TiB avail pgs: 65 active+clean io: client: 1.3 KiB/s wr, 0 op/s rd, 0 op/s wr The proxmox cluster is also made up of 3 servers running proxmox 7.2-7. The ceph storage used is RBD (on the ceph public_network). I added the RBD datastores simply via the GUI. So far so good. I have several VMs, on each of the proxmox. When I update ceph to 16.2.11, that's where things go wrong. I don't like when the update does everything for me without control, so I did a "staggered upgrade", following the official procedure (https://docs.ceph.com/en/pacific/cephadm/upgrade/#staggered-upgrade). As the version I'm starting from doesn't support staggered upgrade, I follow the procedure (https://docs.ceph.com/en/pacific/cephadm/upgrade/#upgrading-to-a-version-th…). When I do the "ceph orch redeploy" of the two standby MGRs, everything is fine. I do the "sudo ceph mgr fail", everything is fine (it switches well to an mgr which was standby, so I get an MGR 16.2.11). However, when I do the "sudo ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.11 --daemon-types mgr", it updates me the last MGR which was not updated (so far everything is still fine), but it does a last restart of all the MGRs to finish, and there, the proxmox visibly loses the RBD and turns off all my VMs. Here is the message in the proxmox syslog: Feb 2 16:20:52 pmox01 QEMU[436706]: terminate called after throwing an instance of 'std::system_error' Feb 2 16:20:52 pmox01 QEMU[436706]: what(): Resource deadlock avoided Feb 2 16:20:52 pmox01 kernel: [17038607.686686] vmbr0: port 2(tap102i0) entered disabled state Feb 2 16:20:52 pmox01 kernel: [17038607.779049] vmbr0: port 2(tap102i0) entered disabled state Feb 2 16:20:52 pmox01 systemd[1]: 102.scope: Succeeded. Feb 2 16:20:52 pmox01 systemd[1]: 102.scope: Consumed 43.136s CPU time. Feb 2 16:20:53 pmox01 qmeventd[446872]: Starting cleanup for 102 Feb 2 16:20:53 pmox01 qmeventd[446872]: Finished cleanup for 102 For ceph, everything is fine, it does the update, and tells me everything is OK in the end. Ceph is now on 16.2.11 and the health is OK. When I redo a downgrade of the MGRs, I have the problem again and when I start the procedure again, I still have the problem. It's very reproducible. According to my tests, the "sudo ceph orch upgrade" command always gives me trouble, even when trying a real staggered upgrade from and to version 16.2.11 with the command: sudo ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.11 --daemon-types mgr --hosts ceph01 --limit 1 Does anyone have an idea? Thank you everyone ! Pierre.

1 year, 2 months

1
0
0 0

January Ceph Science Virtual User Group

by Kevin Hrpcek

Hey all, We will be having a Ceph science/research/big cluster call on Tuesday January 31st. If anyone wants to discuss something specific they can add it to the pad linked below. If you have questions or comments you can contact me. This is an informal open call of community members mostly from hpc/htc/research environments where we discuss whatever is on our minds regarding ceph. Updates, outages, features, maintenance, etc...there is no set presenter but I do attempt to keep the conversation lively. Pad URL: https://pad.ceph.com/p/Ceph_Science_User_Group_20230131 Ceph calendar event details: January 31, 2023 15:00 UTC 4pm Central European 9am Central US Description: Main pad for discussions: https://pad.ceph.com/p/Ceph_Science_User_Group_Index Meetings will be recorded and posted to the Ceph Youtube channel. To join the meeting on a computer or mobile phone: https://bluejeans.com/908675367?src=calendarLink To join from a Red Hat Deskphone or Softphone, dial: 84336. Connecting directly from a room system? 1.) Dial: 199.48.152.152 or bjn.vc <http://bjn.vc> 2.) Enter Meeting ID: 908675367 Just want to dial in on your phone? 1.) Dial one of the following numbers: 408-915-6466 (US) See all numbers: https://www.redhat.com/en/conference-numbers 2.) Enter Meeting ID: 908675367 3.) Press # Want to test your video connection? https://bluejeans.com/111 Kevin -- Kevin Hrpcek NASA VIIRS Atmosphere SIPS/TROPICS Space Science & Engineering Center University of Wisconsin-Madison

1 year, 2 months

4
3
0 0

ceph quincy cannot change osd_recovery_max_active, please help

by 辣条➀号

Greetings from the enthusiastic ceph official team! Recently, when our company was using ceph quincy (stable), it was found that the initial value of osd_recovery_max_active could not be changed. 1. When I try to modify osd_mclock_override_recovery_settings to true, I get an error. It seems that there is no such option? How should I modify the initial value of osd_recovery_max_active root@pve-ceph01:~# ceph config set osd osd_mclock_override_recovery_settings true Error EINVAL: unrecognized config option 'osd_mclock_override_recovery_settings'

1 year, 2 months

1
0
0 0

CEPHADM_STRAY_DAEMON does not exist, how do I remove knowledge of it from ceph?

by ceph＠mikesoffice.com

Hi All, I'm getting this error while setting up a ceph cluster. I'm relatively new to ceph, so there is no telling what kind of mistakes I've been making. I'm using cephadm, ceph v16 and I apparently have a stray daemon. But it also doesn't seem to exist and I can't get ceph to forget about it. $ ceph health detail [WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm stray daemon mon.cmon01 on host cmgmt01 not managed by cephadm mon.cmon01 also shows up in dashboard->hosts as running on cmgmt01. It does not show up in the monitors section though. But, there isn't a monitor daemon running on that machine at all (no podman container, not in process list, not listening on a port). On that host in cephadm shell, # ceph orch daemon rm mon.cmon01 --force Error EINVAL: Unable to find daemon(s) ['mon.cmon01'] I don't currently have any real data on the cluster, so I've also tried deleting the existing pools (except device_health_metrics) in case ceph was connecting that monitor to one of the pools. I'm not sure what to try next in order to get ceph to forget about that daemon.

1 year, 2 months

3
2
0 0

How to get RBD client log?

by Jinhao Hu

Hi, How can I collect the logs of the RBD client?

1 year, 2 months

3
3
0 0

2024

2023

2022

2021

2020

2019

ceph-users February 2023