(sorry for duplicate emails)
This turns out to be a good question, actually.
The cluster is running Quincy, 17.2.6.
The compute node that is running the VM is Proxmox, version 7.4-3.
Supposedly this is fairly new, but librbd1 claims to be version 14.2.21
when I check with "apt list". We are not using Proxmox's own Ceph
cluster release. We haven't had any issues with this setup before, but
until now we had neither used erasure-coded pools nor had a node
half-dead for such a long time.
The VM is configured through Proxmox, which is not libvirt but similar,
and krbd is not enabled. I don't know for sure whether Proxmox links its
own librbd into qemu/kvm.
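In case it helps, this is roughly how one can check which librbd the
running QEMU process has actually mapped; a minimal sketch, assuming a
Debian-based Proxmox node where the emulator binary is
/usr/bin/qemu-system-x86_64 and the VM process is named "kvm":

# packaged librbd version
apt list --installed 2>/dev/null | grep librbd1
# is the qemu binary linked against librbd at all?
ldd /usr/bin/qemu-system-x86_64 | grep -i rbd
# which librbd .so the running VM process actually has mapped
lsof -p "$(pidof -s kvm)" 2>/dev/null | grep -i librbd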
"ceph features" looks like this:
{
    "mon": [
        {
            "features": "0x3f01cfbf7ffdffff",
            "release": "luminous",
            "num": 5
        }
    ],
    "osd": [
        {
            "features": "0x3f01cfbf7ffdffff",
            "release": "luminous",
            "num": 24
        }
    ],
    "client": [
        {
            "features": "0x3f01cfb87fecffff",
            "release": "luminous",
            "num": 4
        },
        {
            "features": "0x3f01cfbf7ffdffff",
            "release": "luminous",
            "num": 12
        }
    ],
    "mgr": [
        {
            "features": "0x3f01cfbf7ffdffff",
            "release": "luminous",
            "num": 2
        }
    ]
}
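Since only the one image misbehaves, the next thing I plan to check is
whether it has stale watchers or locks; a minimal sketch, with
placeholder pool/image names:

rbd status <pool>/<image>    # current watchers on the image
rbd lock ls <pool>/<image>   # any advisory locks still held
rbd info <pool>/<image>      # features and the EC data pool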
Regards,
Peter
On 2023-09-29 at 17:55, Anthony D'Atri wrote:
> Which Ceph releases are installed on the VM and the back end? Is the VM using librbd through libvirt, or krbd?
>
>> On Sep 29, 2023, at 09:09, Peter Linder <peter.linder(a)fiberdirekt.se> wrote:
>>
>> Dear all,
>>
>> I have a problem where, after an OSD host lost connection to the sync/cluster rear network for many hours (the public network stayed online), a test VM using RBD can't overwrite its files. I can create a new file inside it just fine, but not overwrite one; the process just hangs.
>>
>> The VM's disk is on an erasure-coded data pool with a replicated pool in front of it. EC overwrites are enabled for the pool.
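>> For reference, this is the usual pattern for such a setup; a sketch
>> with placeholder pool/image names:
>>
>> ceph osd pool set <ec-pool> allow_ec_overwrites true
>> rbd create --size 100G --data-pool <ec-pool> <replicated-pool>/<image>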
>>
>> The cluster consists of 5 hosts with 4 OSDs each, plus separate hosts for compute. There are separate public and cluster networks. In this case, the AOC cable to the cluster network went link-down on a host; the cable had to be replaced and the host was rebooted. Recovery took about a week to complete. The host was half-down like this for about 12 hours.
>>
>> I have some other VMs with images in the same pool (4 in total), and they seem to work fine; it is just this one that can't overwrite.
>>
>> I'm thinking something is somehow wrong with just this image?
>>
>> Regards,
>>
>> Peter
Hi Ceph users and developers,
We invite you to join us at the User + Dev Relaunch, happening this
Thursday at 10:00 AM EDT! See below for more meeting details. Also see this
blog post to read more about the relaunch:
https://ceph.io/en/news/blog/2023/user-dev-meeting-relaunch/
We have two guest speakers who will present their focus topics during the
first 40 minutes of the session:
1. "What to do when Ceph isn't Ceph-ing" by Cory Snyder
Topics include troubleshooting tips, effective ways to gather help
from the community, ways to improve cluster health and insights, and more!
2. "Ceph Usability Improvements" by Jonas Sterr
A continuation of a talk from Cephalocon 2023, updated after trying
out the Reef Dashboard.
The last 20 minutes of the meeting will be dedicated to open discussion.
Feel free to add questions for the speakers or additional topics under the
"Open Discussion" section on the agenda:
https://pad.ceph.com/p/ceph-user-dev-monthly-minutes
If you have an idea for a focus topic you'd like to present at a future
meeting, you are welcome to submit it to this Google Form:
https://docs.google.com/forms/d/e/1FAIpQLSdboBhxVoBZoaHm8xSmeBoemuXoV_rmh4v…
Any Ceph user or developer is eligible to submit!
Thanks,
Laura Flores
Meeting link: https://meet.jit.si/ceph-user-dev-monthly
Time conversions:
UTC: Thursday, September 21, 14:00 UTC
Mountain View, CA, US: Thursday, September 21, 7:00 PDT
Phoenix, AZ, US: Thursday, September 21, 7:00 MST
Denver, CO, US: Thursday, September 21, 8:00 MDT
Huntsville, AL, US: Thursday, September 21, 9:00 CDT
Raleigh, NC, US: Thursday, September 21, 10:00 EDT
London, England: Thursday, September 21, 15:00 BST
Paris, France: Thursday, September 21, 16:00 CEST
Helsinki, Finland: Thursday, September 21, 17:00 EEST
Tel Aviv, Israel: Thursday, September 21, 17:00 IDT
Pune, India: Thursday, September 21, 19:30 IST
Brisbane, Australia: Friday, September 22, 0:00 AEST
Singapore, Asia: Thursday, September 21, 22:00 +08
Auckland, New Zealand: Friday, September 22, 2:00 NZST
--
Laura Flores
She/Her/Hers
Software Engineer, Ceph Storage <https://ceph.io>
Chicago, IL
lflores(a)ibm.com | lflores(a)redhat.com <lflores(a)redhat.com>
M: +17087388804
Since upgrading to 18.2.0, OSDs are restarting very frequently due to liveness probe failures, making the cluster unusable. Has anyone else seen this behavior?
Upgrade path: Ceph 17.2.6 to 18.2.0 (and Rook from 1.11.9 to 1.12.1),
on Ubuntu 20.04, kernel 5.15.0-79-generic.
Thanks.
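In case it helps with triage, this is roughly how we have been looking
at the failures; a sketch assuming a default Rook install in the
rook-ceph namespace, with placeholder pod names:

# probe definition and recent probe failures for one OSD pod
kubectl -n rook-ceph describe pod <rook-ceph-osd-pod> | grep -A5 Liveness
# kubelet events around the restarts (Unhealthy/Killing entries)
kubectl -n rook-ceph get events --sort-by=.lastTimestamp | grep -i liveness
# OSD log from the container instance that was just killed
kubectl -n rook-ceph logs <rook-ceph-osd-pod> --previous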
Hi team,
*Ceph-version*: Quincy, Reef
*OS*: AlmaLinux 8
*Issue*: snap_schedule doesn't create the scheduled snapshots consistently.
*Description:*
We are currently working with a 3-node Ceph cluster.
We are currently exploring the scheduled snapshot capability of the
ceph-mgr module.
To enable/configure scheduled snapshots, we followed this link:
https://docs.ceph.com/en/quincy/cephfs/snap-schedule/
Using this we were able to schedule a snapshot for the subvolume which we
created.
Initially, it was not working but after two days it started working.
Later, on a fresh cluster, we again tried to schedule a snapshot for a
subvolume and it worked fine.
But the next time we tried snapshot creation on a fresh cluster, it
didn't work.
We have observed this inconsistency in scheduling snapshots for a long
time now.
Every time we follow the exact same steps to add a snap_schedule, but
sometimes it works and sometimes it doesn't.
We have chronyd running and the timezone set to UTC.
Could you please help us out with this?
Kindly let me know if you require any kind of logs for this.
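For reference, these are the checks we run after adding a schedule; the
subvolume path below is just an example from our setup:

# confirm the schedule exists and is active for the subvolume path
ceph fs snap-schedule list /volumes/_nogroup/subvol1 --recursive=true
ceph fs snap-schedule status /volumes/_nogroup/subvol1
# re-activate in case the schedule was deactivated
ceph fs snap-schedule activate /volumes/_nogroup/subvol1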
Thanks and Regards,
Kushagra Gupta
Hi Team,
Any update on this?
Thanks and Regards,
Kushagra Gupta
On Tue, Sep 5, 2023 at 10:51 AM Kushagr Gupta <kushagrguptasps.mun(a)gmail.com>
wrote:
> *Ceph-version*: Quincy
> *OS*: CentOS 8 Stream
>
> *Issue*: Not able to find a standardized restoration procedure for
> subvolume snapshots.
>
> *Description:*
> Hi team,
>
> We are currently working in a 3-node ceph cluster.
> We are currently exploring the scheduled snapshot capability of the
> ceph-mgr module.
> To enable/configure scheduled snapshots, we followed the following link:
>
> https://docs.ceph.com/en/quincy/cephfs/snap-schedule/
>
> The scheduled snapshots are working as expected. But we are unable to find
> any standardized restoration procedure for the same.
>
> We have found the following link( not official documentation):
> https://www.suse.com/support/kb/doc/?id=000019627
>
> We have also found a link of cloning a new subvolume from snapshots:
> https://docs.ceph.com/en/reef/cephfs/fs-volumes/
> (Section: Cloning Snapshots)
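>
> As a concrete example of that cloning approach, this is the kind of
> command we mean; names in angle brackets are placeholders, syntax as
> we read it from the fs-volumes documentation:
>
> ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name>
> ceph fs clone status <vol_name> <target_subvol_name>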
>
> Is there a standard procedure to restore from a snapshot?
> By this I mean, is there some kind of command, maybe like
> ceph fs subvolume snapshot restore <snapshot-name>
>
> Or any other procedure please let us know.
>
> Thanks and Regards,
> Kushagra Gupta
>
Fellow cephalopods,
I'm trying to get quick, seamless NFS failover happening on my four-node
Ceph cluster.
I followed the instructions here:
https://docs.ceph.com/en/latest/cephadm/services/nfs/#high-availability-nfs
but testing shows that failover doesn't happen. When I placed node 2
("san2") in maintenance mode, the NFS service shut down:
Aug 24 14:19:03 san2 ceph-e2f1b934-ed43-11ec-80fa-04421a1a1d66-nfs-xcpnfs-1-0-san2-datsvq[1962479]: 24/08/2023 04:19:03 : epoch 64b8af5a : san2 : ganesha.nfsd-8[Admin] do_shutdown :MAIN :EVENT :Removing all exports.
Aug 24 14:19:13 san2 bash[3235994]: time="2023-08-24T14:19:13+10:00" level=warning msg="StopSignal SIGTERM failed to stop container ceph-e2f1b934-ed43-11ec-80fa-04421a1a1d66-nfs-xcpnfs-1-0-san2-datsvq in 10 seconds, resorting to SIGKILL"
Aug 24 14:19:13 san2 bash[3235994]: ceph-e2f1b934-ed43-11ec-80fa-04421a1a1d66-nfs-xcpnfs-1-0-san2-datsvq
Aug 24 14:19:13 san2 systemd[1]: ceph-e2f1b934-ed43-11ec-80fa-04421a1a1d66@nfs.xcpnfs.1.0.san2.datsvq.service: Main process exited, code=exited, status=137/n/a
Aug 24 14:19:14 san2 systemd[1]: ceph-e2f1b934-ed43-11ec-80fa-04421a1a1d66@nfs.xcpnfs.1.0.san2.datsvq.service: Failed with result 'exit-code'.
Aug 24 14:19:14 san2 systemd[1]: Stopped Ceph nfs.xcpnfs.1.0.san2.datsvq for e2f1b934-ed43-11ec-80fa-04421a1a1d66.
And that's it. The ingress IP didn't move.
Odder still, the cluster seems to have placed the ingress IP on node 1
(san1) while still directing traffic at the NFS service on node 2.
Do I need to more tightly connect the NFS service to the keepalived and
haproxy services, or do I need to expand the ingress service to refer
to multiple NFS services?
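For reference, this is how I have been checking where cephadm placed
the daemons; service names below are from my cluster:

ceph orch ls nfs       # where the NFS (ganesha) daemons run
ceph orch ls ingress   # what the ingress (haproxy/keepalived) spec covers
ceph orch ps | grep -e haproxy -e keepalived -e nfs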
Thank you.
--
Regards,
Thorne Lawler - Senior System Administrator
*DDNS* | ABN 76 088 607 265
First registrar certified ISO 27001-2013 Data Security Standard ITGOV40172
P +61 499 449 170
Hey Ceph-users,
I just noticed there is a post to oss-security
(https://www.openwall.com/lists/oss-security/2023/09/26/10) about a
security issue with Ceph RGW.
It is signed by IBM / Red Hat and includes a patch by DO.
I also raised an issue on the tracker
(https://tracker.ceph.com/issues/63004) about this, as I could not find
one yet.
It seems a weird way of disclosing such a thing, and I am wondering if
anybody knows any more about this?
Regards
Christian
Hi everybody,
The CLT met today as usual. We only had a few topics under discussion:
* the User + Dev relaunch went off well! We’d like reliable recordings and
have found Jitsi to be somewhat glitchy; Laura will communicate about
workarounds for that while we work on a longer-term solution (self-hosting
Jitsi has a better reputation and is a possibility). We also discussed a
GitHub repo for hosting presentation files, and organizing them on the
website.
* CVE handling. As noted elsewhere on the mailing list, CVE-2023-43040 (a
privilege escalation impacting RGW) was disclosed elsewhere, and we do not
have coordinated releases for it. This was not deemed important enough on
the security list for that effort, but we do want to be more prepared for
it than we were — our CVE handling process has broken down a bit since some
of the CVE work is now being handled by IBM instead of Red Hat. Tech leads
and IBM employees will be working on refining that so we have better
disclosures.
Also, if you were previously on the security mailing list and did not see
these emails, please reach out to the team — some subscribers were lost and
not recovered in the lab disaster at the end of last year. (For obvious reasons
this is a closed list — if you do not work for a Linux distribution or at a
large deployer with established relationships in Ceph and security
communities, it’s hard for us to put you there.)
-Greg
Hey,
Has anyone else had issues with exploring Loki after deploying ceph
monitoring services
<https://docs.ceph.com/en/latest/cephadm/services/monitoring/>?
I'm running 17.2.6.
When clicking through to the daemon logs in the Ceph dashboard (i.e.
Cluster -> Logs -> Daemon Logs), I just got an embedded Grafana
dashboard called "Dashboard1" with no 'Explore' option, so it's not
working for me.
I found a workaround by enabling viewer role edit permissions. So I added
viewers_can_edit = true
to my grafana.ini. After I fixed this, the 'Explore' button appeared in my
Grafana dashboard and I could explore the log files.
If you've hit the same problem and have a better solution, please let me
know.
For anyone who has the same problem and wants more details of how I fixed
it, here is what I did:
From my cephadm shell:
ceph config-key get mgr/cephadm/services/grafana/grafana.ini >
/tmp/grafana.ini
Edit /tmp/grafana.ini and add the viewers_can_edit line shown below.
# {{ cephadm_managed }}
# Source
/usr/share/ceph/mgr/cephadm/templates/services/grafana/grafana.ini.j2
[users]
default_theme = light
viewers_can_edit = true
...
Then update the config:
ceph config-key set mgr/cephadm/services/grafana/grafana.ini -i
/tmp/grafana.ini
Then reconfig and restart Grafana:
ceph orch reconfig grafana
ceph orch restart grafana
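To confirm the change took effect, something like this should show the
new line:

ceph config-key get mgr/cephadm/services/grafana/grafana.ini | grep viewers_can_edit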
Cheers,
Tom