(sorry for duplicate emails)
This turns out to be a good question, actually.
The cluster is running Quincy, 17.2.6.
The compute node that is running the VM is Proxmox, version 7.4-3.
Supposedly this is fairly new, but librbd1 claims to be version 14.2.21
when I check with "apt list". We are not using Proxmox's own Ceph
cluster release. We haven't had any issues with this setup before, but
until now we had neither used erasure-coded pools nor had a node
half-dead for such a long time.
The VM is configured through Proxmox, which is not libvirt but similar,
and krbd is not enabled. I don't know for sure whether Proxmox links its
own librbd into qemu/kvm.
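In case it helps, this is roughly how one can check which librbd the
running QEMU process has actually mapped; a minimal sketch, assuming a
Debian-based Proxmox node where the emulator binary is
/usr/bin/qemu-system-x86_64 and the VM process is named "kvm":

# packaged librbd version
apt list --installed 2>/dev/null | grep librbd1
# is the qemu binary linked against librbd at all?
ldd /usr/bin/qemu-system-x86_64 | grep -i rbd
# which librbd .so the running VM process actually has mapped
lsof -p "$(pidof -s kvm)" 2>/dev/null | grep -i librbd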
"ceph features" looks like this:
{
    "mon": [
        {
            "features": "0x3f01cfbf7ffdffff",
            "release": "luminous",
            "num": 5
        }
    ],
    "osd": [
        {
            "features": "0x3f01cfbf7ffdffff",
            "release": "luminous",
            "num": 24
        }
    ],
    "client": [
        {
            "features": "0x3f01cfb87fecffff",
            "release": "luminous",
            "num": 4
        },
        {
            "features": "0x3f01cfbf7ffdffff",
            "release": "luminous",
            "num": 12
        }
    ],
    "mgr": [
        {
            "features": "0x3f01cfbf7ffdffff",
            "release": "luminous",
            "num": 2
        }
    ]
}
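Since only the one image misbehaves, the next thing I plan to check is
whether it has stale watchers or locks; a minimal sketch, with
placeholder pool/image names:

rbd status <pool>/<image>    # current watchers on the image
rbd lock ls <pool>/<image>   # any advisory locks still held
rbd info <pool>/<image>      # features and the EC data pool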
Regards,
Peter
On 2023-09-29 at 17:55, Anthony D'Atri wrote:
> Which Ceph releases are installed on the VM and the back end? Is the VM using librbd through libvirt, or krbd?
>
>> On Sep 29, 2023, at 09:09, Peter Linder <peter.linder(a)fiberdirekt.se> wrote:
>>
>> Dear all,
>>
>> I have a problem where, after an OSD host lost connection to the sync/cluster rear network for many hours (the public network stayed online), a test VM using RBD can't overwrite its files. I can create a new file inside it just fine, but not overwrite one; the process just hangs.
>>
>> The VM's disk is on an erasure-coded data pool with a replicated pool in front of it. EC overwrites are enabled for the pool.
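>> For reference, this is the usual pattern for such a setup; a sketch
>> with placeholder pool/image names:
>>
>> ceph osd pool set <ec-pool> allow_ec_overwrites true
>> rbd create --size 100G --data-pool <ec-pool> <replicated-pool>/<image>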
>>
>> The cluster consists of 5 hosts with 4 OSDs each, plus separate hosts for compute. There are separate public and cluster networks. In this case, the AOC cable to the cluster network went link-down on a host; the cable had to be replaced and the host was rebooted. Recovery took about a week to complete. The host was half-down like this for about 12 hours.
>>
>> I have some other VMs with images in the same pool (4 in total), and they seem to work fine; it is just this one that can't overwrite.
>>
>> I'm thinking something is somehow wrong with just this image?
>>
>> Regards,
>>
>> Peter
Hi Ceph users and developers,
We invite you to join us at the User + Dev Relaunch, happening this
Thursday at 10:00 AM EDT! See below for more meeting details. Also see this
blog post to read more about the relaunch:
https://ceph.io/en/news/blog/2023/user-dev-meeting-relaunch/
We have two guest speakers who will present their focus topics during the
first 40 minutes of the session:
1. "What to do when Ceph isn't Ceph-ing" by Cory Snyder
Topics include troubleshooting tips, effective ways to gather help
from the community, ways to improve cluster health and insights, and more!
2. "Ceph Usability Improvements" by Jonas Sterr
A continuation of a talk from Cephalocon 2023, updated after trying
out the Reef Dashboard.
The last 20 minutes of the meeting will be dedicated to open discussion.
Feel free to add questions for the speakers or additional topics under the
"Open Discussion" section on the agenda:
https://pad.ceph.com/p/ceph-user-dev-monthly-minutes
If you have an idea for a focus topic you'd like to present at a future
meeting, you are welcome to submit it to this Google Form:
https://docs.google.com/forms/d/e/1FAIpQLSdboBhxVoBZoaHm8xSmeBoemuXoV_rmh4v…
Any Ceph user or developer is eligible to submit!
Thanks,
Laura Flores
Meeting link: https://meet.jit.si/ceph-user-dev-monthly
Time conversions:
UTC: Thursday, September 21, 14:00 UTC
Mountain View, CA, US: Thursday, September 21, 7:00 PDT
Phoenix, AZ, US: Thursday, September 21, 7:00 MST
Denver, CO, US: Thursday, September 21, 8:00 MDT
Huntsville, AL, US: Thursday, September 21, 9:00 CDT
Raleigh, NC, US: Thursday, September 21, 10:00 EDT
London, England: Thursday, September 21, 15:00 BST
Paris, France: Thursday, September 21, 16:00 CEST
Helsinki, Finland: Thursday, September 21, 17:00 EEST
Tel Aviv, Israel: Thursday, September 21, 17:00 IDT
Pune, India: Thursday, September 21, 19:30 IST
Brisbane, Australia: Friday, September 22, 0:00 AEST
Singapore, Asia: Thursday, September 21, 22:00 +08
Auckland, New Zealand: Friday, September 22, 2:00 NZST
--
Laura Flores
She/Her/Hers
Software Engineer, Ceph Storage <https://ceph.io>
Chicago, IL
lflores(a)ibm.com | lflores(a)redhat.com <lflores(a)redhat.com>
M: +17087388804
Since upgrading to 18.2.0, OSDs are restarting very frequently due to liveness probe failures, making the cluster unusable. Has anyone else seen this behavior?
Upgrade path: Ceph 17.2.6 to 18.2.0 (and Rook from 1.11.9 to 1.12.1),
on Ubuntu 20.04, kernel 5.15.0-79-generic.
Thanks.
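In case it helps with triage, this is roughly how we have been looking
at the failures; a sketch assuming a default Rook install in the
rook-ceph namespace, with placeholder pod names:

# probe definition and recent probe failures for one OSD pod
kubectl -n rook-ceph describe pod <rook-ceph-osd-pod> | grep -A5 Liveness
# kubelet events around the restarts (Unhealthy/Killing entries)
kubectl -n rook-ceph get events --sort-by=.lastTimestamp | grep -i liveness
# OSD log from the container instance that was just killed
kubectl -n rook-ceph logs <rook-ceph-osd-pod> --previous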
Hi team,
*Ceph-version*: Quincy, Reef
*OS*: AlmaLinux 8
*Issue*: snap_schedule doesn't create the scheduled snapshots consistently.
*Description:*
We are currently working with a 3-node Ceph cluster.
We are currently exploring the scheduled snapshot capability of the
ceph-mgr module.
To enable/configure scheduled snapshots, we followed this link:
https://docs.ceph.com/en/quincy/cephfs/snap-schedule/
Using this we were able to schedule a snapshot for the subvolume which we
created.
Initially, it was not working but after two days it started working.
Later, on a fresh cluster, we again tried to schedule a snapshot for a
subvolume and it worked fine.
But the next time we tried snapshot creation on a fresh cluster, it
didn't work.
We have observed this inconsistency in scheduling snapshots for a long
time now.
Every time we follow the exact same steps to add a snap_schedule, but
sometimes it works and sometimes it doesn't.
We have chronyd running and the timezone set to UTC.
Could you please help us out with this?
Kindly let me know if you require any kind of logs for this.
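For reference, these are the checks we run after adding a schedule; the
subvolume path below is just an example from our setup:

# confirm the schedule exists and is active for the subvolume path
ceph fs snap-schedule list /volumes/_nogroup/subvol1 --recursive=true
ceph fs snap-schedule status /volumes/_nogroup/subvol1
# re-activate in case the schedule was deactivated
ceph fs snap-schedule activate /volumes/_nogroup/subvol1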
Thanks and Regards,
Kushagra Gupta
Hi Team,
Any update on this?
Thanks and Regards,
Kushagra Gupta
On Tue, Sep 5, 2023 at 10:51 AM Kushagr Gupta <kushagrguptasps.mun(a)gmail.com>
wrote:
> *Ceph-version*: Quincy
> *OS*: CentOS 8 Stream
>
> *Issue*: Not able to find a standardized restoration procedure for
> subvolume snapshots.
>
> *Description:*
> Hi team,
>
> We are currently working in a 3-node ceph cluster.
> We are currently exploring the scheduled snapshot capability of the
> ceph-mgr module.
> To enable/configure scheduled snapshots, we followed the following link:
>
> https://docs.ceph.com/en/quincy/cephfs/snap-schedule/
>
> The scheduled snapshots are working as expected. But we are unable to find
> any standardized restoration procedure for the same.
>
> We have found the following link( not official documentation):
> https://www.suse.com/support/kb/doc/?id=000019627
>
> We have also found a link of cloning a new subvolume from snapshots:
> https://docs.ceph.com/en/reef/cephfs/fs-volumes/
> (Section: Cloning Snapshots)
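>
> As a concrete example of that cloning approach, this is the kind of
> command we mean; names in angle brackets are placeholders, syntax as
> we read it from the fs-volumes documentation:
>
> ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name>
> ceph fs clone status <vol_name> <target_subvol_name>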
>
> Is there a standard procedure to restore from a snapshot?
> By this I mean, is there some kind of command, maybe like
> ceph fs subvolume snapshot restore <snapshot-name>
>
> Or any other procedure please let us know.
>
> Thanks and Regards,
> Kushagra Gupta
>
Fellow cephalopods,
I'm trying to get quick, seamless NFS failover happening on my four-node
Ceph cluster.
I followed the instructions here:
https://docs.ceph.com/en/latest/cephadm/services/nfs/#high-availability-nfs
but testing shows that failover doesn't happen. When I placed node 2
("san2") in maintenance mode, the NFS service shut down:
Aug 24 14:19:03 san2 ceph-e2f1b934-ed43-11ec-80fa-04421a1a1d66-nfs-xcpnfs-1-0-san2-datsvq[1962479]: 24/08/2023 04:19:03 : epoch 64b8af5a : san2 : ganesha.nfsd-8[Admin] do_shutdown :MAIN :EVENT :Removing all exports.
Aug 24 14:19:13 san2 bash[3235994]: time="2023-08-24T14:19:13+10:00" level=warning msg="StopSignal SIGTERM failed to stop container ceph-e2f1b934-ed43-11ec-80fa-04421a1a1d66-nfs-xcpnfs-1-0-san2-datsvq in 10 seconds, resorting to SIGKILL"
Aug 24 14:19:13 san2 bash[3235994]: ceph-e2f1b934-ed43-11ec-80fa-04421a1a1d66-nfs-xcpnfs-1-0-san2-datsvq
Aug 24 14:19:13 san2 systemd[1]: ceph-e2f1b934-ed43-11ec-80fa-04421a1a1d66@nfs.xcpnfs.1.0.san2.datsvq.service: Main process exited, code=exited, status=137/n/a
Aug 24 14:19:14 san2 systemd[1]: ceph-e2f1b934-ed43-11ec-80fa-04421a1a1d66@nfs.xcpnfs.1.0.san2.datsvq.service: Failed with result 'exit-code'.
Aug 24 14:19:14 san2 systemd[1]: Stopped Ceph nfs.xcpnfs.1.0.san2.datsvq for e2f1b934-ed43-11ec-80fa-04421a1a1d66.
And that's it. The ingress IP didn't move.
Odder still, the cluster seems to have placed the ingress IP on node 1
(san1) while still directing traffic at the NFS service on node 2.
Do I need to more tightly connect the NFS service to the keepalived and
haproxy services, or do I need to expand the ingress service to refer
to multiple NFS services?
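For reference, this is how I have been checking where cephadm placed
the daemons; service names below are from my cluster:

ceph orch ls nfs       # where the NFS (ganesha) daemons run
ceph orch ls ingress   # what the ingress (haproxy/keepalived) spec covers
ceph orch ps | grep -e haproxy -e keepalived -e nfs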
Thank you.
--
Regards,
Thorne Lawler - Senior System Administrator
*DDNS* | ABN 76 088 607 265
First registrar certified ISO 27001-2013 Data Security Standard ITGOV40172
P +61 499 449 170
Hey Ceph-users,
I just noticed there is a post to oss-security
(https://www.openwall.com/lists/oss-security/2023/09/26/10) about a
security issue with Ceph RGW.
It is signed by IBM / Red Hat and includes a patch by DO.
I also raised an issue on the tracker
(https://tracker.ceph.com/issues/63004) about this, as I could not find
one yet.
It seems a weird way of disclosing such a thing, and I am wondering if
anybody knows any more about this?
Regards
Christian
Hi everybody,
The CLT met today as usual. We only had a few topics under discussion:
* the User + Dev relaunch went off well! We’d like reliable recordings and
have found Jitsi to be somewhat glitchy; Laura will communicate about
workarounds for that while we work on a longer-term solution (self-hosting
Jitsi has a better reputation and is a possibility). We also discussed a
GitHub repo for hosting presentation files, and organizing them on the
website.
* CVE handling. As noted elsewhere on the mailing list, CVE-2023-43040 (a
privilege escalation impacting RGW) was disclosed elsewhere, and we do not
have coordinated releases for it. This was not deemed important enough on
the security list for that effort, but we do want to be more prepared for
it than we were — our CVE handling process has broken down a bit since some
of the CVE work is now being handled by IBM instead of Red Hat. Tech leads
and IBM employees will be working on refining that so we have better
disclosures.
Also, if you were previously on the security mailing list and did not see
these emails, please reach out to the team — some subscribers were lost and
not recovered in the lab disaster at the end of last year. (For obvious reasons
this is a closed list — if you do not work for a Linux distribution or at a
large deployer with established relationships in Ceph and security
communities, it’s hard for us to put you there.)
-Greg
Hey,
Has anyone else had issues with exploring Loki after deploying ceph
monitoring services
<https://docs.ceph.com/en/latest/cephadm/services/monitoring/>?
I'm running 17.2.6.
When clicking through to the daemon logs in the Ceph dashboard (i.e.
Cluster -> Logs -> Daemon Logs), I just got an embedded Grafana
dashboard called "Dashboard1" with no 'Explore' option, so it's not
working for me.
I found a workaround by enabling viewer role edit permissions. So I added
viewers_can_edit = true
to my grafana.ini. After I fixed this, the 'Explore' button appeared in my
Grafana dashboard and I could explore the log files.
If you've hit the same problem and have a better solution, please let me
know.
For anyone who has the same problem and wants more details of how I fixed
it, here is what I did:
From my cephadm shell:
ceph config-key get mgr/cephadm/services/grafana/grafana.ini >
/tmp/grafana.ini
Edit /tmp/grafana.ini and add the viewers_can_edit line shown below.
# {{ cephadm_managed }}
# Source
/usr/share/ceph/mgr/cephadm/templates/services/grafana/grafana.ini.j2
[users]
default_theme = light
viewers_can_edit = true
...
Then update the config:
ceph config-key set mgr/cephadm/services/grafana/grafana.ini -i
/tmp/grafana.ini
Then reconfig and restart Grafana:
ceph orch reconfig grafana
ceph orch restart grafana
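To confirm the change took effect, something like this should show the
new line:

ceph config-key get mgr/cephadm/services/grafana/grafana.ini | grep viewers_can_edit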
Cheers,
Tom