Hi,
I'm really lost with my Ceph cluster. I built a small cluster for home
usage which has two purposes for me: I want to replace an old NAS, and I
want to learn about Ceph so that I have hands-on experience. We're using
it in our company, but I need some real-life experience without risking
any company or customer data. That's my preferred way of learning.
The cluster consists of 3 Raspberry Pis plus a few VMs running on
Proxmox. I'm not using Proxmox's built-in Ceph because I want to focus
on Ceph itself and not just use it as a preconfigured tool.
All hosts are running Fedora (x86_64 and arm64), and during an upgrade
from F36 to F37 my cluster suddenly showed all PGs as unavailable. I
worked for nearly a week to get it back online and learned a lot about
Ceph management and recovery. The cluster is back, but I still can't
access my data. Maybe you can help me?
Here are my versions:
[ceph: root@ceph04 /]# ceph versions
{
    "mon": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
    },
    "mgr": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
    },
    "osd": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 5
    },
    "mds": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 4
    },
    "overall": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 15
    }
}
Here's the status output of one MDS:
[ceph: root@ceph04 /]# ceph tell mds.mds01.ceph05.pqxmvt status
2023-01-14T15:30:28.607+0000 7fb9e17fa700  0 client.60986454 ms_handle_reset on v2:192.168.23.65:6800/2680651694
2023-01-14T15:30:28.640+0000 7fb9e17fa700  0 client.60986460 ms_handle_reset on v2:192.168.23.65:6800/2680651694
{
    "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4",
    "whoami": 0,
    "id": 60984167,
    "want_state": "up:replay",
    "state": "up:replay",
    "fs_name": "cephfs",
    "replay_status": {
        "journal_read_pos": 0,
        "journal_write_pos": 0,
        "journal_expire_pos": 0,
        "num_events": 0,
        "num_segments": 0
    },
    "rank_uptime": 1127.54018615,
    "mdsmap_epoch": 98056,
    "osdmap_epoch": 12362,
    "osdmap_epoch_barrier": 0,
    "uptime": 1127.957307273
}
It's been staying like that for days now. If there was a counter moving,
I would just wait, but nothing changes, and all the stats say the MDS
daemons aren't doing any work at all.
The symptom I have is that the Dashboard and all the other tools I use
say it's more or less OK (some old messages about failed daemons and
scrubbing aside). But I can't mount anything. When I try to start a VM
whose disk is on RBD, I just get a timeout. And when I try to mount a
CephFS, mount just hangs forever.
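For reference, my mount attempts look roughly like this (the monitor
address and credentials here are illustrative placeholders, not my exact
invocation):

```shell
# Typical CephFS kernel mount attempt (monitor address, user name and
# secret file are placeholders for illustration).
mount -t ceph 192.168.23.65:6789:/ /mnt/cephfs \
    -o name=admin,secretfile=/etc/ceph/admin.secret

# With the MDS stuck in up:replay, this never returns an error --
# it simply hangs until interrupted.
```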
Whatever command I give the MDS or the journal just hangs. The only
thing I could do was take all CephFS filesystems offline, kill the MDS
daemons, and run "ceph fs reset <fs name> --yes-i-really-mean-it". After
that I rebooted all nodes, just to be sure, but I still have no access
to my data.
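The sequence I ran was roughly the following (with <fs name> standing in
for my actual filesystem name; the exact way I stopped the daemons
varied per host):

```shell
# Take the filesystem offline so no MDS tries to pick up a rank.
ceph fs fail <fs name>

# Then stop/kill the MDS daemons on each host (via systemd in my case).

# Last resort: reset the filesystem state. This discards in-flight
# journal state, which is why it demands explicit confirmation.
ceph fs reset <fs name> --yes-i-really-mean-it
```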
Could you please help me? I'm kinda desperate. If you need any more
information, just let me know.
Cheers,
Thomas
--
Thomas Widhalm
Lead Systems Engineer
NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg
Tel: +49 911 92885-0 | Fax: +49 911 92885-77
CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510
https://www.netways.de | thomas.widhalm(a)netways.de