Hi,
I'm really lost with my Ceph cluster. I built a small cluster for home
usage which has two purposes for me: I want to replace an old NAS, and I
want to learn about Ceph so that I have hands-on experience. We're using
it in our company, but I need some real-life experience without risking
any company or customer data. That's my preferred way of learning.
The cluster consists of 3 Raspberry Pis plus a few VMs running on
Proxmox. I'm not using Proxmox's built-in Ceph because I want to focus
on Ceph itself and not just use it as a preconfigured tool.
All hosts are running Fedora (x86_64 and arm64), and during an upgrade
from F36 to F37 my cluster suddenly showed all PGs as unavailable. I
worked for nearly a week to get it back online and learned a lot about
Ceph management and recovery. The cluster is back, but I still can't
access my data. Maybe you can help me?
Here are my versions:
[ceph: root@ceph04 /]# ceph versions
{
    "mon": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
    },
    "mgr": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
    },
    "osd": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 5
    },
    "mds": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 4
    },
    "overall": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 15
    }
}
Here's the status output of one MDS:
[ceph: root@ceph04 /]# ceph tell mds.mds01.ceph05.pqxmvt status
2023-01-14T15:30:28.607+0000 7fb9e17fa700  0 client.60986454 ms_handle_reset on v2:192.168.23.65:6800/2680651694
2023-01-14T15:30:28.640+0000 7fb9e17fa700  0 client.60986460 ms_handle_reset on v2:192.168.23.65:6800/2680651694
{
    "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4",
    "whoami": 0,
    "id": 60984167,
    "want_state": "up:replay",
    "state": "up:replay",
    "fs_name": "cephfs",
    "replay_status": {
        "journal_read_pos": 0,
        "journal_write_pos": 0,
        "journal_expire_pos": 0,
        "num_events": 0,
        "num_segments": 0
    },
    "rank_uptime": 1127.54018615,
    "mdsmap_epoch": 98056,
    "osdmap_epoch": 12362,
    "osdmap_epoch_barrier": 0,
    "uptime": 1127.957307273
}
It's been staying like that for days now. If there was a counter moving,
I would just wait, but nothing changes, and all the stats say the MDS
daemons aren't doing any work at all.
The symptom I have is that the Dashboard and all the other tools I use
say it's more or less OK (some old messages about failed daemons and
scrubbing aside). But I can't mount anything. When I try to start a VM
whose disk is on RBD, I just get a timeout. And when I try to mount a
CephFS, mount just hangs forever.
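For reference, my mount attempts look roughly like this (the monitor
address and credentials here are illustrative placeholders, not my exact
invocation):

```shell
# Typical CephFS kernel mount attempt (monitor address, user name and
# secret file are placeholders for illustration).
mount -t ceph 192.168.23.65:6789:/ /mnt/cephfs \
    -o name=admin,secretfile=/etc/ceph/admin.secret

# With the MDS stuck in up:replay, this never returns an error --
# it simply hangs until interrupted.
```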
Whatever command I give the MDS or the journal just hangs. The only
thing I could do was take all CephFS filesystems offline, kill the MDS
daemons, and run "ceph fs reset <fs name> --yes-i-really-mean-it". After
that I rebooted all nodes, just to be sure, but I still have no access
to my data.
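The sequence I ran was roughly the following (with <fs name> standing in
for my actual filesystem name; the exact way I stopped the daemons
varied per host):

```shell
# Take the filesystem offline so no MDS tries to pick up a rank.
ceph fs fail <fs name>

# Then stop/kill the MDS daemons on each host (via systemd in my case).

# Last resort: reset the filesystem state. This discards in-flight
# journal state, which is why it demands explicit confirmation.
ceph fs reset <fs name> --yes-i-really-mean-it
```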
Could you please help me? I'm kinda desperate. If you need any more
information, just let me know.
Cheers,
Thomas
--
Thomas Widhalm
Lead Systems Engineer
NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg
Tel: +49 911 92885-0 | Fax: +49 911 92885-77
CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510
https://www.netways.de | thomas.widhalm(a)netways.de