MDS stuck in "up:replay"

List overview All Threads
Download

newer

older

Daily failed capability releases,...

kernel client osdc ops stuck and...

Thomas Widhalm

14 Jan 2023 14 Jan '23

7:37 a.m.

Hi, I'm really lost with my Ceph system. I built a small cluster for home usage which has two uses for me: I want to replace an old NAS and I want to learn about Ceph so that I have hands-on experience. We're using it in our company but I need some real-life experience without risking any company or customers data. That's my preferred way of learning. The cluster consists of 3 Raspberry Pis plus a few VMs running on Proxmox. I'm not using Proxmox' built in Ceph because I want to focus on Ceph and not just use it as a preconfigured tool. All hosts are running Fedora (x86_64 and arm64) and during an Upgrade from F36 to F37 my cluster suddenly showed all PGs as unavailable. I worked nearly a week to get it back online and I learned a lot about Ceph management and recovery. The cluster is back but I still can't access my data. Maybe you can help me? Here are my versions: [ceph: root@ceph04 /]# ceph versions { "mon": { "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3 }, "mgr": { "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3 }, "osd": { "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 5 }, "mds": { "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 4 }, "overall": { "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 15 } } Here's MDS status output of one MDS: [ceph: root@ceph04 /]# ceph tell mds.mds01.ceph05.pqxmvt status 2023-01-14T15:30:28.607+0000 7fb9e17fa700 0 client.60986454 ms_handle_reset on v2:192.168.23.65:6800/2680651694 2023-01-14T15:30:28.640+0000 7fb9e17fa700 0 client.60986460 ms_handle_reset on v2:192.168.23.65:6800/2680651694 { "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", "whoami": 0, "id": 60984167, "want_state": "up:replay", "state": "up:replay", "fs_name": "cephfs", "replay_status": { "journal_read_pos": 0, "journal_write_pos": 0, "journal_expire_pos": 0, "num_events": 0, "num_segments": 0 }, "rank_uptime": 1127.54018615, "mdsmap_epoch": 98056, "osdmap_epoch": 12362, "osdmap_epoch_barrier": 0, "uptime": 1127.957307273 } It's staying like that for days now. If there was a counter moving, I just would wait but it doesn't change anything and alle stats says, the MDS aren't working at all. The symptom I have is that Dashboard and all other tools I use say, it's more or less ok. (Some old messages about failed daemons and scrubbing aside). But I can't mount anything. When I try to start a VM that's on RDS I just get a timeout. And when I try to mount a CephFS, mount just hangs forever. Whatever command I give MDS or journal, it just hangs. The only thing I could do, was take all CephFS offline, kill the MDS's and do a "ceph fs reset <fs name> --yes-i-really-mean-it". After that I rebooted all nodes, just to be sure but I still have no access to data. Could you please help me? I'm kinda desperate. If you need any more information, just let me know. Cheers, Thomas -- Thomas Widhalm Lead Systems Engineer NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg Tel: +49 911 92885-0 | Fax: +49 911 92885-77 CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 https://www.netways.de | thomas.widhalm(a)netways.de ** stackconf 2023 - September - https://stackconf.eu ** ** OSMC 2023 - November - https://osmc.de ** ** New at NWS: Managed Database - https://nws.netways.de/managed-database ** ** NETWAYS Web Services - https://nws.netways.de **

Show replies by date

Kotresh Hiremath Ravishankar

15 Jan 15 Jan

9:01 p.m.

Hi Thomas, As the documentation says, the MDS enters up:resolve from up:replay if the Ceph file system has multiple ranks (including this one), i.e. it’s not a single active MDS cluster. The MDS is resolving any uncommitted inter-MDS operations. All ranks in the file system must be in this state or later for progress to be made, i.e. no rank can be failed/damaged or up:replay. So please check the status of the other active mds if it's failed. Also please share the mds logs and the output of 'ceph fs dump' and 'ceph fs status' Thanks, Kotresh H R On Sat, Jan 14, 2023 at 9:07 PM Thomas Widhalm <thomas.widhalm(a)netways.de> wrote:

...

Thomas Widhalm

16 Jan 16 Jan

1:08 a.m.

Hi Kotresh, Thanks for your reply! I only have one rank. Here's the output of all MDS I have: ################### [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph05.pqxmvt status 2023-01-16T08:55:26.055+0000 7f3412ffd700 0 client.61249926 ms_handle_reset on v2:192.168.23.65:6800/2680651694 2023-01-16T08:55:26.084+0000 7f3412ffd700 0 client.61299199 ms_handle_reset on v2:192.168.23.65:6800/2680651694 { "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", "whoami": 0, "id": 60984167, "want_state": "up:replay", "state": "up:replay", "fs_name": "cephfs", "replay_status": { "journal_read_pos": 0, "journal_write_pos": 0, "journal_expire_pos": 0, "num_events": 0, "num_segments": 0 }, "rank_uptime": 150224.982558844, "mdsmap_epoch": 143757, "osdmap_epoch": 12395, "osdmap_epoch_barrier": 0, "uptime": 150225.39968057699 } ######################## [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph04.cvdhsx status 2023-01-16T08:59:05.434+0000 7fdb82ff5700 0 client.61299598 ms_handle_reset on v2:192.168.23.64:6800/3930607515 2023-01-16T08:59:05.466+0000 7fdb82ff5700 0 client.61299604 ms_handle_reset on v2:192.168.23.64:6800/3930607515 { "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", "whoami": 0, "id": 60984134, "want_state": "up:replay", "state": "up:replay", "fs_name": "cephfs_insecure", "replay_status": { "journal_read_pos": 0, "journal_write_pos": 0, "journal_expire_pos": 0, "num_events": 0, "num_segments": 0 }, "rank_uptime": 150450.96934037199, "mdsmap_epoch": 143815, "osdmap_epoch": 12395, "osdmap_epoch_barrier": 0, "uptime": 150451.93533502301 } ########################### [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph06.wcfdom status 2023-01-16T08:59:28.572+0000 7f16538c0b80 -1 client.61250376 resolve_mds: no MDS daemons found by name `mds01.ceph06.wcfdom' 2023-01-16T08:59:28.583+0000 7f16538c0b80 -1 client.61250376 FSMap: cephfs:1/1 cephfs_insecure:1/1 {cephfs:0=mds01.ceph05.pqxmvt=up:replay,cephfs_insecure:0=mds01.ceph04.cvdhsx=up:replay} 2 up:standby Error ENOENT: problem getting command descriptions from mds.mds01.ceph06.wcfdom ############################ [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph07.omdisd status 2023-01-16T09:00:02.802+0000 7fb7affff700 0 client.61250454 ms_handle_reset on v2:192.168.23.67:6800/942898192 2023-01-16T09:00:02.831+0000 7fb7affff700 0 client.61299751 ms_handle_reset on v2:192.168.23.67:6800/942898192 { "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", "whoami": -1, "id": 60984161, "want_state": "up:standby", "state": "up:standby", "mdsmap_epoch": 97687, "osdmap_epoch": 0, "osdmap_epoch_barrier": 0, "uptime": 150508.29091721401 } The error message from ceph06 is new to me. That didn't happen the last times. [ceph: root@ceph06 /]# ceph fs dump e143850 enable_multiple, ever_enabled_multiple: 1,1 default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2} legacy client fscid: 2 Filesystem 'cephfs' (2) fs_name cephfs epoch 143850 flags 12 joinable allow_snaps allow_multimds_snaps created 2023-01-14T14:30:05.723421+0000 modified 2023-01-16T09:00:53.663007+0000 tableserver 0 root 0 session_timeout 60 session_autoclose 300 max_file_size 1099511627776 required_client_features {} last_failure 0 last_failure_osd_epoch 12321 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2} max_mds 1 in 0 up {0=60984167} failed damaged stopped data_pools [4] metadata_pool 5 inline_data disabled balancer standby_count_wanted 1 [mds.mds01.ceph05.pqxmvt{0:60984167} state up:replay seq 37637 addr [v2:192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694] compat {c=[1],r=[1],i=[7ff]}] Filesystem 'cephfs_insecure' (3) fs_name cephfs_insecure epoch 143849 flags 12 joinable allow_snaps allow_multimds_snaps created 2023-01-14T14:22:46.360062+0000 modified 2023-01-16T09:00:52.632163+0000 tableserver 0 root 0 session_timeout 60 session_autoclose 300 max_file_size 1099511627776 required_client_features {} last_failure 0 last_failure_osd_epoch 12319 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2} max_mds 1 in 0 up {0=60984134} failed damaged stopped data_pools [7] metadata_pool 6 inline_data disabled balancer standby_count_wanted 1 [mds.mds01.ceph04.cvdhsx{0:60984134} state up:replay seq 37639 addr [v2:192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515] compat {c=[1],r=[1],i=[7ff]}] Standby daemons: [mds.mds01.ceph07.omdisd{-1:60984161} state up:standby seq 2 addr [v2:192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192] compat {c=[1],r=[1],i=[7ff]}] [mds.mds01.ceph06.hsuhqd{-1:60984828} state up:standby seq 1 addr [v2:192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518] compat {c=[1],r=[1],i=[7ff]}] dumped fsmap epoch 143850 ############################# [ceph: root@ceph06 /]# ceph fs status (doesn't come back) ############################# All MDS show log lines similar to this one: Jan 16 10:05:00 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143927 from mon.1 Jan 16 10:05:05 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143929 from mon.1 Jan 16 10:05:09 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143930 from mon.1 Jan 16 10:05:13 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143931 from mon.1 Jan 16 10:05:20 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143933 from mon.1 Jan 16 10:05:24 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143935 from mon.1 Jan 16 10:05:29 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143936 from mon.1 Jan 16 10:05:33 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143937 from mon.1 Jan 16 10:05:40 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143939 from mon.1 Jan 16 10:05:44 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143941 from mon.1 Jan 16 10:05:49 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143942 from mon.1 Anything else, I can provide? Cheers and thanks again! Thomas On 16.01.23 06:01, Kotresh Hiremath Ravishankar wrote:

...

Hi Thomas, As the documentation says, the MDS enters up:resolve from |up:replay| if the Ceph file system has multiple ranks (including this one), i.e. it’s not a single active MDS cluster. The MDS is resolving any uncommitted inter-MDS operations. All ranks in the file system must be in this state or later for progress to be made, i.e. no rank can be failed/damaged or |up:replay|. So please check the status of the other active mds if it's failed. Also please share the mds logs and the output of 'ceph fs dump' and 'ceph fs status' Thanks, Kotresh H R On Sat, Jan 14, 2023 at 9:07 PM Thomas Widhalm <thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de>> wrote: Hi, I'm really lost with my Ceph system. I built a small cluster for home usage which has two uses for me: I want to replace an old NAS and I want to learn about Ceph so that I have hands-on experience. We're using it in our company but I need some real-life experience without risking any company or customers data. That's my preferred way of learning. The cluster consists of 3 Raspberry Pis plus a few VMs running on Proxmox. I'm not using Proxmox' built in Ceph because I want to focus on Ceph and not just use it as a preconfigured tool. All hosts are running Fedora (x86_64 and arm64) and during an Upgrade from F36 to F37 my cluster suddenly showed all PGs as unavailable. I worked nearly a week to get it back online and I learned a lot about Ceph management and recovery. The cluster is back but I still can't access my data. Maybe you can help me? Here are my versions: [ceph: root@ceph04 /]# ceph versions { "mon": { "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3 }, "mgr": { "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3 }, "osd": { "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 5 }, "mds": { "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 4 }, "overall": { "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 15 } } Here's MDS status output of one MDS: [ceph: root@ceph04 /]# ceph tell mds.mds01.ceph05.pqxmvt status 2023-01-14T15:30:28.607+0000 7fb9e17fa700 0 client.60986454 ms_handle_reset on v2:192.168.23.65:6800/2680651694 <http://192.168.23.65:6800/2680651694> 2023-01-14T15:30:28.640+0000 7fb9e17fa700 0 client.60986460 ms_handle_reset on v2:192.168.23.65:6800/2680651694 <http://192.168.23.65:6800/2680651694> { "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", "whoami": 0, "id": 60984167, "want_state": "up:replay", "state": "up:replay", "fs_name": "cephfs", "replay_status": { "journal_read_pos": 0, "journal_write_pos": 0, "journal_expire_pos": 0, "num_events": 0, "num_segments": 0 }, "rank_uptime": 1127.54018615, "mdsmap_epoch": 98056, "osdmap_epoch": 12362, "osdmap_epoch_barrier": 0, "uptime": 1127.957307273 } It's staying like that for days now. If there was a counter moving, I just would wait but it doesn't change anything and alle stats says, the MDS aren't working at all. The symptom I have is that Dashboard and all other tools I use say, it's more or less ok. (Some old messages about failed daemons and scrubbing aside). But I can't mount anything. When I try to start a VM that's on RDS I just get a timeout. And when I try to mount a CephFS, mount just hangs forever. Whatever command I give MDS or journal, it just hangs. The only thing I could do, was take all CephFS offline, kill the MDS's and do a "ceph fs reset <fs name> --yes-i-really-mean-it". After that I rebooted all nodes, just to be sure but I still have no access to data. Could you please help me? I'm kinda desperate. If you need any more information, just let me know. Cheers, Thomas -- Thomas Widhalm Lead Systems Engineer NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg Tel: +49 911 92885-0 | Fax: +49 911 92885-77 CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 https://www.netways.de <https://www.netways.de> | thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> ** stackconf 2023 - September - https://stackconf.eu <https://stackconf.eu> ** ** OSMC 2023 - November - https://osmc.de <https://osmc.de> ** ** New at NWS: Managed Database - https://nws.netways.de/managed-database <https://nws.netways.de/managed-database> ** ** NETWAYS Web Services - https://nws.netways.de <https://nws.netways.de> ** _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io <mailto:ceph-users@ceph.io> To unsubscribe send an email to ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io>

-- Thomas Widhalm Lead Systems Engineer NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg Tel: +49 911 92885-0 | Fax: +49 911 92885-77 CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 https://www.netways.de | thomas.widhalm(a)netways.de ** stackconf 2023 - September - https://stackconf.eu ** ** OSMC 2023 - November - https://osmc.de ** ** New at NWS: Managed Database - https://nws.netways.de/managed-database ** ** NETWAYS Web Services - https://nws.netways.de **

Kotresh Hiremath Ravishankar

9:27 p.m.

...

want

to learn about Ceph so that I have hands-on experience. We're using

in our company but I need some real-life experience without risking

any

company or customers data. That's my preferred way of learning. The cluster consists of 3 Raspberry Pis plus a few VMs running on Proxmox. I'm not using Proxmox' built in Ceph because I want to

focus on

Ceph and not just use it as a preconfigured tool. All hosts are running Fedora (x86_64 and arm64) and during an Upgrade from F36 to F37 my cluster suddenly showed all PGs as unavailable. I worked nearly a week to get it back online and I learned a lot about Ceph management and recovery. The cluster is back but I still can't access my data. Maybe you can help me? Here are my versions: [ceph: root@ceph04 /]# ceph versions { "mon": { "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3 }, "mgr": { "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3 }, "osd": { "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 5 }, "mds": { "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 4 }, "overall": { "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 15 } } Here's MDS status output of one MDS: [ceph: root@ceph04 /]# ceph tell mds.mds01.ceph05.pqxmvt status 2023-01-14T15:30:28.607+0000 7fb9e17fa700 0 client.60986454 ms_handle_reset on v2:192.168.23.65:6800/2680651694 <http://192.168.23.65:6800/2680651694> 2023-01-14T15:30:28.640+0000 7fb9e17fa700 0 client.60986460 ms_handle_reset on v2:192.168.23.65:6800/2680651694 <http://192.168.23.65:6800/2680651694> { "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", "whoami": 0, "id": 60984167, "want_state": "up:replay", "state": "up:replay", "fs_name": "cephfs", "replay_status": { "journal_read_pos": 0, "journal_write_pos": 0, "journal_expire_pos": 0, "num_events": 0, "num_segments": 0 }, "rank_uptime": 1127.54018615, "mdsmap_epoch": 98056, "osdmap_epoch": 12362, "osdmap_epoch_barrier": 0, "uptime": 1127.957307273 } It's staying like that for days now. If there was a counter moving, I just would wait but it doesn't change anything and alle stats says,

the

MDS aren't working at all. The symptom I have is that Dashboard and all other tools I use say,

it's

more or less ok. (Some old messages about failed daemons and

scrubbing

aside). But I can't mount anything. When I try to start a VM that's

RDS I just get a timeout. And when I try to mount a CephFS, mount

just

hangs forever. Whatever command I give MDS or journal, it just hangs. The only

thing I

could do, was take all CephFS offline, kill the MDS's and do a "ceph

reset <fs name> --yes-i-really-mean-it". After that I rebooted all nodes, just to be sure but I still have no access to data. Could you please help me? I'm kinda desperate. If you need any more information, just let me know. Cheers, Thomas -- Thomas Widhalm Lead Systems Engineer NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg Tel: +49 911 92885-0 | Fax: +49 911 92885-77 CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 https://www.netways.de <https://www.netways.de> | thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> ** stackconf 2023 - September - https://stackconf.eu <https://stackconf.eu> ** ** OSMC 2023 - November - https://osmc.de <https://osmc.de> ** ** New at NWS: Managed Database - https://nws.netways.de/managed-database <https://nws.netways.de/managed-database> ** ** NETWAYS Web Services - https://nws.netways.de <https://nws.netways.de> ** _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io <mailto:ceph-users@ceph.io> To unsubscribe send an email to ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io>

Thomas Widhalm

17 Jan 17 Jan

1:13 a.m.

...

Hi Thomas, As the documentation says, the MDS enters up:resolve from

|up:replay| if

the Ceph file system has multiple ranks (including this one),

i.e. it’s

not a single active MDS cluster. The MDS is resolving any uncommitted inter-MDS operations. All

ranks in

the file system must be in this state or later for progress to be

made,

i.e. no rank can be failed/damaged or |up:replay|. So please check the status of the other active mds if it's failed. Also please share the mds logs and the output of 'ceph fs dump' and 'ceph fs status' Thanks, Kotresh H R On Sat, Jan 14, 2023 at 9:07 PM Thomas Widhalm <thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de>

<mailto:thomas.widhalm@netways.de <mailto:thomas.widhalm@netways.de>>> wrote:

Hi, I'm really lost with my Ceph system. I built a small cluster

for home

usage which has two uses for me: I want to replace an old NAS

and I want

to learn about Ceph so that I have hands-on experience. We're

using it

in our company but I need some real-life experience without

risking any

company or customers data. That's my preferred way of learning. The cluster consists of 3 Raspberry Pis plus a few VMs running on Proxmox. I'm not using Proxmox' built in Ceph because I want

to focus on

Ceph and not just use it as a preconfigured tool. All hosts are running Fedora (x86_64 and arm64) and during an

Upgrade

from F36 to F37 my cluster suddenly showed all PGs as

unavailable. I

worked nearly a week to get it back online and I learned a

lot about

Ceph management and recovery. The cluster is back but I still

can't

access my data. Maybe you can help me? Here are my versions: [ceph: root@ceph04 /]# ceph versions { "mon": { "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3 }, "mgr": { "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3 }, "osd": { "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 5 }, "mds": { "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 4 }, "overall": { "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 15 } } Here's MDS status output of one MDS: [ceph: root@ceph04 /]# ceph tell mds.mds01.ceph05.pqxmvt status 2023-01-14T15:30:28.607+0000 7fb9e17fa700 0 client.60986454 ms_handle_reset on v2:192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>

<http://192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>>

2023-01-14T15:30:28.640+0000 7fb9e17fa700 0 client.60986460 ms_handle_reset on v2:192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>

<http://192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>>

{ "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", "whoami": 0, "id": 60984167, "want_state": "up:replay", "state": "up:replay", "fs_name": "cephfs", "replay_status": { "journal_read_pos": 0, "journal_write_pos": 0, "journal_expire_pos": 0, "num_events": 0, "num_segments": 0 }, "rank_uptime": 1127.54018615, "mdsmap_epoch": 98056, "osdmap_epoch": 12362, "osdmap_epoch_barrier": 0, "uptime": 1127.957307273 } It's staying like that for days now. If there was a counter

moving, I

just would wait but it doesn't change anything and alle stats

says, the

MDS aren't working at all. The symptom I have is that Dashboard and all other tools I

use say, it's

more or less ok. (Some old messages about failed daemons and

scrubbing

aside). But I can't mount anything. When I try to start a VM

that's on

RDS I just get a timeout. And when I try to mount a CephFS,

mount just

hangs forever. Whatever command I give MDS or journal, it just hangs. The

only thing I

could do, was take all CephFS offline, kill the MDS's and do

a "ceph fs

reset <fs name> --yes-i-really-mean-it". After that I

rebooted all

nodes, just to be sure but I still have no access to data. Could you please help me? I'm kinda desperate. If you need

any more

information, just let me know. Cheers, Thomas -- Thomas Widhalm Lead Systems Engineer NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg Tel: +49 911 92885-0 | Fax: +49 911 92885-77 CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 https://www.netways.de <https://www.netways.de>

<https://www.netways.de <https://www.netways.de>> |

thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de>

<mailto:thomas.widhalm@netways.de <mailto:thomas.widhalm@netways.de>>

** stackconf 2023 - September - https://stackconf.eu

<https://stackconf.eu>

<https://stackconf.eu <https://stackconf.eu>> ** ** OSMC 2023 - November - https://osmc.de <https://osmc.de>

<https://osmc.de <https://osmc.de>> **

** New at NWS: Managed Database - https://nws.netways.de/managed-database

<https://nws.netways.de/managed-database>

<https://nws.netways.de/managed-database

<https://nws.netways.de/managed-database>> **

** NETWAYS Web Services - https://nws.netways.de

<https://nws.netways.de>

<https://nws.netways.de <https://nws.netways.de>> ** _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io

<mailto:ceph-users@ceph.io>

<mailto:ceph-users@ceph.io <mailto:ceph-users@ceph.io>> To unsubscribe send an email to ceph-users-leave(a)ceph.io

<mailto:ceph-users-leave@ceph.io>

<mailto:ceph-users-leave@ceph.io

<mailto:ceph-users-leave@ceph.io>>

-- Thomas Widhalm Lead Systems Engineer NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg Tel: +49 911 92885-0 | Fax: +49 911 92885-77 CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 https://www.netways.de <https://www.netways.de> | thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> ** stackconf 2023 - September - https://stackconf.eu <https://stackconf.eu> ** ** OSMC 2023 - November - https://osmc.de <https://osmc.de> ** ** New at NWS: Managed Database - https://nws.netways.de/managed-database <https://nws.netways.de/managed-database> ** ** NETWAYS Web Services - https://nws.netways.de <https://nws.netways.de> **

Thomas Widhalm

1:54 a.m.

...

Hi Thomas, As the documentation says, the MDS enters up:resolve from

|up:replay| if

the Ceph file system has multiple ranks (including this one),

i.e. it’s

not a single active MDS cluster. The MDS is resolving any uncommitted inter-MDS operations. All

ranks in

the file system must be in this state or later for progress to be

made,

i.e. no rank can be failed/damaged or |up:replay|. So please check the status of the other active mds if it's failed. Also please share the mds logs and the output of 'ceph fs dump'

and

'ceph fs status' Thanks, Kotresh H R On Sat, Jan 14, 2023 at 9:07 PM Thomas Widhalm <thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de>

<mailto:thomas.widhalm@netways.de <mailto:thomas.widhalm@netways.de>>> wrote:

Hi, I'm really lost with my Ceph system. I built a small cluster

for home

usage which has two uses for me: I want to replace an old NAS

and I want

to learn about Ceph so that I have hands-on experience. We're

using it

in our company but I need some real-life experience without

risking any

company or customers data. That's my preferred way of

learning.

The cluster consists of 3 Raspberry Pis plus a few VMs

running on

Proxmox. I'm not using Proxmox' built in Ceph because I want

to focus on

Ceph and not just use it as a preconfigured tool. All hosts are running Fedora (x86_64 and arm64) and during an

Upgrade

from F36 to F37 my cluster suddenly showed all PGs as

unavailable. I

worked nearly a week to get it back online and I learned a

lot about

Ceph management and recovery. The cluster is back but I still

can't

status

2023-01-14T15:30:28.607+0000 7fb9e17fa700 0 client.60986454 ms_handle_reset on v2:192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>

<http://192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>>

2023-01-14T15:30:28.640+0000 7fb9e17fa700 0 client.60986460 ms_handle_reset on v2:192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>

<http://192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>>

moving, I

just would wait but it doesn't change anything and alle stats

says, the

MDS aren't working at all. The symptom I have is that Dashboard and all other tools I

use say, it's

more or less ok. (Some old messages about failed daemons and

scrubbing

aside). But I can't mount anything. When I try to start a VM

that's on

RDS I just get a timeout. And when I try to mount a CephFS,

mount just

hangs forever. Whatever command I give MDS or journal, it just hangs. The

only thing I

could do, was take all CephFS offline, kill the MDS's and do

a "ceph fs

reset <fs name> --yes-i-really-mean-it". After that I

rebooted all

nodes, just to be sure but I still have no access to data. Could you please help me? I'm kinda desperate. If you need

any more

<https://www.netways.de <https://www.netways.de>> |

thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de>

<mailto:thomas.widhalm@netways.de <mailto:thomas.widhalm@netways.de>>

** stackconf 2023 - September - https://stackconf.eu

<https://stackconf.eu>

<https://stackconf.eu <https://stackconf.eu>> ** ** OSMC 2023 - November - https://osmc.de <https://osmc.de>

<https://osmc.de <https://osmc.de>> **

** New at NWS: Managed Database - https://nws.netways.de/managed-database

<https://nws.netways.de/managed-database>

<https://nws.netways.de/managed-database

<https://nws.netways.de/managed-database>> **

** NETWAYS Web Services - https://nws.netways.de

<https://nws.netways.de>

<https://nws.netways.de <https://nws.netways.de>> ** _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io

<mailto:ceph-users@ceph.io>

<mailto:ceph-users@ceph.io <mailto:ceph-users@ceph.io>> To unsubscribe send an email to ceph-users-leave(a)ceph.io

<mailto:ceph-users-leave@ceph.io>

<mailto:ceph-users-leave@ceph.io

<mailto:ceph-users-leave@ceph.io>>

-- Thomas Widhalm Lead Systems Engineer NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg Tel: +49 911 92885-0 | Fax: +49 911 92885-77 CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 https://www.netways.de <https://www.netways.de> | thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> ** stackconf 2023 - September - https://stackconf.eu <https://stackconf.eu> ** ** OSMC 2023 - November - https://osmc.de <https://osmc.de> ** ** New at NWS: Managed Database - https://nws.netways.de/managed-database <https://nws.netways.de/managed-database> ** ** NETWAYS Web Services - https://nws.netways.de <https://nws.netways.de> **

Thomas Widhalm

4:03 a.m.

Another new thing that just happened: One of the MDS just crashed out of nowhere. /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc: In function 'void EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)' thread 7fccc7153700 time 2023-01-17T10:05:15.420191+0000 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc: 1625: FAILED ceph_assert(g_conf()->mds_wipe_sessions) ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7fccd759943f] 2: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605] 3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c) [0x55fb2b98e89c] 4: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0] 5: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443] 6: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31] 7: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca] 8: clone() and *** Caught signal (Aborted) ** in thread 7fccc7153700 thread_name:md_log_replay ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable) 1: /lib64/libpthread.so.0(+0x12cf0) [0x7fccd6593cf0] 2: gsignal() 3: abort() 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7fccd7599499] 5: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605] 6: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c) [0x55fb2b98e89c] 7: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0] 8: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443] 9: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31] 10: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca] 11: clone() NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. Is what I found in the logs. Since it's referring to log replaying, could this be related to my issue? On 17.01.23 10:54, Thomas Widhalm wrote:

...

Hi Thomas, As the documentation says, the MDS enters up:resolve from

|up:replay| if

the Ceph file system has multiple ranks (including this one),

i.e. it’s

not a single active MDS cluster. The MDS is resolving any uncommitted inter-MDS operations. All

ranks in

the file system must be in this state or later for progress to be

made,

i.e. no rank can be failed/damaged or |up:replay|. So please check the status of the other active mds if it's

failed.

Also please share the mds logs and the output of 'ceph fs dump'

and

'ceph fs status' Thanks, Kotresh H R On Sat, Jan 14, 2023 at 9:07 PM Thomas Widhalm <thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de>

<mailto:thomas.widhalm@netways.de <mailto:thomas.widhalm@netways.de>>> wrote:

Hi, I'm really lost with my Ceph system. I built a small cluster

for home

usage which has two uses for me: I want to replace an old NAS

and I want

to learn about Ceph so that I have hands-on experience. We're

using it

in our company but I need some real-life experience without

risking any

company or customers data. That's my preferred way of

learning.

The cluster consists of 3 Raspberry Pis plus a few VMs

running on

Proxmox. I'm not using Proxmox' built in Ceph because I want

to focus on

Ceph and not just use it as a preconfigured tool. All hosts are running Fedora (x86_64 and arm64) and during an

Upgrade

from F36 to F37 my cluster suddenly showed all PGs as

unavailable. I

worked nearly a week to get it back online and I learned a

lot about

Ceph management and recovery. The cluster is back but I still

can't

status

2023-01-14T15:30:28.607+0000 7fb9e17fa700 0 client.60986454 ms_handle_reset on v2:192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>

<http://192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>>

2023-01-14T15:30:28.640+0000 7fb9e17fa700 0 client.60986460 ms_handle_reset on v2:192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>

<http://192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>>

moving, I

just would wait but it doesn't change anything and alle stats

says, the

MDS aren't working at all. The symptom I have is that Dashboard and all other tools I

use say, it's

more or less ok. (Some old messages about failed daemons and

scrubbing

aside). But I can't mount anything. When I try to start a VM

that's on

RDS I just get a timeout. And when I try to mount a CephFS,

mount just

hangs forever. Whatever command I give MDS or journal, it just hangs. The

only thing I

could do, was take all CephFS offline, kill the MDS's and do

a "ceph fs

reset <fs name> --yes-i-really-mean-it". After that I

rebooted all

nodes, just to be sure but I still have no access to data. Could you please help me? I'm kinda desperate. If you need

any more

<https://www.netways.de <https://www.netways.de>> |

thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de>

<mailto:thomas.widhalm@netways.de <mailto:thomas.widhalm@netways.de>>

** stackconf 2023 - September - https://stackconf.eu

<https://stackconf.eu>

<https://stackconf.eu <https://stackconf.eu>> ** ** OSMC 2023 - November - https://osmc.de <https://osmc.de>

<https://osmc.de <https://osmc.de>> **

** New at NWS: Managed Database - https://nws.netways.de/managed-database

<https://nws.netways.de/managed-database>

<https://nws.netways.de/managed-database

<https://nws.netways.de/managed-database>> **

** NETWAYS Web Services - https://nws.netways.de

<https://nws.netways.de>

<https://nws.netways.de <https://nws.netways.de>> ** _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io

<mailto:ceph-users@ceph.io>

<mailto:ceph-users@ceph.io <mailto:ceph-users@ceph.io>> To unsubscribe send an email to ceph-users-leave(a)ceph.io

<mailto:ceph-users-leave@ceph.io>

<mailto:ceph-users-leave@ceph.io

<mailto:ceph-users-leave@ceph.io>>

-- Thomas Widhalm Lead Systems Engineer NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg Tel: +49 911 92885-0 | Fax: +49 911 92885-77 CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 https://www.netways.de <https://www.netways.de> | thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> ** stackconf 2023 - September - https://stackconf.eu <https://stackconf.eu> ** ** OSMC 2023 - November - https://osmc.de <https://osmc.de> ** ** New at NWS: Managed Database - https://nws.netways.de/managed-database <https://nws.netways.de/managed-database> ** ** NETWAYS Web Services - https://nws.netways.de <https://nws.netways.de> **

Kotresh Hiremath Ravishankar

18 Jan 18 Jan

3:26 a.m.

...

{cephfs:0=mds01.ceph05.pqxmvt=up:replay,cephfs_insecure:0=mds01.ceph04.cvdhsx=up:replay}

>> 2 up:standby >> Error ENOENT: problem getting command descriptions from >> mds.mds01.ceph06.wcfdom >> >> ############################ >> >> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph07.omdisd status >> 2023-01-16T09:00:02.802+0000 7fb7affff700 0 client.61250454 >> ms_handle_reset on v2:192.168.23.67:6800/942898192 >> <http://192.168.23.67:6800/942898192> >> 2023-01-16T09:00:02.831+0000 7fb7affff700 0 client.61299751 >> ms_handle_reset on v2:192.168.23.67:6800/942898192 >> <http://192.168.23.67:6800/942898192> >> { >> "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", >> "whoami": -1, >> "id": 60984161, >> "want_state": "up:standby", >> "state": "up:standby", >> "mdsmap_epoch": 97687, >> "osdmap_epoch": 0, >> "osdmap_epoch_barrier": 0, >> "uptime": 150508.29091721401 >> } >> >> The error message from ceph06 is new to me. That didn't happen the >> last >> times. >> >> [ceph: root@ceph06 /]# ceph fs dump >> e143850 >> enable_multiple, ever_enabled_multiple: 1,1 >> default compat: compat={},rocompat={},incompat={1=base >> v0.20,2=client >> writeable ranges,3=default file layouts on dirs,4=dir inode in >> separate >> object,5=mds uses versioned encoding,6=dirfrag is stored in >> omap,8=no >> anchor table,9=file layout v2,10=snaprealm v2} >> legacy client fscid: 2 >> >> Filesystem 'cephfs' (2) >> fs_name cephfs >> epoch 143850 >> flags 12 joinable allow_snaps allow_multimds_snaps >> created 2023-01-14T14:30:05.723421+0000 >> modified 2023-01-16T09:00:53.663007+0000 >> tableserver 0 >> root 0 >> session_timeout 60 >> session_autoclose 300 >> max_file_size 1099511627776 >> required_client_features {} >> last_failure 0 >> last_failure_osd_epoch 12321 >> compat compat={},rocompat={},incompat={1=base v0.20,2=client >> writeable >> ranges,3=default file layouts on dirs,4=dir inode in separate >> object,5=mds uses versioned encoding,6=dirfrag is stored in >> omap,7=mds >> uses inline data,8=no anchor table,9=file layout v2,10=snaprealm

v2}

> max_mds 1 > in 0 > up {0=60984167} > failed > damaged > stopped > data_pools [4] > metadata_pool 5 > inline_data disabled > balancer > standby_count_wanted 1 > [mds.mds01.ceph05.pqxmvt{0:60984167} state up:replay seq 37637 addr > [v2:192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694 > > <http://192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694

] >> compat {c=[1],r=[1],i=[7ff]}] >> >> >> Filesystem 'cephfs_insecure' (3) >> fs_name cephfs_insecure >> epoch 143849 >> flags 12 joinable allow_snaps allow_multimds_snaps >> created 2023-01-14T14:22:46.360062+0000 >> modified 2023-01-16T09:00:52.632163+0000 >> tableserver 0 >> root 0 >> session_timeout 60 >> session_autoclose 300 >> max_file_size 1099511627776 >> required_client_features {} >> last_failure 0 >> last_failure_osd_epoch 12319 >> compat compat={},rocompat={},incompat={1=base v0.20,2=client >> writeable >> ranges,3=default file layouts on dirs,4=dir inode in separate >> object,5=mds uses versioned encoding,6=dirfrag is stored in >> omap,7=mds >> uses inline data,8=no anchor table,9=file layout v2,10=snaprealm

v2}

> max_mds 1 > in 0 > up {0=60984134} > failed > damaged > stopped > data_pools [7] > metadata_pool 6 > inline_data disabled > balancer > standby_count_wanted 1 > [mds.mds01.ceph04.cvdhsx{0:60984134} state up:replay seq 37639 addr > [v2:192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515 > > <http://192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515

]

> compat {c=[1],r=[1],i=[7ff]}] > > > Standby daemons: > > [mds.mds01.ceph07.omdisd{-1:60984161} state up:standby seq 2 addr > [v2:192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192 > > <http://192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192>] > compat > {c=[1],r=[1],i=[7ff]}] > [mds.mds01.ceph06.hsuhqd{-1:60984828} state up:standby seq 1 addr > [v2:192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518 > > <http://192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518

] >> compat {c=[1],r=[1],i=[7ff]}] >> dumped fsmap epoch 143850 >> >> ############################# >> >> [ceph: root@ceph06 /]# ceph fs status >> >> (doesn't come back) >> >> ############################# >> >> All MDS show log lines similar to this one: >> >> Jan 16 10:05:00 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143927 from mon.1 >> Jan 16 10:05:05 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143929 from mon.1 >> Jan 16 10:05:09 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143930 from mon.1 >> Jan 16 10:05:13 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143931 from mon.1 >> Jan 16 10:05:20 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143933 from mon.1 >> Jan 16 10:05:24 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143935 from mon.1 >> Jan 16 10:05:29 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143936 from mon.1 >> Jan 16 10:05:33 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143937 from mon.1 >> Jan 16 10:05:40 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143939 from mon.1 >> Jan 16 10:05:44 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143941 from mon.1 >> Jan 16 10:05:49 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143942 from mon.1 >> >> Anything else, I can provide? >> >> Cheers and thanks again! >> Thomas >> >> On 16.01.23 06:01, Kotresh Hiremath Ravishankar wrote: >> > Hi Thomas, >> > >> > As the documentation says, the MDS enters up:resolve from >> |up:replay| if >> > the Ceph file system has multiple ranks (including this one), >> i.e. it’s >> > not a single active MDS cluster. >> > The MDS is resolving any uncommitted inter-MDS operations. All >> ranks in >> > the file system must be in this state or later for progress to

>> made, >> > i.e. no rank can be failed/damaged or |up:replay|. >> > >> > So please check the status of the other active mds if it's >> failed. >> > >> > Also please share the mds logs and the output of 'ceph fs dump' >> and >> > 'ceph fs status' >> > >> > Thanks, >> > Kotresh H R >> > >> > On Sat, Jan 14, 2023 at 9:07 PM Thomas Widhalm >> > <thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> >> <mailto:thomas.widhalm@netways.de >> <mailto:thomas.widhalm@netways.de>>> wrote: >> > >> > Hi, >> > >> > I'm really lost with my Ceph system. I built a small cluster >> for home >> > usage which has two uses for me: I want to replace an old

NAS

>> and I want >> > to learn about Ceph so that I have hands-on experience.

We're

>> using it >> > in our company but I need some real-life experience without >> risking any >> > company or customers data. That's my preferred way of >> learning. >> > >> > The cluster consists of 3 Raspberry Pis plus a few VMs >> running on >> > Proxmox. I'm not using Proxmox' built in Ceph because I want >> to focus on >> > Ceph and not just use it as a preconfigured tool. >> > >> > All hosts are running Fedora (x86_64 and arm64) and during

>> Upgrade >> > from F36 to F37 my cluster suddenly showed all PGs as >> unavailable. I >> > worked nearly a week to get it back online and I learned a >> lot about >> > Ceph management and recovery. The cluster is back but I

still

>> can't >> > access my data. Maybe you can help me? >> > >> > Here are my versions: >> > >> > [ceph: root@ceph04 /]# ceph versions >> > { >> > "mon": { >> > "ceph version 17.2.5 >> > (98318ae89f1a893a6ded3a640405cdbb33e08757) >> > quincy (stable)": 3 >> > }, >> > "mgr": { >> > "ceph version 17.2.5 >> > (98318ae89f1a893a6ded3a640405cdbb33e08757) >> > quincy (stable)": 3 >> > }, >> > "osd": { >> > "ceph version 17.2.5 >> > (98318ae89f1a893a6ded3a640405cdbb33e08757) >> > quincy (stable)": 5 >> > }, >> > "mds": { >> > "ceph version 17.2.5 >> > (98318ae89f1a893a6ded3a640405cdbb33e08757) >> > quincy (stable)": 4 >> > }, >> > "overall": { >> > "ceph version 17.2.5 >> > (98318ae89f1a893a6ded3a640405cdbb33e08757) >> > quincy (stable)": 15 >> > } >> > } >> > >> > >> > Here's MDS status output of one MDS: >> > [ceph: root@ceph04 /]# ceph tell mds.mds01.ceph05.pqxmvt >> status >> > 2023-01-14T15:30:28.607+0000 7fb9e17fa700 0 client.60986454 >> > ms_handle_reset on v2:192.168.23.65:6800/2680651694 >> <http://192.168.23.65:6800/2680651694> >> > <http://192.168.23.65:6800/2680651694 >> <http://192.168.23.65:6800/2680651694>> >> > 2023-01-14T15:30:28.640+0000 7fb9e17fa700 0 client.60986460 >> > ms_handle_reset on v2:192.168.23.65:6800/2680651694 >> <http://192.168.23.65:6800/2680651694> >> > <http://192.168.23.65:6800/2680651694 >> <http://192.168.23.65:6800/2680651694>> >> > { >> > "cluster_fsid":

"ff6e50de-ed72-11ec-881c-dca6325c2cc4",

>> > "whoami": 0, >> > "id": 60984167, >> > "want_state": "up:replay", >> > "state": "up:replay", >> > "fs_name": "cephfs", >> > "replay_status": { >> > "journal_read_pos": 0, >> > "journal_write_pos": 0, >> > "journal_expire_pos": 0, >> > "num_events": 0, >> > "num_segments": 0 >> > }, >> > "rank_uptime": 1127.54018615, >> > "mdsmap_epoch": 98056, >> > "osdmap_epoch": 12362, >> > "osdmap_epoch_barrier": 0, >> > "uptime": 1127.957307273 >> > } >> > >> > It's staying like that for days now. If there was a counter >> moving, I >> > just would wait but it doesn't change anything and alle

stats

>> says, the >> > MDS aren't working at all. >> > >> > The symptom I have is that Dashboard and all other tools I >> use say, it's >> > more or less ok. (Some old messages about failed daemons and >> scrubbing >> > aside). But I can't mount anything. When I try to start a VM >> that's on >> > RDS I just get a timeout. And when I try to mount a CephFS, >> mount just >> > hangs forever. >> > >> > Whatever command I give MDS or journal, it just hangs. The >> only thing I >> > could do, was take all CephFS offline, kill the MDS's and do >> a "ceph fs >> > reset <fs name> --yes-i-really-mean-it". After that I >> rebooted all >> > nodes, just to be sure but I still have no access to data. >> > >> > Could you please help me? I'm kinda desperate. If you need >> any more >> > information, just let me know. >> > >> > Cheers, >> > Thomas >> > >> > -- >> > Thomas Widhalm >> > Lead Systems Engineer >> > >> > NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19

> D-90429 Nuernberg > Tel: +49 911 92885-0 | Fax: +49 911 92885-77 > CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 > https://www.netways.de <https://www.netways.de> <https://www.netways.de <https://www.netways.de>> | > thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> <mailto:thomas.widhalm@netways.de <mailto:thomas.widhalm@netways.de>> > > ** stackconf 2023 - September - https://stackconf.eu <https://stackconf.eu> > <https://stackconf.eu <https://stackconf.eu>> ** > ** OSMC 2023 - November - https://osmc.de <https://osmc.de> <https://osmc.de <https://osmc.de>> ** > ** New at NWS: Managed Database - > https://nws.netways.de/managed-database <https://nws.netways.de/managed-database> > <https://nws.netways.de/managed-database <https://nws.netways.de/managed-database>> ** > ** NETWAYS Web Services - https://nws.netways.de <https://nws.netways.de> > <https://nws.netways.de <https://nws.netways.de>> ** > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io <mailto:ceph-users@ceph.io> > <mailto:ceph-users@ceph.io <mailto:ceph-users@ceph.io>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io> > <mailto:ceph-users-leave@ceph.io <mailto:ceph-users-leave@ceph.io>> > -- Thomas Widhalm Lead Systems Engineer NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg Tel: +49 911 92885-0 | Fax: +49 911 92885-77 CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 https://www.netways.de <https://www.netways.de> | thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> ** stackconf 2023 - September - https://stackconf.eu <https://stackconf.eu> ** ** OSMC 2023 - November - https://osmc.de <https://osmc.de> ** ** New at NWS: Managed Database - https://nws.netways.de/managed-database <https://nws.netways.de/managed-database> ** ** NETWAYS Web Services - https://nws.netways.de <https://nws.netways.de> **

Kotresh Hiremath Ravishankar

3:34 a.m.

Hi Thomas, I have created the tracker https://tracker.ceph.com/issues/58489 to track this. Please upload the debug mds logs here. Thanks, Kotresh H R On Wed, Jan 18, 2023 at 4:56 PM Kotresh Hiremath Ravishankar < khiremat(a)redhat.com> wrote:

...

for

> MDS. Maybe that's where my knowledge failed me but I guess, MDS is > waiting for something that was never there. > > Yes, there are two filesystems. Due to "budget restrictions" (it's my > personal system at home, I configured a second CephFS with only one > replica for data that could be easily restored. > > Here's what I got when turning up the debug level: > > Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

> interval 1.000000000s > Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > Sending beacon up:replay seq 11107 > Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > sender thread waiting interval 4s > Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > received beacon reply up:replay seq 11107 rtt 0.00200002 > Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status > Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167 > schedule_update_timer_task > Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have

caps,

> 0 caps, 0 caps per inode > Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > trimming > Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

> interval 1.000000000s > Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have

caps,

> 0 caps, 0 caps per inode > Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > trimming > Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

> interval 1.000000000s > Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status > Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.158167 > schedule_update_timer_task > Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have

caps,

> 0 caps, 0 caps per inode > Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > trimming > Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

> interval 1.000000000s > Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have

caps,

> 0 caps, 0 caps per inode > Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > trimming > Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

> interval 1.000000000s > Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > Sending beacon up:replay seq 11108 > Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > sender thread waiting interval 4s > Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > received beacon reply up:replay seq 11108 rtt 0.00200002 > Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status > Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.158167 > schedule_update_timer_task > Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have

caps,

> 0 caps, 0 caps per inode > Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > trimming > Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

> interval 1.000000000s > Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have

caps,

> 0 caps, 0 caps per inode > Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > trimming > Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

> interval 1.000000000s > Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status > Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.158167 > schedule_update_timer_task > Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have

caps,

> 0 caps, 0 caps per inode > Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > trimming > Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

> interval 1.000000000s > Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have

caps,

> 0 caps, 0 caps per inode > Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > trimming > Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

> interval 1.000000000s > Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > Sending beacon up:replay seq 11109 > Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > sender thread waiting interval 4s > Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > received beacon reply up:replay seq 11109 rtt 0.00600006 > Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status > Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.158167 > schedule_update_timer_task > Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > 372640, rss 57344, heap 207124, baseline 182548, 0 / 3 inodes have

caps,

> 0 caps, 0 caps per inode > Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > trimming > Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache releasing free

memory

> Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

> interval 1.000000000s > Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > 372640, rss 57272, heap 207124, baseline 182548, 0 / 3 inodes have

caps,

> 0 caps, 0 caps per inode > Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > trimming > Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

> interval 1.000000000s > Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status > Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.158167 > schedule_update_timer_task > Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > 372640, rss 57040, heap 207124, baseline 182548, 0 / 3 inodes have

caps,

> 0 caps, 0 caps per inode > Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > trimming > Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

> interval 1.000000000s > > > The only thing that gives me hope here is that the line > mds.beacon.mds01.ceph05.pqxmvt Sending beacon up:replay seq 11109 is > chaning its sequence number. > > Anything else I can provide? > > Cheers, > Thomas > > On 17.01.23 06:27, Kotresh Hiremath Ravishankar wrote: >> Hi Thomas, >> >> Sorry, I misread the mds state to be stuck in 'up:resolve' state. The >> mds is stuck in 'up:replay' which means the MDS taking over a failed >> rank. >> This state represents that the MDS is recovering its journal and other >> metadata. >> >> I notice that there are two filesystems 'cephfs' and 'cephfs_insecure' >> and the active mds for both filesystems are stuck in 'up:replay'. The >> mds >> logs shared are not providing much information to infer anything. >> >> Could you please enable the debug logs and pass on the mds logs ? >> >> Thanks, >> Kotresh H R >> >> On Mon, Jan 16, 2023 at 2:38 PM Thomas Widhalm >> <thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de>> wrote: >> >> Hi Kotresh, >> >> Thanks for your reply! >> >> I only have one rank. Here's the output of all MDS I have: >> >> ################### >> >> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph05.pqxmvt status >> 2023-01-16T08:55:26.055+0000 7f3412ffd700 0 client.61249926 >> ms_handle_reset on v2:192.168.23.65:6800/2680651694 >> <http://192.168.23.65:6800/2680651694> >> 2023-01-16T08:55:26.084+0000 7f3412ffd700 0 client.61299199 >> ms_handle_reset on v2:192.168.23.65:6800/2680651694 >> <http://192.168.23.65:6800/2680651694> >> { >> "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", >> "whoami": 0, >> "id": 60984167, >> "want_state": "up:replay", >> "state": "up:replay", >> "fs_name": "cephfs", >> "replay_status": { >> "journal_read_pos": 0, >> "journal_write_pos": 0, >> "journal_expire_pos": 0, >> "num_events": 0, >> "num_segments": 0 >> }, >> "rank_uptime": 150224.982558844, >> "mdsmap_epoch": 143757, >> "osdmap_epoch": 12395, >> "osdmap_epoch_barrier": 0, >> "uptime": 150225.39968057699 >> } >> >> ######################## >> >> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph04.cvdhsx status >> 2023-01-16T08:59:05.434+0000 7fdb82ff5700 0 client.61299598 >> ms_handle_reset on v2:192.168.23.64:6800/3930607515 >> <http://192.168.23.64:6800/3930607515> >> 2023-01-16T08:59:05.466+0000 7fdb82ff5700 0 client.61299604 >> ms_handle_reset on v2:192.168.23.64:6800/3930607515 >> <http://192.168.23.64:6800/3930607515> >> { >> "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", >> "whoami": 0, >> "id": 60984134, >> "want_state": "up:replay", >> "state": "up:replay", >> "fs_name": "cephfs_insecure", >> "replay_status": { >> "journal_read_pos": 0, >> "journal_write_pos": 0, >> "journal_expire_pos": 0, >> "num_events": 0, >> "num_segments": 0 >> }, >> "rank_uptime": 150450.96934037199, >> "mdsmap_epoch": 143815, >> "osdmap_epoch": 12395, >> "osdmap_epoch_barrier": 0, >> "uptime": 150451.93533502301 >> } >> >> ########################### >> >> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph06.wcfdom status >> 2023-01-16T08:59:28.572+0000 7f16538c0b80 -1 client.61250376 >> resolve_mds: no MDS daemons found by name `mds01.ceph06.wcfdom' >> 2023-01-16T08:59:28.583+0000 7f16538c0b80 -1 client.61250376

FSMap:

>> cephfs:1/1 cephfs_insecure:1/1 >> >>

{cephfs:0=mds01.ceph05.pqxmvt=up:replay,cephfs_insecure:0=mds01.ceph04.cvdhsx=up:replay}

v2}

>> max_mds 1 >> in 0 >> up {0=60984167} >> failed >> damaged >> stopped >> data_pools [4] >> metadata_pool 5 >> inline_data disabled >> balancer >> standby_count_wanted 1 >> [mds.mds01.ceph05.pqxmvt{0:60984167} state up:replay seq 37637

addr

>> [v2:

192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694

>> >> <

http://192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694>]

>> compat {c=[1],r=[1],i=[7ff]}] >> >> >> Filesystem 'cephfs_insecure' (3) >> fs_name cephfs_insecure >> epoch 143849 >> flags 12 joinable allow_snaps allow_multimds_snaps >> created 2023-01-14T14:22:46.360062+0000 >> modified 2023-01-16T09:00:52.632163+0000 >> tableserver 0 >> root 0 >> session_timeout 60 >> session_autoclose 300 >> max_file_size 1099511627776 >> required_client_features {} >> last_failure 0 >> last_failure_osd_epoch 12319 >> compat compat={},rocompat={},incompat={1=base v0.20,2=client >> writeable >> ranges,3=default file layouts on dirs,4=dir inode in separate >> object,5=mds uses versioned encoding,6=dirfrag is stored in >> omap,7=mds >> uses inline data,8=no anchor table,9=file layout v2,10=snaprealm

v2}

>> max_mds 1 >> in 0 >> up {0=60984134} >> failed >> damaged >> stopped >> data_pools [7] >> metadata_pool 6 >> inline_data disabled >> balancer >> standby_count_wanted 1 >> [mds.mds01.ceph04.cvdhsx{0:60984134} state up:replay seq 37639

addr

>> [v2:

192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515

>> >> <

http://192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515>] >>> compat {c=[1],r=[1],i=[7ff]}] >>> >>> >>> Standby daemons: >>> >>> [mds.mds01.ceph07.omdisd{-1:60984161} state up:standby seq 2 addr >>> [v2:192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192 >>> >>> <http://192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192 >] >>> compat >>> {c=[1],r=[1],i=[7ff]}] >>> [mds.mds01.ceph06.hsuhqd{-1:60984828} state up:standby seq 1 addr

>> [v2:

192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518

>> >> <

http://192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518>]

>> compat {c=[1],r=[1],i=[7ff]}] >> dumped fsmap epoch 143850 >> >> ############################# >> >> [ceph: root@ceph06 /]# ceph fs status >> >> (doesn't come back) >> >> ############################# >> >> All MDS show log lines similar to this one: >> >> Jan 16 10:05:00 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143927 from mon.1 >> Jan 16 10:05:05 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143929 from mon.1 >> Jan 16 10:05:09 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143930 from mon.1 >> Jan 16 10:05:13 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143931 from mon.1 >> Jan 16 10:05:20 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143933 from mon.1 >> Jan 16 10:05:24 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143935 from mon.1 >> Jan 16 10:05:29 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143936 from mon.1 >> Jan 16 10:05:33 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143937 from mon.1 >> Jan 16 10:05:40 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143939 from mon.1 >> Jan 16 10:05:44 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143941 from mon.1 >> Jan 16 10:05:49 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143942 from mon.1 >> >> Anything else, I can provide? >> >> Cheers and thanks again! >> Thomas >> >> On 16.01.23 06:01, Kotresh Hiremath Ravishankar wrote: >> > Hi Thomas, >> > >> > As the documentation says, the MDS enters up:resolve from >> |up:replay| if >> > the Ceph file system has multiple ranks (including this one), >> i.e. it’s >> > not a single active MDS cluster. >> > The MDS is resolving any uncommitted inter-MDS operations. All >> ranks in >> > the file system must be in this state or later for progress to

cluster

>> for home >> > usage which has two uses for me: I want to replace an old

NAS

>> and I want >> > to learn about Ceph so that I have hands-on experience.

We're

want

>> to focus on >> > Ceph and not just use it as a preconfigured tool. >> > >> > All hosts are running Fedora (x86_64 and arm64) and during

still

client.60986454

>> > ms_handle_reset on v2:192.168.23.65:6800/2680651694 >> <http://192.168.23.65:6800/2680651694> >> > <http://192.168.23.65:6800/2680651694 >> <http://192.168.23.65:6800/2680651694>> >> > 2023-01-14T15:30:28.640+0000 7fb9e17fa700 0

client.60986460

"ff6e50de-ed72-11ec-881c-dca6325c2cc4",

stats

>> says, the >> > MDS aren't working at all. >> > >> > The symptom I have is that Dashboard and all other tools I >> use say, it's >> > more or less ok. (Some old messages about failed daemons

and

>> scrubbing >> > aside). But I can't mount anything. When I try to start a

>> that's on >> > RDS I just get a timeout. And when I try to mount a CephFS, >> mount just >> > hangs forever. >> > >> > Whatever command I give MDS or journal, it just hangs. The >> only thing I >> > could do, was take all CephFS offline, kill the MDS's and

>> a "ceph fs >> > reset <fs name> --yes-i-really-mean-it". After that I >> rebooted all >> > nodes, just to be sure but I still have no access to data. >> > >> > Could you please help me? I'm kinda desperate. If you need >> any more >> > information, just let me know. >> > >> > Cheers, >> > Thomas >> > >> > -- >> > Thomas Widhalm >> > Lead Systems Engineer >> > >> > NETWAYS Professional Services GmbH | Deutschherrnstr.

15-19 |

> > D-90429 Nuernberg > > Tel: +49 911 92885-0 | Fax: +49 911 92885-77 > > CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 > > https://www.netways.de <https://www.netways.de> > <https://www.netways.de <https://www.netways.de>> | > > thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> > <mailto:thomas.widhalm@netways.de > <mailto:thomas.widhalm@netways.de>> > > > > ** stackconf 2023 - September - https://stackconf.eu > <https://stackconf.eu> > > <https://stackconf.eu <https://stackconf.eu>> ** > > ** OSMC 2023 - November - https://osmc.de <https://osmc.de

> <https://osmc.de <https://osmc.de>> ** > > ** New at NWS: Managed Database - > > https://nws.netways.de/managed-database > <https://nws.netways.de/managed-database> > > <https://nws.netways.de/managed-database > <https://nws.netways.de/managed-database>> ** > > ** NETWAYS Web Services - https://nws.netways.de > <https://nws.netways.de> > > <https://nws.netways.de <https://nws.netways.de>> ** > > _______________________________________________ > > ceph-users mailing list -- ceph-users(a)ceph.io > <mailto:ceph-users@ceph.io> > > <mailto:ceph-users@ceph.io <mailto:ceph-users@ceph.io>> > > To unsubscribe send an email to ceph-users-leave(a)ceph.io > <mailto:ceph-users-leave@ceph.io> > > <mailto:ceph-users-leave@ceph.io > <mailto:ceph-users-leave@ceph.io>> > > > > -- > Thomas Widhalm > Lead Systems Engineer > > NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | > D-90429 Nuernberg > Tel: +49 911 92885-0 | Fax: +49 911 92885-77 > CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 > https://www.netways.de <https://www.netways.de> | > thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> > > ** stackconf 2023 - September - https://stackconf.eu > <https://stackconf.eu> ** > ** OSMC 2023 - November - https://osmc.de <https://osmc.de> ** > ** New at NWS: Managed Database - > https://nws.netways.de/managed-database > <https://nws.netways.de/managed-database> ** > ** NETWAYS Web Services - https://nws.netways.de > <https://nws.netways.de> ** > -- Thomas Widhalm Lead Systems Engineer NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg Tel: +49 911 92885-0 | Fax: +49 911 92885-77 CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 https://www.netways.de | thomas.widhalm(a)netways.de ** stackconf 2023 - September - https://stackconf.eu ** ** OSMC 2023 - November - https://osmc.de ** ** New at NWS: Managed Database - https://nws.netways.de/managed-database ** ** NETWAYS Web Services - https://nws.netways.de ** _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Thomas Widhalm

4:11 a.m.

Thank you. I'm setting the debug level and await authorization for Tracker. I'll upload the logs as soon as I can collect them. Thank you so much for your help On 18.01.23 12:26, Kotresh Hiremath Ravishankar wrote:

...

Hi Thomas, This looks like it requires more investigation than I expected. What's the current status ? Did the crashed mds come back and become active ? Increase the debug log level to 20 and share the mds logs. I will create a tracker and share it here. You can upload the mds logs there. Thanks, Kotresh H R On Tue, Jan 17, 2023 at 5:34 PM Thomas Widhalm <thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de>> wrote: Another new thing that just happened: One of the MDS just crashed out of nowhere. /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc: In function 'void EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)' thread 7fccc7153700 time 2023-01-17T10:05:15.420191+0000 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc: 1625: FAILED ceph_assert(g_conf()->mds_wipe_sessions) ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7fccd759943f] 2: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605] 3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c) [0x55fb2b98e89c] 4: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0] 5: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443] 6: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31] 7: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca] 8: clone() and *** Caught signal (Aborted) ** in thread 7fccc7153700 thread_name:md_log_replay ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable) 1: /lib64/libpthread.so.0(+0x12cf0) [0x7fccd6593cf0] 2: gsignal() 3: abort() 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7fccd7599499] 5: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605] 6: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c) [0x55fb2b98e89c] 7: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0] 8: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443] 9: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31] 10: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca] 11: clone() NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. Is what I found in the logs. Since it's referring to log replaying, could this be related to my issue? On 17.01.23 10:54, Thomas Widhalm wrote:

quite a

while now. I didn't see any changes in the logs but the CPU on

the hosts

that usually don't run MDS just spiked. So high I had to kill the MDS again because otherwise they kept killing OSD containers. So I don't really have any new information, but maybe that could be a hint

of some

kind? Cheers, Thomas On 17.01.23 10:13, Thomas Widhalm wrote: > Hi, > > Thanks again. :-) > > Ok, that seems like an error to me. I never configured an extra

rank for

> MDS. Maybe that's where my knowledge failed me but I guess, MDS is > waiting for something that was never there. > > Yes, there are two filesystems. Due to "budget restrictions"

(it's my

> personal system at home, I configured a second CephFS with only one > replica for data that could be easily restored. > > Here's what I got when turning up the debug level: > > Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

> interval 1.000000000s > Jan 17 10:08:17 ceph05 ceph-mds[1209]:

mds.beacon.mds01.ceph05.pqxmvt

> Sending beacon up:replay seq 11107 > Jan 17 10:08:17 ceph05 ceph-mds[1209]:

mds.beacon.mds01.ceph05.pqxmvt

> sender thread waiting interval 4s > Jan 17 10:08:17 ceph05 ceph-mds[1209]:

mds.beacon.mds01.ceph05.pqxmvt

> received beacon reply up:replay seq 11107 rtt 0.00200002 > Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status > Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167 > schedule_update_timer_task > Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache Memory

usage: total

> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes

have caps,

> 0 caps, 0 caps per inode > Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache cache not

ready for

> trimming > Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

> interval 1.000000000s > Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache Memory

usage: total

> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes

have caps,

> 0 caps, 0 caps per inode > Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache cache not

ready for

> trimming > Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

usage: total

> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes

have caps,

> 0 caps, 0 caps per inode > Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache cache not

ready for

> trimming > Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

> interval 1.000000000s > Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache Memory

usage: total

> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes

have caps,

> 0 caps, 0 caps per inode > Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache cache not

ready for

> trimming > Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

> interval 1.000000000s > Jan 17 10:08:21 ceph05 ceph-mds[1209]:

mds.beacon.mds01.ceph05.pqxmvt

> Sending beacon up:replay seq 11108 > Jan 17 10:08:21 ceph05 ceph-mds[1209]:

mds.beacon.mds01.ceph05.pqxmvt

> sender thread waiting interval 4s > Jan 17 10:08:21 ceph05 ceph-mds[1209]:

mds.beacon.mds01.ceph05.pqxmvt

> received beacon reply up:replay seq 11108 rtt 0.00200002 > Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status > Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.158167 > schedule_update_timer_task > Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache Memory

usage: total

> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes

have caps,

> 0 caps, 0 caps per inode > Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache cache not

ready for

> trimming > Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

> interval 1.000000000s > Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache Memory

usage: total

> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes

have caps,

> 0 caps, 0 caps per inode > Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache cache not

ready for

> trimming > Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

usage: total

> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes

have caps,

> 0 caps, 0 caps per inode > Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache cache not

ready for

> trimming > Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

> interval 1.000000000s > Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache Memory

usage: total

> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes

have caps,

> 0 caps, 0 caps per inode > Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache cache not

ready for

> trimming > Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

> interval 1.000000000s > Jan 17 10:08:25 ceph05 ceph-mds[1209]:

mds.beacon.mds01.ceph05.pqxmvt

> Sending beacon up:replay seq 11109 > Jan 17 10:08:25 ceph05 ceph-mds[1209]:

mds.beacon.mds01.ceph05.pqxmvt

> sender thread waiting interval 4s > Jan 17 10:08:25 ceph05 ceph-mds[1209]:

mds.beacon.mds01.ceph05.pqxmvt

> received beacon reply up:replay seq 11109 rtt 0.00600006 > Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status > Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.158167 > schedule_update_timer_task > Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache Memory

usage: total

> 372640, rss 57344, heap 207124, baseline 182548, 0 / 3 inodes

have caps,

> 0 caps, 0 caps per inode > Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache cache not

ready for

> trimming > Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache releasing

free memory

> Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

> interval 1.000000000s > Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache Memory

usage: total

> 372640, rss 57272, heap 207124, baseline 182548, 0 / 3 inodes

have caps,

> 0 caps, 0 caps per inode > Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache cache not

ready for

> trimming > Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

usage: total

> 372640, rss 57040, heap 207124, baseline 182548, 0 / 3 inodes

have caps,

> 0 caps, 0 caps per inode > Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache cache not

ready for

> trimming > Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread

waiting

state. The

>> mds is stuck in 'up:replay' which means the MDS taking over a

failed

>> rank. >> This state represents that the MDS is recovering its journal

and other

>> metadata. >> >> I notice that there are two filesystems 'cephfs' and

'cephfs_insecure'

>> and the active mds for both filesystems are stuck in

'up:replay'. The

>> mds >> logs shared are not providing much information to infer anything. >> >> Could you please enable the debug logs and pass on the mds logs ? >> >> Thanks, >> Kotresh H R >> >> On Mon, Jan 16, 2023 at 2:38 PM Thomas Widhalm >> <thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de>

<mailto:thomas.widhalm@netways.de <mailto:thomas.widhalm@netways.de>>> wrote:

>> >> Hi Kotresh, >> >> Thanks for your reply! >> >> I only have one rank. Here's the output of all MDS I have: >> >> ################### >> >> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph05.pqxmvt status >> 2023-01-16T08:55:26.055+0000 7f3412ffd700 0 client.61249926 >> ms_handle_reset on v2:192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>

>> <http://192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>>

>> 2023-01-16T08:55:26.084+0000 7f3412ffd700 0 client.61299199 >> ms_handle_reset on v2:192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>

>> <http://192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>>

>> { >> "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", >> "whoami": 0, >> "id": 60984167, >> "want_state": "up:replay", >> "state": "up:replay", >> "fs_name": "cephfs", >> "replay_status": { >> "journal_read_pos": 0, >> "journal_write_pos": 0, >> "journal_expire_pos": 0, >> "num_events": 0, >> "num_segments": 0 >> }, >> "rank_uptime": 150224.982558844, >> "mdsmap_epoch": 143757, >> "osdmap_epoch": 12395, >> "osdmap_epoch_barrier": 0, >> "uptime": 150225.39968057699 >> } >> >> ######################## >> >> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph04.cvdhsx status >> 2023-01-16T08:59:05.434+0000 7fdb82ff5700 0 client.61299598 >> ms_handle_reset on v2:192.168.23.64:6800/3930607515

<http://192.168.23.64:6800/3930607515>

>> <http://192.168.23.64:6800/3930607515

<http://192.168.23.64:6800/3930607515>>

>> 2023-01-16T08:59:05.466+0000 7fdb82ff5700 0 client.61299604 >> ms_handle_reset on v2:192.168.23.64:6800/3930607515

<http://192.168.23.64:6800/3930607515>

>> <http://192.168.23.64:6800/3930607515

<http://192.168.23.64:6800/3930607515>>

>> { >> "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", >> "whoami": 0, >> "id": 60984134, >> "want_state": "up:replay", >> "state": "up:replay", >> "fs_name": "cephfs_insecure", >> "replay_status": { >> "journal_read_pos": 0, >> "journal_write_pos": 0, >> "journal_expire_pos": 0, >> "num_events": 0, >> "num_segments": 0 >> }, >> "rank_uptime": 150450.96934037199, >> "mdsmap_epoch": 143815, >> "osdmap_epoch": 12395, >> "osdmap_epoch_barrier": 0, >> "uptime": 150451.93533502301 >> } >> >> ########################### >> >> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph06.wcfdom status >> 2023-01-16T08:59:28.572+0000 7f16538c0b80 -1 client.61250376 >> resolve_mds: no MDS daemons found by name `mds01.ceph06.wcfdom' >> 2023-01-16T08:59:28.583+0000 7f16538c0b80 -1

client.61250376 FSMap:

>> cephfs:1/1 cephfs_insecure:1/1 >> >>

{cephfs:0=mds01.ceph05.pqxmvt=up:replay,cephfs_insecure:0=mds01.ceph04.cvdhsx=up:replay}

<http://192.168.23.67:6800/942898192>

>> <http://192.168.23.67:6800/942898192

<http://192.168.23.67:6800/942898192>>

>> 2023-01-16T09:00:02.831+0000 7fb7affff700 0 client.61299751 >> ms_handle_reset on v2:192.168.23.67:6800/942898192

<http://192.168.23.67:6800/942898192>

>> <http://192.168.23.67:6800/942898192

<http://192.168.23.67:6800/942898192>>

>> { >> "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", >> "whoami": -1, >> "id": 60984161, >> "want_state": "up:standby", >> "state": "up:standby", >> "mdsmap_epoch": 97687, >> "osdmap_epoch": 0, >> "osdmap_epoch_barrier": 0, >> "uptime": 150508.29091721401 >> } >> >> The error message from ceph06 is new to me. That didn't

happen the

>> last >> times. >> >> [ceph: root@ceph06 /]# ceph fs dump >> e143850 >> enable_multiple, ever_enabled_multiple: 1,1 >> default compat: compat={},rocompat={},incompat={1=base >> v0.20,2=client >> writeable ranges,3=default file layouts on dirs,4=dir inode in >> separate >> object,5=mds uses versioned encoding,6=dirfrag is stored in >> omap,8=no >> anchor table,9=file layout v2,10=snaprealm v2} >> legacy client fscid: 2 >> >> Filesystem 'cephfs' (2) >> fs_name cephfs >> epoch 143850 >> flags 12 joinable allow_snaps allow_multimds_snaps >> created 2023-01-14T14:30:05.723421+0000 >> modified 2023-01-16T09:00:53.663007+0000 >> tableserver 0 >> root 0 >> session_timeout 60 >> session_autoclose 300 >> max_file_size 1099511627776 >> required_client_features {} >> last_failure 0 >> last_failure_osd_epoch 12321 >> compat compat={},rocompat={},incompat={1=base v0.20,2=client >> writeable >> ranges,3=default file layouts on dirs,4=dir inode in separate >> object,5=mds uses versioned encoding,6=dirfrag is stored in >> omap,7=mds >> uses inline data,8=no anchor table,9=file layout

v2,10=snaprealm v2}

37637 addr

[v2:192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694 <http://192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694>

<http://192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694 <http://192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694>>] >>> compat {c=[1],r=[1],i=[7ff]}]

>>> Filesystem 'cephfs_insecure' (3) >>> fs_name cephfs_insecure >>> epoch 143849 >>> flags 12 joinable allow_snaps allow_multimds_snaps >>> created 2023-01-14T14:22:46.360062+0000 >>> modified 2023-01-16T09:00:52.632163+0000 >>> tableserver 0 >>> root 0 >>> session_timeout 60 >>> session_autoclose 300 >>> max_file_size 1099511627776 >>> required_client_features {} >>> last_failure 0 >>> last_failure_osd_epoch 12319 >>> compat compat={},rocompat={},incompat={1=base v0.20,2=client >>> writeable >>> ranges,3=default file layouts on dirs,4=dir inode in separate >>> object,5=mds uses versioned encoding,6=dirfrag is stored in >>> omap,7=mds >>> uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}

37639 addr

[v2:192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515 <http://192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515>

<http://192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515 <http://192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515>>] >>> compat {c=[1],r=[1],i=[7ff]}]

>>> Standby daemons:

>>> [mds.mds01.ceph07.omdisd{-1:60984161} state up:standby seq 2 addr

[v2:192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192 <http://192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192>

<http://192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192 <http://192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192>>]

>> compat >> {c=[1],r=[1],i=[7ff]}] >> [mds.mds01.ceph06.hsuhqd{-1:60984828} state up:standby seq

1 addr

[v2:192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518 <http://192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518>

<http://192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518 <http://192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518>>] >>> compat {c=[1],r=[1],i=[7ff]}] >>> dumped fsmap epoch 143850

>>> #############################

>>> [ceph: root@ceph06 /]# ceph fs status

>>> (doesn't come back)

>>> #############################

>>> All MDS show log lines similar to this one:

>>> Jan 16 10:05:00 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >>> Updating >>> MDS map to version 143927 from mon.1 >>> Jan 16 10:05:05 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >>> Updating >>> MDS map to version 143929 from mon.1 >>> Jan 16 10:05:09 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >>> Updating >>> MDS map to version 143930 from mon.1 >>> Jan 16 10:05:13 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >>> Updating >>> MDS map to version 143931 from mon.1 >>> Jan 16 10:05:20 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >>> Updating >>> MDS map to version 143933 from mon.1 >>> Jan 16 10:05:24 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >>> Updating >>> MDS map to version 143935 from mon.1 >>> Jan 16 10:05:29 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >>> Updating >>> MDS map to version 143936 from mon.1 >>> Jan 16 10:05:33 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >>> Updating >>> MDS map to version 143937 from mon.1 >>> Jan 16 10:05:40 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >>> Updating >>> MDS map to version 143939 from mon.1 >>> Jan 16 10:05:44 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >>> Updating >>> MDS map to version 143941 from mon.1 >>> Jan 16 10:05:49 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >>> Updating >>> MDS map to version 143942 from mon.1

>>> Anything else, I can provide?

>>> Cheers and thanks again! >>> Thomas

>>> On 16.01.23 06:01, Kotresh Hiremath Ravishankar wrote: >>> > Hi Thomas, >>> > >>> > As the documentation says, the MDS enters up:resolve from >>> |up:replay| if >>> > the Ceph file system has multiple ranks (including this one),

>> i.e. it’s >> > not a single active MDS cluster. >> > The MDS is resolving any uncommitted inter-MDS

operations. All

>> ranks in >> > the file system must be in this state or later for

progress to be

fs dump'

>> and >> > 'ceph fs status' >> > >> > Thanks, >> > Kotresh H R >> > >> > On Sat, Jan 14, 2023 at 9:07 PM Thomas Widhalm >> > <thomas.widhalm(a)netways.de

<mailto:thomas.widhalm@netways.de> <mailto:thomas.widhalm@netways.de <mailto:thomas.widhalm@netways.de>>

>> <mailto:thomas.widhalm@netways.de

<mailto:thomas.widhalm@netways.de>

>> <mailto:thomas.widhalm@netways.de

<mailto:thomas.widhalm@netways.de>>>> wrote:

>> > >> > Hi, >> > >> > I'm really lost with my Ceph system. I built a small

cluster

>> for home >> > usage which has two uses for me: I want to replace

an old NAS

>> and I want >> > to learn about Ceph so that I have hands-on

experience. We're

>> using it >> > in our company but I need some real-life experience

without

>> risking any >> > company or customers data. That's my preferred way of >> learning. >> > >> > The cluster consists of 3 Raspberry Pis plus a few VMs >> running on >> > Proxmox. I'm not using Proxmox' built in Ceph

because I want

>> to focus on >> > Ceph and not just use it as a preconfigured tool. >> > >> > All hosts are running Fedora (x86_64 and arm64) and

during an

>> Upgrade >> > from F36 to F37 my cluster suddenly showed all PGs as >> unavailable. I >> > worked nearly a week to get it back online and I

learned a

>> lot about >> > Ceph management and recovery. The cluster is back

but I still

client.60986454

>> > ms_handle_reset on v2:192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>

>> <http://192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>>

>> > <http://192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>

>> <http://192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>>>

>> > 2023-01-14T15:30:28.640+0000 7fb9e17fa700 0

client.60986460

>> > ms_handle_reset on v2:192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>

>> <http://192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>>

>> > <http://192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>

>> <http://192.168.23.65:6800/2680651694

<http://192.168.23.65:6800/2680651694>>>

>> > { >> > "cluster_fsid":

"ff6e50de-ed72-11ec-881c-dca6325c2cc4",

counter

>> moving, I >> > just would wait but it doesn't change anything and

alle stats

>> says, the >> > MDS aren't working at all. >> > >> > The symptom I have is that Dashboard and all other

tools I

>> use say, it's >> > more or less ok. (Some old messages about failed

daemons and

>> scrubbing >> > aside). But I can't mount anything. When I try to

start a VM

>> that's on >> > RDS I just get a timeout. And when I try to mount a

CephFS,

>> mount just >> > hangs forever. >> > >> > Whatever command I give MDS or journal, it just

hangs. The

>> only thing I >> > could do, was take all CephFS offline, kill the

MDS's and do

>> a "ceph fs >> > reset <fs name> --yes-i-really-mean-it". After that I >> rebooted all >> > nodes, just to be sure but I still have no access to

data.

>> > >> > Could you please help me? I'm kinda desperate. If

you need

>> any more >> > information, just let me know. >> > >> > Cheers, >> > Thomas >> > >> > -- >> > Thomas Widhalm >> > Lead Systems Engineer >> > >> > NETWAYS Professional Services GmbH |

Deutschherrnstr. 15-19 |

>> > D-90429 Nuernberg >> > Tel: +49 911 92885-0 | Fax: +49 911 92885-77 >> > CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 >> > https://www.netways.de <https://www.netways.de>

<https://www.netways.de <https://www.netways.de>>

>> <https://www.netways.de <https://www.netways.de>

<https://www.netways.de <https://www.netways.de>>> |

>> > thomas.widhalm(a)netways.de

<mailto:thomas.widhalm@netways.de> <mailto:thomas.widhalm@netways.de <mailto:thomas.widhalm@netways.de>>

>> <mailto:thomas.widhalm@netways.de

<mailto:thomas.widhalm@netways.de>

>> <mailto:thomas.widhalm@netways.de

<mailto:thomas.widhalm@netways.de>>>

>> > >> > ** stackconf 2023 - September - https://stackconf.eu

<https://stackconf.eu>

>> <https://stackconf.eu <https://stackconf.eu>> >> > <https://stackconf.eu <https://stackconf.eu>

<https://stackconf.eu <https://stackconf.eu>>> **

>> > ** OSMC 2023 - November - https://osmc.de

<https://osmc.de> <https://osmc.de <https://osmc.de>>

>> <https://osmc.de <https://osmc.de> <https://osmc.de

<https://osmc.de>>> **

>> > ** New at NWS: Managed Database - >> > https://nws.netways.de/managed-database

<https://nws.netways.de/managed-database>

>> <https://nws.netways.de/managed-database

<https://nws.netways.de/managed-database>>

>> > <https://nws.netways.de/managed-database

<https://nws.netways.de/managed-database>

>> <https://nws.netways.de/managed-database

<https://nws.netways.de/managed-database>>> **

>> > ** NETWAYS Web Services - https://nws.netways.de

<https://nws.netways.de>

>> <https://nws.netways.de <https://nws.netways.de>> >> > <https://nws.netways.de <https://nws.netways.de>

<https://nws.netways.de <https://nws.netways.de>>> **

>> > _______________________________________________ >> > ceph-users mailing list -- ceph-users(a)ceph.io

<mailto:ceph-users@ceph.io>

>> <mailto:ceph-users@ceph.io <mailto:ceph-users@ceph.io>> >> > <mailto:ceph-users@ceph.io

<mailto:ceph-users@ceph.io> <mailto:ceph-users@ceph.io <mailto:ceph-users@ceph.io>>>

>> > To unsubscribe send an email to

ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io>

>> <mailto:ceph-users-leave@ceph.io

<mailto:ceph-users-leave@ceph.io>>

>> > <mailto:ceph-users-leave@ceph.io

<mailto:ceph-users-leave@ceph.io>

>> <mailto:ceph-users-leave@ceph.io

<mailto:ceph-users-leave@ceph.io>>> >>> >

>>> -- >>> Thomas Widhalm >>> Lead Systems Engineer

>>> NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | >>> D-90429 Nuernberg >>> Tel: +49 911 92885-0 | Fax: +49 911 92885-77 >>> CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 >>> https://www.netways.de <https://www.netways.de> <https://www.netways.de <https://www.netways.de>> |

>> thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de>

<mailto:thomas.widhalm@netways.de <mailto:thomas.widhalm@netways.de>>

>>> ** stackconf 2023 - September - https://stackconf.eu <https://stackconf.eu>

>> <https://stackconf.eu <https://stackconf.eu>> ** >> ** OSMC 2023 - November - https://osmc.de <https://osmc.de>

<https://osmc.de <https://osmc.de>> **

>> ** New at NWS: Managed Database - >> https://nws.netways.de/managed-database

<https://nws.netways.de/managed-database>

>> <https://nws.netways.de/managed-database

<https://nws.netways.de/managed-database>> **

>> ** NETWAYS Web Services - https://nws.netways.de

<https://nws.netways.de> >>> <https://nws.netways.de <https://nws.netways.de>> **

>> >> -- >> Thomas Widhalm >> Lead Systems Engineer >> >> NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429

> Nuernberg > Tel: +49 911 92885-0 | Fax: +49 911 92885-77 > CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 > https://www.netways.de <https://www.netways.de> |

thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de>

> > ** stackconf 2023 - September - https://stackconf.eu

<https://stackconf.eu> **

> ** OSMC 2023 - November - https://osmc.de <https://osmc.de> ** > ** New at NWS: Managed Database - > https://nws.netways.de/managed-database

<https://nws.netways.de/managed-database> **

> ** NETWAYS Web Services - https://nws.netways.de

<https://nws.netways.de> **

> _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io

<mailto:ceph-users@ceph.io>

> To unsubscribe send an email to ceph-users-leave(a)ceph.io

<mailto:ceph-users-leave@ceph.io>

thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de>

** stackconf 2023 - September - https://stackconf.eu

<https://stackconf.eu> **

** OSMC 2023 - November - https://osmc.de <https://osmc.de> ** ** New at NWS: Managed Database - https://nws.netways.de/managed-database

<https://nws.netways.de/managed-database> **

** NETWAYS Web Services - https://nws.netways.de

<https://nws.netways.de> **

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io

<mailto:ceph-users@ceph.io>

To unsubscribe send an email to ceph-users-leave(a)ceph.io

<mailto:ceph-users-leave@ceph.io> -- Thomas Widhalm Lead Systems Engineer NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg Tel: +49 911 92885-0 | Fax: +49 911 92885-77 CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 https://www.netways.de <https://www.netways.de> | thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> ** stackconf 2023 - September - https://stackconf.eu <https://stackconf.eu> ** ** OSMC 2023 - November - https://osmc.de <https://osmc.de> ** ** New at NWS: Managed Database - https://nws.netways.de/managed-database <https://nws.netways.de/managed-database> ** ** NETWAYS Web Services - https://nws.netways.de <https://nws.netways.de> ** _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io <mailto:ceph-users@ceph.io> To unsubscribe send an email to ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io>

Venky Shankar

19 Jan 19 Jan

5:01 a.m.

Hi Thomas, On Tue, Jan 17, 2023 at 5:34 PM Thomas Widhalm <thomas.widhalm(a)netways.de> wrote:

...

To workaround this (for now) till the bug is fixed, set mds_wipe_sessions = true in ceph.conf, allow the MDS to transition to `active` state. Once done, flush the journal: ceph tell mds.<> flush journal then you can safely remove the config.

...

Hi Thomas, Sorry, I misread the mds state to be stuck in 'up:resolve' state. The mds is stuck in 'up:replay' which means the MDS taking over a failed rank. This state represents that the MDS is recovering its journal and other metadata. I notice that there are two filesystems 'cephfs' and 'cephfs_insecure' and the active mds for both filesystems are stuck in 'up:replay'. The mds logs shared are not providing much information to infer anything. Could you please enable the debug logs and pass on the mds logs ? Thanks, Kotresh H R On Mon, Jan 16, 2023 at 2:38 PM Thomas Widhalm <thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de>> wrote: Hi Kotresh, Thanks for your reply! I only have one rank. Here's the output of all MDS I have: ################### [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph05.pqxmvt status 2023-01-16T08:55:26.055+0000 7f3412ffd700 0 client.61249926 ms_handle_reset on v2:192.168.23.65:6800/2680651694 <http://192.168.23.65:6800/2680651694> 2023-01-16T08:55:26.084+0000 7f3412ffd700 0 client.61299199 ms_handle_reset on v2:192.168.23.65:6800/2680651694 <http://192.168.23.65:6800/2680651694> { "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", "whoami": 0, "id": 60984167, "want_state": "up:replay", "state": "up:replay", "fs_name": "cephfs", "replay_status": { "journal_read_pos": 0, "journal_write_pos": 0, "journal_expire_pos": 0, "num_events": 0, "num_segments": 0 }, "rank_uptime": 150224.982558844, "mdsmap_epoch": 143757, "osdmap_epoch": 12395, "osdmap_epoch_barrier": 0, "uptime": 150225.39968057699 } ######################## [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph04.cvdhsx status 2023-01-16T08:59:05.434+0000 7fdb82ff5700 0 client.61299598 ms_handle_reset on v2:192.168.23.64:6800/3930607515 <http://192.168.23.64:6800/3930607515> 2023-01-16T08:59:05.466+0000 7fdb82ff5700 0 client.61299604 ms_handle_reset on v2:192.168.23.64:6800/3930607515 <http://192.168.23.64:6800/3930607515> { "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", "whoami": 0, "id": 60984134, "want_state": "up:replay", "state": "up:replay", "fs_name": "cephfs_insecure", "replay_status": { "journal_read_pos": 0, "journal_write_pos": 0, "journal_expire_pos": 0, "num_events": 0, "num_segments": 0 }, "rank_uptime": 150450.96934037199, "mdsmap_epoch": 143815, "osdmap_epoch": 12395, "osdmap_epoch_barrier": 0, "uptime": 150451.93533502301 } ########################### [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph06.wcfdom status 2023-01-16T08:59:28.572+0000 7f16538c0b80 -1 client.61250376 resolve_mds: no MDS daemons found by name `mds01.ceph06.wcfdom' 2023-01-16T08:59:28.583+0000 7f16538c0b80 -1 client.61250376 FSMap: cephfs:1/1 cephfs_insecure:1/1 {cephfs:0=mds01.ceph05.pqxmvt=up:replay,cephfs_insecure:0=mds01.ceph04.cvdhsx=up:replay} 2 up:standby Error ENOENT: problem getting command descriptions from mds.mds01.ceph06.wcfdom ############################ [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph07.omdisd status 2023-01-16T09:00:02.802+0000 7fb7affff700 0 client.61250454 ms_handle_reset on v2:192.168.23.67:6800/942898192 <http://192.168.23.67:6800/942898192> 2023-01-16T09:00:02.831+0000 7fb7affff700 0 client.61299751 ms_handle_reset on v2:192.168.23.67:6800/942898192 <http://192.168.23.67:6800/942898192> { "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", "whoami": -1, "id": 60984161, "want_state": "up:standby", "state": "up:standby", "mdsmap_epoch": 97687, "osdmap_epoch": 0, "osdmap_epoch_barrier": 0, "uptime": 150508.29091721401 } The error message from ceph06 is new to me. That didn't happen the last times. [ceph: root@ceph06 /]# ceph fs dump e143850 enable_multiple, ever_enabled_multiple: 1,1 default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2} legacy client fscid: 2 Filesystem 'cephfs' (2) fs_name cephfs epoch 143850 flags 12 joinable allow_snaps allow_multimds_snaps created 2023-01-14T14:30:05.723421+0000 modified 2023-01-16T09:00:53.663007+0000 tableserver 0 root 0 session_timeout 60 session_autoclose 300 max_file_size 1099511627776 required_client_features {} last_failure 0 last_failure_osd_epoch 12321 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2} max_mds 1 in 0 up {0=60984167} failed damaged stopped data_pools [4] metadata_pool 5 inline_data disabled balancer standby_count_wanted 1 [mds.mds01.ceph05.pqxmvt{0:60984167} state up:replay seq 37637 addr [v2:192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694 <http://192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694>] compat {c=[1],r=[1],i=[7ff]}] Filesystem 'cephfs_insecure' (3) fs_name cephfs_insecure epoch 143849 flags 12 joinable allow_snaps allow_multimds_snaps created 2023-01-14T14:22:46.360062+0000 modified 2023-01-16T09:00:52.632163+0000 tableserver 0 root 0 session_timeout 60 session_autoclose 300 max_file_size 1099511627776 required_client_features {} last_failure 0 last_failure_osd_epoch 12319 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2} max_mds 1 in 0 up {0=60984134} failed damaged stopped data_pools [7] metadata_pool 6 inline_data disabled balancer standby_count_wanted 1 [mds.mds01.ceph04.cvdhsx{0:60984134} state up:replay seq 37639 addr [v2:192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515 <http://192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515>] compat {c=[1],r=[1],i=[7ff]}] Standby daemons: [mds.mds01.ceph07.omdisd{-1:60984161} state up:standby seq 2 addr [v2:192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192 <http://192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192>] compat {c=[1],r=[1],i=[7ff]}] [mds.mds01.ceph06.hsuhqd{-1:60984828} state up:standby seq 1 addr [v2:192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518 <http://192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518>] compat {c=[1],r=[1],i=[7ff]}] dumped fsmap epoch 143850 ############################# [ceph: root@ceph06 /]# ceph fs status (doesn't come back) ############################# All MDS show log lines similar to this one: Jan 16 10:05:00 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143927 from mon.1 Jan 16 10:05:05 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143929 from mon.1 Jan 16 10:05:09 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143930 from mon.1 Jan 16 10:05:13 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143931 from mon.1 Jan 16 10:05:20 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143933 from mon.1 Jan 16 10:05:24 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143935 from mon.1 Jan 16 10:05:29 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143936 from mon.1 Jan 16 10:05:33 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143937 from mon.1 Jan 16 10:05:40 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143939 from mon.1 Jan 16 10:05:44 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143941 from mon.1 Jan 16 10:05:49 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143942 from mon.1 Anything else, I can provide? Cheers and thanks again! Thomas On 16.01.23 06:01, Kotresh Hiremath Ravishankar wrote: > Hi Thomas, > > As the documentation says, the MDS enters up:resolve from |up:replay| if > the Ceph file system has multiple ranks (including this one), i.e. it’s > not a single active MDS cluster. > The MDS is resolving any uncommitted inter-MDS operations. All ranks in > the file system must be in this state or later for progress to be made, > i.e. no rank can be failed/damaged or |up:replay|. > > So please check the status of the other active mds if it's failed. > > Also please share the mds logs and the output of 'ceph fs dump' and > 'ceph fs status' > > Thanks, > Kotresh H R > > On Sat, Jan 14, 2023 at 9:07 PM Thomas Widhalm > <thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> <mailto:thomas.widhalm@netways.de <mailto:thomas.widhalm@netways.de>>> wrote: > > Hi, > > I'm really lost with my Ceph system. I built a small cluster for home > usage which has two uses for me: I want to replace an old NAS and I want > to learn about Ceph so that I have hands-on experience. We're using it > in our company but I need some real-life experience without risking any > company or customers data. That's my preferred way of learning. > > The cluster consists of 3 Raspberry Pis plus a few VMs running on > Proxmox. I'm not using Proxmox' built in Ceph because I want to focus on > Ceph and not just use it as a preconfigured tool. > > All hosts are running Fedora (x86_64 and arm64) and during an Upgrade > from F36 to F37 my cluster suddenly showed all PGs as unavailable. I > worked nearly a week to get it back online and I learned a lot about > Ceph management and recovery. The cluster is back but I still can't > access my data. Maybe you can help me? > > Here are my versions: > > [ceph: root@ceph04 /]# ceph versions > { > "mon": { > "ceph version 17.2.5 > (98318ae89f1a893a6ded3a640405cdbb33e08757) > quincy (stable)": 3 > }, > "mgr": { > "ceph version 17.2.5 > (98318ae89f1a893a6ded3a640405cdbb33e08757) > quincy (stable)": 3 > }, > "osd": { > "ceph version 17.2.5 > (98318ae89f1a893a6ded3a640405cdbb33e08757) > quincy (stable)": 5 > }, > "mds": { > "ceph version 17.2.5 > (98318ae89f1a893a6ded3a640405cdbb33e08757) > quincy (stable)": 4 > }, > "overall": { > "ceph version 17.2.5 > (98318ae89f1a893a6ded3a640405cdbb33e08757) > quincy (stable)": 15 > } > } > > > Here's MDS status output of one MDS: > [ceph: root@ceph04 /]# ceph tell mds.mds01.ceph05.pqxmvt status > 2023-01-14T15:30:28.607+0000 7fb9e17fa700 0 client.60986454 > ms_handle_reset on v2:192.168.23.65:6800/2680651694 <http://192.168.23.65:6800/2680651694> > <http://192.168.23.65:6800/2680651694 <http://192.168.23.65:6800/2680651694>> > 2023-01-14T15:30:28.640+0000 7fb9e17fa700 0 client.60986460 > ms_handle_reset on v2:192.168.23.65:6800/2680651694 <http://192.168.23.65:6800/2680651694> > <http://192.168.23.65:6800/2680651694 <http://192.168.23.65:6800/2680651694>> > { > "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", > "whoami": 0, > "id": 60984167, > "want_state": "up:replay", > "state": "up:replay", > "fs_name": "cephfs", > "replay_status": { > "journal_read_pos": 0, > "journal_write_pos": 0, > "journal_expire_pos": 0, > "num_events": 0, > "num_segments": 0 > }, > "rank_uptime": 1127.54018615, > "mdsmap_epoch": 98056, > "osdmap_epoch": 12362, > "osdmap_epoch_barrier": 0, > "uptime": 1127.957307273 > } > > It's staying like that for days now. If there was a counter moving, I > just would wait but it doesn't change anything and alle stats says, the > MDS aren't working at all. > > The symptom I have is that Dashboard and all other tools I use say, it's > more or less ok. (Some old messages about failed daemons and scrubbing > aside). But I can't mount anything. When I try to start a VM that's on > RDS I just get a timeout. And when I try to mount a CephFS, mount just > hangs forever. > > Whatever command I give MDS or journal, it just hangs. The only thing I > could do, was take all CephFS offline, kill the MDS's and do a "ceph fs > reset <fs name> --yes-i-really-mean-it". After that I rebooted all > nodes, just to be sure but I still have no access to data. > > Could you please help me? I'm kinda desperate. If you need any more > information, just let me know. > > Cheers, > Thomas > > -- > Thomas Widhalm > Lead Systems Engineer > > NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | > D-90429 Nuernberg > Tel: +49 911 92885-0 | Fax: +49 911 92885-77 > CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 > https://www.netways.de <https://www.netways.de> <https://www.netways.de <https://www.netways.de>> | > thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> <mailto:thomas.widhalm@netways.de <mailto:thomas.widhalm@netways.de>> > > ** stackconf 2023 - September - https://stackconf.eu <https://stackconf.eu> > <https://stackconf.eu <https://stackconf.eu>> ** > ** OSMC 2023 - November - https://osmc.de <https://osmc.de> <https://osmc.de <https://osmc.de>> ** > ** New at NWS: Managed Database - > https://nws.netways.de/managed-database <https://nws.netways.de/managed-database> > <https://nws.netways.de/managed-database <https://nws.netways.de/managed-database>> ** > ** NETWAYS Web Services - https://nws.netways.de <https://nws.netways.de> > <https://nws.netways.de <https://nws.netways.de>> ** > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io <mailto:ceph-users@ceph.io> > <mailto:ceph-users@ceph.io <mailto:ceph-users@ceph.io>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io> > <mailto:ceph-users-leave@ceph.io <mailto:ceph-users-leave@ceph.io>> > -- Thomas Widhalm Lead Systems Engineer NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg Tel: +49 911 92885-0 | Fax: +49 911 92885-77 CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 https://www.netways.de <https://www.netways.de> | thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> ** stackconf 2023 - September - https://stackconf.eu <https://stackconf.eu> ** ** OSMC 2023 - November - https://osmc.de <https://osmc.de> ** ** New at NWS: Managed Database - https://nws.netways.de/managed-database <https://nws.netways.de/managed-database> ** ** NETWAYS Web Services - https://nws.netways.de <https://nws.netways.de> **

-- Cheers, Venky

Thomas Widhalm

5:17 a.m.

Hi Venky, Thanks. I just uploaded my logs to the tracker. I'll try what you suggested and will let you know how it went. Cheers, Thomas On 19.01.23 14:01, Venky Shankar wrote:

...

Hi Thomas, On Tue, Jan 17, 2023 at 5:34 PM Thomas Widhalm <thomas.widhalm(a)netways.de> wrote:

Hi, Thanks again. :-) Ok, that seems like an error to me. I never configured an extra rank for MDS. Maybe that's where my knowledge failed me but I guess, MDS is waiting for something that was never there. Yes, there are two filesystems. Due to "budget restrictions" (it's my personal system at home, I configured a second CephFS with only one replica for data that could be easily restored. Here's what I got when turning up the debug level: Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt Sending beacon up:replay seq 11107 Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt sender thread waiting interval 4s Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt received beacon reply up:replay seq 11107 rtt 0.00200002 Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167 schedule_update_timer_task Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.158167 schedule_update_timer_task Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt Sending beacon up:replay seq 11108 Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt sender thread waiting interval 4s Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt received beacon reply up:replay seq 11108 rtt 0.00200002 Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.158167 schedule_update_timer_task Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.158167 schedule_update_timer_task Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt Sending beacon up:replay seq 11109 Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt sender thread waiting interval 4s Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt received beacon reply up:replay seq 11109 rtt 0.00600006 Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.158167 schedule_update_timer_task Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57344, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache releasing free memory Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57272, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.158167 schedule_update_timer_task Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57040, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s The only thing that gives me hope here is that the line mds.beacon.mds01.ceph05.pqxmvt Sending beacon up:replay seq 11109 is chaning its sequence number. Anything else I can provide? Cheers, Thomas On 17.01.23 06:27, Kotresh Hiremath Ravishankar wrote: > Hi Thomas, > > Sorry, I misread the mds state to be stuck in 'up:resolve' state. The > mds is stuck in 'up:replay' which means the MDS taking over a failed > rank. > This state represents that the MDS is recovering its journal and other > metadata. > > I notice that there are two filesystems 'cephfs' and 'cephfs_insecure' > and the active mds for both filesystems are stuck in 'up:replay'. The > mds > logs shared are not providing much information to infer anything. > > Could you please enable the debug logs and pass on the mds logs ? > > Thanks, > Kotresh H R > > On Mon, Jan 16, 2023 at 2:38 PM Thomas Widhalm > <thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de>> wrote: > > Hi Kotresh, > > Thanks for your reply! > > I only have one rank. Here's the output of all MDS I have: > > ################### > > [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph05.pqxmvt status > 2023-01-16T08:55:26.055+0000 7f3412ffd700 0 client.61249926 > ms_handle_reset on v2:192.168.23.65:6800/2680651694 > <http://192.168.23.65:6800/2680651694> > 2023-01-16T08:55:26.084+0000 7f3412ffd700 0 client.61299199 > ms_handle_reset on v2:192.168.23.65:6800/2680651694 > <http://192.168.23.65:6800/2680651694> > { > "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", > "whoami": 0, > "id": 60984167, > "want_state": "up:replay", > "state": "up:replay", > "fs_name": "cephfs", > "replay_status": { > "journal_read_pos": 0, > "journal_write_pos": 0, > "journal_expire_pos": 0, > "num_events": 0, > "num_segments": 0 > }, > "rank_uptime": 150224.982558844, > "mdsmap_epoch": 143757, > "osdmap_epoch": 12395, > "osdmap_epoch_barrier": 0, > "uptime": 150225.39968057699 > } > > ######################## > > [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph04.cvdhsx status > 2023-01-16T08:59:05.434+0000 7fdb82ff5700 0 client.61299598 > ms_handle_reset on v2:192.168.23.64:6800/3930607515 > <http://192.168.23.64:6800/3930607515> > 2023-01-16T08:59:05.466+0000 7fdb82ff5700 0 client.61299604 > ms_handle_reset on v2:192.168.23.64:6800/3930607515 > <http://192.168.23.64:6800/3930607515> > { > "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", > "whoami": 0, > "id": 60984134, > "want_state": "up:replay", > "state": "up:replay", > "fs_name": "cephfs_insecure", > "replay_status": { > "journal_read_pos": 0, > "journal_write_pos": 0, > "journal_expire_pos": 0, > "num_events": 0, > "num_segments": 0 > }, > "rank_uptime": 150450.96934037199, > "mdsmap_epoch": 143815, > "osdmap_epoch": 12395, > "osdmap_epoch_barrier": 0, > "uptime": 150451.93533502301 > } > > ########################### > > [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph06.wcfdom status > 2023-01-16T08:59:28.572+0000 7f16538c0b80 -1 client.61250376 > resolve_mds: no MDS daemons found by name `mds01.ceph06.wcfdom' > 2023-01-16T08:59:28.583+0000 7f16538c0b80 -1 client.61250376 FSMap: > cephfs:1/1 cephfs_insecure:1/1 > > {cephfs:0=mds01.ceph05.pqxmvt=up:replay,cephfs_insecure:0=mds01.ceph04.cvdhsx=up:replay} > 2 up:standby > Error ENOENT: problem getting command descriptions from > mds.mds01.ceph06.wcfdom > > ############################ > > [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph07.omdisd status > 2023-01-16T09:00:02.802+0000 7fb7affff700 0 client.61250454 > ms_handle_reset on v2:192.168.23.67:6800/942898192 > <http://192.168.23.67:6800/942898192> > 2023-01-16T09:00:02.831+0000 7fb7affff700 0 client.61299751 > ms_handle_reset on v2:192.168.23.67:6800/942898192 > <http://192.168.23.67:6800/942898192> > { > "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", > "whoami": -1, > "id": 60984161, > "want_state": "up:standby", > "state": "up:standby", > "mdsmap_epoch": 97687, > "osdmap_epoch": 0, > "osdmap_epoch_barrier": 0, > "uptime": 150508.29091721401 > } > > The error message from ceph06 is new to me. That didn't happen the > last > times. > > [ceph: root@ceph06 /]# ceph fs dump > e143850 > enable_multiple, ever_enabled_multiple: 1,1 > default compat: compat={},rocompat={},incompat={1=base > v0.20,2=client > writeable ranges,3=default file layouts on dirs,4=dir inode in > separate > object,5=mds uses versioned encoding,6=dirfrag is stored in > omap,8=no > anchor table,9=file layout v2,10=snaprealm v2} > legacy client fscid: 2 > > Filesystem 'cephfs' (2) > fs_name cephfs > epoch 143850 > flags 12 joinable allow_snaps allow_multimds_snaps > created 2023-01-14T14:30:05.723421+0000 > modified 2023-01-16T09:00:53.663007+0000 > tableserver 0 > root 0 > session_timeout 60 > session_autoclose 300 > max_file_size 1099511627776 > required_client_features {} > last_failure 0 > last_failure_osd_epoch 12321 > compat compat={},rocompat={},incompat={1=base v0.20,2=client > writeable > ranges,3=default file layouts on dirs,4=dir inode in separate > object,5=mds uses versioned encoding,6=dirfrag is stored in > omap,7=mds > uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2} > max_mds 1 > in 0 > up {0=60984167} > failed > damaged > stopped > data_pools [4] > metadata_pool 5 > inline_data disabled > balancer > standby_count_wanted 1 > [mds.mds01.ceph05.pqxmvt{0:60984167} state up:replay seq 37637 addr > [v2:192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694 > > <http://192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694>] > compat {c=[1],r=[1],i=[7ff]}] > > > Filesystem 'cephfs_insecure' (3) > fs_name cephfs_insecure > epoch 143849 > flags 12 joinable allow_snaps allow_multimds_snaps > created 2023-01-14T14:22:46.360062+0000 > modified 2023-01-16T09:00:52.632163+0000 > tableserver 0 > root 0 > session_timeout 60 > session_autoclose 300 > max_file_size 1099511627776 > required_client_features {} > last_failure 0 > last_failure_osd_epoch 12319 > compat compat={},rocompat={},incompat={1=base v0.20,2=client > writeable > ranges,3=default file layouts on dirs,4=dir inode in separate > object,5=mds uses versioned encoding,6=dirfrag is stored in > omap,7=mds > uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2} > max_mds 1 > in 0 > up {0=60984134} > failed > damaged > stopped > data_pools [7] > metadata_pool 6 > inline_data disabled > balancer > standby_count_wanted 1 > [mds.mds01.ceph04.cvdhsx{0:60984134} state up:replay seq 37639 addr > [v2:192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515 > > <http://192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515>] > compat {c=[1],r=[1],i=[7ff]}] > > > Standby daemons: > > [mds.mds01.ceph07.omdisd{-1:60984161} state up:standby seq 2 addr > [v2:192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192 > > <http://192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192>] > compat > {c=[1],r=[1],i=[7ff]}] > [mds.mds01.ceph06.hsuhqd{-1:60984828} state up:standby seq 1 addr > [v2:192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518 > > <http://192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518>] > compat {c=[1],r=[1],i=[7ff]}] > dumped fsmap epoch 143850 > > ############################# > > [ceph: root@ceph06 /]# ceph fs status > > (doesn't come back) > > ############################# > > All MDS show log lines similar to this one: > > Jan 16 10:05:00 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx > Updating > MDS map to version 143927 from mon.1 > Jan 16 10:05:05 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx > Updating > MDS map to version 143929 from mon.1 > Jan 16 10:05:09 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx > Updating > MDS map to version 143930 from mon.1 > Jan 16 10:05:13 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx > Updating > MDS map to version 143931 from mon.1 > Jan 16 10:05:20 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx > Updating > MDS map to version 143933 from mon.1 > Jan 16 10:05:24 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx > Updating > MDS map to version 143935 from mon.1 > Jan 16 10:05:29 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx > Updating > MDS map to version 143936 from mon.1 > Jan 16 10:05:33 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx > Updating > MDS map to version 143937 from mon.1 > Jan 16 10:05:40 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx > Updating > MDS map to version 143939 from mon.1 > Jan 16 10:05:44 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx > Updating > MDS map to version 143941 from mon.1 > Jan 16 10:05:49 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx > Updating > MDS map to version 143942 from mon.1 > > Anything else, I can provide? > > Cheers and thanks again! > Thomas > > On 16.01.23 06:01, Kotresh Hiremath Ravishankar wrote: > > Hi Thomas, > > > > As the documentation says, the MDS enters up:resolve from > |up:replay| if > > the Ceph file system has multiple ranks (including this one), > i.e. it’s > > not a single active MDS cluster. > > The MDS is resolving any uncommitted inter-MDS operations. All > ranks in > > the file system must be in this state or later for progress to be > made, > > i.e. no rank can be failed/damaged or |up:replay|. > > > > So please check the status of the other active mds if it's > failed. > > > > Also please share the mds logs and the output of 'ceph fs dump' > and > > 'ceph fs status' > > > > Thanks, > > Kotresh H R > > > > On Sat, Jan 14, 2023 at 9:07 PM Thomas Widhalm > > <thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> > <mailto:thomas.widhalm@netways.de > <mailto:thomas.widhalm@netways.de>>> wrote: > > > > Hi, > > > > I'm really lost with my Ceph system. I built a small cluster > for home > > usage which has two uses for me: I want to replace an old NAS > and I want > > to learn about Ceph so that I have hands-on experience. We're > using it > > in our company but I need some real-life experience without > risking any > > company or customers data. That's my preferred way of > learning. > > > > The cluster consists of 3 Raspberry Pis plus a few VMs > running on > > Proxmox. I'm not using Proxmox' built in Ceph because I want > to focus on > > Ceph and not just use it as a preconfigured tool. > > > > All hosts are running Fedora (x86_64 and arm64) and during an > Upgrade > > from F36 to F37 my cluster suddenly showed all PGs as > unavailable. I > > worked nearly a week to get it back online and I learned a > lot about > > Ceph management and recovery. The cluster is back but I still > can't > > access my data. Maybe you can help me? > > > > Here are my versions: > > > > [ceph: root@ceph04 /]# ceph versions > > { > > "mon": { > > "ceph version 17.2.5 > > (98318ae89f1a893a6ded3a640405cdbb33e08757) > > quincy (stable)": 3 > > }, > > "mgr": { > > "ceph version 17.2.5 > > (98318ae89f1a893a6ded3a640405cdbb33e08757) > > quincy (stable)": 3 > > }, > > "osd": { > > "ceph version 17.2.5 > > (98318ae89f1a893a6ded3a640405cdbb33e08757) > > quincy (stable)": 5 > > }, > > "mds": { > > "ceph version 17.2.5 > > (98318ae89f1a893a6ded3a640405cdbb33e08757) > > quincy (stable)": 4 > > }, > > "overall": { > > "ceph version 17.2.5 > > (98318ae89f1a893a6ded3a640405cdbb33e08757) > > quincy (stable)": 15 > > } > > } > > > > > > Here's MDS status output of one MDS: > > [ceph: root@ceph04 /]# ceph tell mds.mds01.ceph05.pqxmvt > status > > 2023-01-14T15:30:28.607+0000 7fb9e17fa700 0 client.60986454 > > ms_handle_reset on v2:192.168.23.65:6800/2680651694 > <http://192.168.23.65:6800/2680651694> > > <http://192.168.23.65:6800/2680651694 > <http://192.168.23.65:6800/2680651694>> > > 2023-01-14T15:30:28.640+0000 7fb9e17fa700 0 client.60986460 > > ms_handle_reset on v2:192.168.23.65:6800/2680651694 > <http://192.168.23.65:6800/2680651694> > > <http://192.168.23.65:6800/2680651694 > <http://192.168.23.65:6800/2680651694>> > > { > > "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", > > "whoami": 0, > > "id": 60984167, > > "want_state": "up:replay", > > "state": "up:replay", > > "fs_name": "cephfs", > > "replay_status": { > > "journal_read_pos": 0, > > "journal_write_pos": 0, > > "journal_expire_pos": 0, > > "num_events": 0, > > "num_segments": 0 > > }, > > "rank_uptime": 1127.54018615, > > "mdsmap_epoch": 98056, > > "osdmap_epoch": 12362, > > "osdmap_epoch_barrier": 0, > > "uptime": 1127.957307273 > > } > > > > It's staying like that for days now. If there was a counter > moving, I > > just would wait but it doesn't change anything and alle stats > says, the > > MDS aren't working at all. > > > > The symptom I have is that Dashboard and all other tools I > use say, it's > > more or less ok. (Some old messages about failed daemons and > scrubbing > > aside). But I can't mount anything. When I try to start a VM > that's on > > RDS I just get a timeout. And when I try to mount a CephFS, > mount just > > hangs forever. > > > > Whatever command I give MDS or journal, it just hangs. The > only thing I > > could do, was take all CephFS offline, kill the MDS's and do > a "ceph fs > > reset <fs name> --yes-i-really-mean-it". After that I > rebooted all > > nodes, just to be sure but I still have no access to data. > > > > Could you please help me? I'm kinda desperate. If you need > any more > > information, just let me know. > > > > Cheers, > > Thomas > > > > -- > > Thomas Widhalm > > Lead Systems Engineer > > > > NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | > > D-90429 Nuernberg > > Tel: +49 911 92885-0 | Fax: +49 911 92885-77 > > CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 > > https://www.netways.de <https://www.netways.de> > <https://www.netways.de <https://www.netways.de>> | > > thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> > <mailto:thomas.widhalm@netways.de > <mailto:thomas.widhalm@netways.de>> > > > > ** stackconf 2023 - September - https://stackconf.eu > <https://stackconf.eu> > > <https://stackconf.eu <https://stackconf.eu>> ** > > ** OSMC 2023 - November - https://osmc.de <https://osmc.de> > <https://osmc.de <https://osmc.de>> ** > > ** New at NWS: Managed Database - > > https://nws.netways.de/managed-database > <https://nws.netways.de/managed-database> > > <https://nws.netways.de/managed-database > <https://nws.netways.de/managed-database>> ** > > ** NETWAYS Web Services - https://nws.netways.de > <https://nws.netways.de> > > <https://nws.netways.de <https://nws.netways.de>> ** > > _______________________________________________ > > ceph-users mailing list -- ceph-users(a)ceph.io > <mailto:ceph-users@ceph.io> > > <mailto:ceph-users@ceph.io <mailto:ceph-users@ceph.io>> > > To unsubscribe send an email to ceph-users-leave(a)ceph.io > <mailto:ceph-users-leave@ceph.io> > > <mailto:ceph-users-leave@ceph.io > <mailto:ceph-users-leave@ceph.io>> > > > > -- > Thomas Widhalm > Lead Systems Engineer > > NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | > D-90429 Nuernberg > Tel: +49 911 92885-0 | Fax: +49 911 92885-77 > CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 > https://www.netways.de <https://www.netways.de> | > thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> > > ** stackconf 2023 - September - https://stackconf.eu > <https://stackconf.eu> ** > ** OSMC 2023 - November - https://osmc.de <https://osmc.de> ** > ** New at NWS: Managed Database - > https://nws.netways.de/managed-database > <https://nws.netways.de/managed-database> ** > ** NETWAYS Web Services - https://nws.netways.de > <https://nws.netways.de> ** > -- Thomas Widhalm Lead Systems Engineer NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg Tel: +49 911 92885-0 | Fax: +49 911 92885-77 CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 https://www.netways.de | thomas.widhalm(a)netways.de ** stackconf 2023 - September - https://stackconf.eu ** ** OSMC 2023 - November - https://osmc.de ** ** New at NWS: Managed Database - https://nws.netways.de/managed-database ** ** NETWAYS Web Services - https://nws.netways.de ** _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Thomas Widhalm

5:45 a.m.

Hi, Unfortunately the workaround didn't work out: [ceph: root@ceph05 /]# ceph config show mds.mds01.ceph06.hsuhqd | grep mds_wipe mds_wipe_sessions true mon [ceph: root@ceph05 /]# ceph config show mds.mds01.ceph04.cvdhsx | grep mds_wipe mds_wipe_sessions true mon [ceph: root@ceph05 /]# ceph config show mds.mds01.ceph05.pqxmvt | grep mds_wipe mds_wipe_sessions true mon [ceph: root@ceph05 /]# ceph tell mds.mds01.ceph05.pqxmvt flush journal 2023-01-19T13:38:07.403+0000 7ff94e7fc700 0 client.61855055 ms_handle_reset on v2:192.168.23.65:6800/957802673 2023-01-19T13:38:07.427+0000 7ff94e7fc700 0 client.61855061 ms_handle_reset on v2:192.168.23.65:6800/957802673 Error ENOSYS: [ceph: root@ceph05 /]# ceph tell mds.mds01.ceph06.hsuhqd flush journal 2023-01-19T13:38:34.694+0000 7f789effd700 0 client.61855142 ms_handle_reset on v2:192.168.23.66:6810/2868317045 2023-01-19T13:38:34.728+0000 7f789effd700 0 client.61855148 ms_handle_reset on v2:192.168.23.66:6810/2868317045 { "message": "", "return_code": 0 } [ceph: root@ceph05 /]# ceph tell mds.mds01.ceph04.cvdhsx flush journal 2023-01-19T13:38:46.402+0000 7fdee77fe700 0 client.61855172 ms_handle_reset on v2:192.168.23.64:6800/1605877585 2023-01-19T13:38:46.435+0000 7fdee77fe700 0 client.61855178 ms_handle_reset on v2:192.168.23.64:6800/1605877585 { "message": "", "return_code": 0 } [ceph: root@ceph05 /]# ceph fs dump e198622 enable_multiple, ever_enabled_multiple: 1,1 default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2} legacy client fscid: 2 Filesystem 'cephfs' (2) fs_name cephfs epoch 198622 flags 12 joinable allow_snaps allow_multimds_snaps created 2023-01-14T14:30:05.723421+0000 modified 2023-01-19T13:39:25.239395+0000 tableserver 0 root 0 session_timeout 60 session_autoclose 300 max_file_size 1099511627776 required_client_features {} last_failure 0 last_failure_osd_epoch 13541 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2} max_mds 1 in 0 up {0=61834171} failed damaged stopped data_pools [4] metadata_pool 5 inline_data disabled balancer standby_count_wanted 1 [mds.mds01.ceph04.cvdhsx{0:61834171} state up:replay seq 240 addr [v2:192.168.23.64:6800/1605877585,v1:192.168.23.64:6801/1605877585] compat {c=[1],r=[1],i=[7ff]}] Filesystem 'cephfs_insecure' (3) fs_name cephfs_insecure epoch 198621 flags 12 joinable allow_snaps allow_multimds_snaps created 2023-01-14T14:22:46.360062+0000 modified 2023-01-19T13:39:22.799446+0000 tableserver 0 root 0 session_timeout 60 session_autoclose 300 max_file_size 1099511627776 required_client_features {} last_failure 0 last_failure_osd_epoch 13539 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2} max_mds 1 in 0 up {0=61834120} failed damaged stopped data_pools [7] metadata_pool 6 inline_data disabled balancer standby_count_wanted 1 [mds.mds01.ceph06.hsuhqd{0:61834120} state up:replay seq 241 addr [v2:192.168.23.66:6810/2868317045,v1:192.168.23.66:6811/2868317045] compat {c=[1],r=[1],i=[7ff]}] Standby daemons: [mds.mds01.ceph05.pqxmvt{-1:61834887} state up:standby seq 1 addr [v2:192.168.23.65:6800/957802673,v1:192.168.23.65:6801/957802673] compat {c=[1],r=[1],i=[7ff]}] dumped fsmap epoch 198622 On 19.01.23 14:01, Venky Shankar wrote:

...

Hi Thomas, On Tue, Jan 17, 2023 at 5:34 PM Thomas Widhalm <thomas.widhalm(a)netways.de> wrote:

Venky Shankar

7:37 p.m.

Hi Thomas, On Thu, Jan 19, 2023 at 7:15 PM Thomas Widhalm <thomas.widhalm(a)netways.de> wrote:

...

The journal flush did work which is an indication that the mds made progress.

...

[ceph: root@ceph05 /]# ceph fs dump e198622 enable_multiple, ever_enabled_multiple: 1,1 default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2} legacy client fscid: 2 Filesystem 'cephfs' (2) fs_name cephfs epoch 198622 flags 12 joinable allow_snaps allow_multimds_snaps created 2023-01-14T14:30:05.723421+0000 modified 2023-01-19T13:39:25.239395+0000 tableserver 0 root 0 session_timeout 60 session_autoclose 300 max_file_size 1099511627776 required_client_features {} last_failure 0 last_failure_osd_epoch 13541 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2} max_mds 1 in 0 up {0=61834171} failed damaged stopped data_pools [4] metadata_pool 5 inline_data disabled balancer standby_count_wanted 1 [mds.mds01.ceph04.cvdhsx{0:61834171} state up:replay seq 240 addr [v2:192.168.23.64:6800/1605877585,v1:192.168.23.64:6801/1605877585] compat {c=[1],r=[1],i=[7ff]}] Filesystem 'cephfs_insecure' (3) fs_name cephfs_insecure epoch 198621 flags 12 joinable allow_snaps allow_multimds_snaps created 2023-01-14T14:22:46.360062+0000 modified 2023-01-19T13:39:22.799446+0000 tableserver 0 root 0 session_timeout 60 session_autoclose 300 max_file_size 1099511627776 required_client_features {} last_failure 0 last_failure_osd_epoch 13539 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2} max_mds 1 in 0 up {0=61834120} failed damaged stopped data_pools [7] metadata_pool 6 inline_data disabled balancer standby_count_wanted 1 [mds.mds01.ceph06.hsuhqd{0:61834120} state up:replay seq 241 addr [v2:192.168.23.66:6810/2868317045,v1:192.168.23.66:6811/2868317045] compat {c=[1],r=[1],i=[7ff]}] Standby daemons: [mds.mds01.ceph05.pqxmvt{-1:61834887} state up:standby seq 1 addr [v2:192.168.23.65:6800/957802673,v1:192.168.23.65:6801/957802673] compat {c=[1],r=[1],i=[7ff]}] dumped fsmap epoch 198622

Hmmm.. the MDSs are going through the replay state which is expected. Do you see the MDSs crashing again?

...

On 19.01.23 14:01, Venky Shankar wrote:

Hi Thomas, On Tue, Jan 17, 2023 at 5:34 PM Thomas Widhalm <thomas.widhalm(a)netways.de> wrote:

Hi again, Another thing I found: Out of pure desperation, I started MDS on all nodes. I had them configured in the past so I was hoping, they could help with bringing in missing data even when they were down for quite a while now. I didn't see any changes in the logs but the CPU on the hosts that usually don't run MDS just spiked. So high I had to kill the MDS again because otherwise they kept killing OSD containers. So I don't really have any new information, but maybe that could be a hint of some kind? Cheers, Thomas On 17.01.23 10:13, Thomas Widhalm wrote: > Hi, > > Thanks again. :-) > > Ok, that seems like an error to me. I never configured an extra rank for > MDS. Maybe that's where my knowledge failed me but I guess, MDS is > waiting for something that was never there. > > Yes, there are two filesystems. Due to "budget restrictions" (it's my > personal system at home, I configured a second CephFS with only one > replica for data that could be easily restored. > > Here's what I got when turning up the debug level: > > Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > interval 1.000000000s > Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > Sending beacon up:replay seq 11107 > Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > sender thread waiting interval 4s > Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > received beacon reply up:replay seq 11107 rtt 0.00200002 > Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status > Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167 > schedule_update_timer_task > Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, > 0 caps, 0 caps per inode > Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > trimming > Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > interval 1.000000000s > Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, > 0 caps, 0 caps per inode > Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > trimming > Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > interval 1.000000000s > Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status > Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.158167 > schedule_update_timer_task > Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, > 0 caps, 0 caps per inode > Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > trimming > Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > interval 1.000000000s > Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, > 0 caps, 0 caps per inode > Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > trimming > Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > interval 1.000000000s > Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > Sending beacon up:replay seq 11108 > Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > sender thread waiting interval 4s > Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > received beacon reply up:replay seq 11108 rtt 0.00200002 > Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status > Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.158167 > schedule_update_timer_task > Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, > 0 caps, 0 caps per inode > Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > trimming > Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > interval 1.000000000s > Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, > 0 caps, 0 caps per inode > Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > trimming > Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > interval 1.000000000s > Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status > Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.158167 > schedule_update_timer_task > Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, > 0 caps, 0 caps per inode > Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > trimming > Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > interval 1.000000000s > Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, > 0 caps, 0 caps per inode > Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > trimming > Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > interval 1.000000000s > Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > Sending beacon up:replay seq 11109 > Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > sender thread waiting interval 4s > Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt > received beacon reply up:replay seq 11109 rtt 0.00600006 > Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status > Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.158167 > schedule_update_timer_task > Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > 372640, rss 57344, heap 207124, baseline 182548, 0 / 3 inodes have caps, > 0 caps, 0 caps per inode > Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > trimming > Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache releasing free memory > Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > interval 1.000000000s > Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > 372640, rss 57272, heap 207124, baseline 182548, 0 / 3 inodes have caps, > 0 caps, 0 caps per inode > Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > trimming > Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > interval 1.000000000s > Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status > Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.158167 > schedule_update_timer_task > Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total > 372640, rss 57040, heap 207124, baseline 182548, 0 / 3 inodes have caps, > 0 caps, 0 caps per inode > Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for > trimming > Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting > interval 1.000000000s > > > The only thing that gives me hope here is that the line > mds.beacon.mds01.ceph05.pqxmvt Sending beacon up:replay seq 11109 is > chaning its sequence number. > > Anything else I can provide? > > Cheers, > Thomas > > On 17.01.23 06:27, Kotresh Hiremath Ravishankar wrote: >> Hi Thomas, >> >> Sorry, I misread the mds state to be stuck in 'up:resolve' state. The >> mds is stuck in 'up:replay' which means the MDS taking over a failed >> rank. >> This state represents that the MDS is recovering its journal and other >> metadata. >> >> I notice that there are two filesystems 'cephfs' and 'cephfs_insecure' >> and the active mds for both filesystems are stuck in 'up:replay'. The >> mds >> logs shared are not providing much information to infer anything. >> >> Could you please enable the debug logs and pass on the mds logs ? >> >> Thanks, >> Kotresh H R >> >> On Mon, Jan 16, 2023 at 2:38 PM Thomas Widhalm >> <thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de>> wrote: >> >> Hi Kotresh, >> >> Thanks for your reply! >> >> I only have one rank. Here's the output of all MDS I have: >> >> ################### >> >> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph05.pqxmvt status >> 2023-01-16T08:55:26.055+0000 7f3412ffd700 0 client.61249926 >> ms_handle_reset on v2:192.168.23.65:6800/2680651694 >> <http://192.168.23.65:6800/2680651694> >> 2023-01-16T08:55:26.084+0000 7f3412ffd700 0 client.61299199 >> ms_handle_reset on v2:192.168.23.65:6800/2680651694 >> <http://192.168.23.65:6800/2680651694> >> { >> "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", >> "whoami": 0, >> "id": 60984167, >> "want_state": "up:replay", >> "state": "up:replay", >> "fs_name": "cephfs", >> "replay_status": { >> "journal_read_pos": 0, >> "journal_write_pos": 0, >> "journal_expire_pos": 0, >> "num_events": 0, >> "num_segments": 0 >> }, >> "rank_uptime": 150224.982558844, >> "mdsmap_epoch": 143757, >> "osdmap_epoch": 12395, >> "osdmap_epoch_barrier": 0, >> "uptime": 150225.39968057699 >> } >> >> ######################## >> >> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph04.cvdhsx status >> 2023-01-16T08:59:05.434+0000 7fdb82ff5700 0 client.61299598 >> ms_handle_reset on v2:192.168.23.64:6800/3930607515 >> <http://192.168.23.64:6800/3930607515> >> 2023-01-16T08:59:05.466+0000 7fdb82ff5700 0 client.61299604 >> ms_handle_reset on v2:192.168.23.64:6800/3930607515 >> <http://192.168.23.64:6800/3930607515> >> { >> "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", >> "whoami": 0, >> "id": 60984134, >> "want_state": "up:replay", >> "state": "up:replay", >> "fs_name": "cephfs_insecure", >> "replay_status": { >> "journal_read_pos": 0, >> "journal_write_pos": 0, >> "journal_expire_pos": 0, >> "num_events": 0, >> "num_segments": 0 >> }, >> "rank_uptime": 150450.96934037199, >> "mdsmap_epoch": 143815, >> "osdmap_epoch": 12395, >> "osdmap_epoch_barrier": 0, >> "uptime": 150451.93533502301 >> } >> >> ########################### >> >> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph06.wcfdom status >> 2023-01-16T08:59:28.572+0000 7f16538c0b80 -1 client.61250376 >> resolve_mds: no MDS daemons found by name `mds01.ceph06.wcfdom' >> 2023-01-16T08:59:28.583+0000 7f16538c0b80 -1 client.61250376 FSMap: >> cephfs:1/1 cephfs_insecure:1/1 >> >> {cephfs:0=mds01.ceph05.pqxmvt=up:replay,cephfs_insecure:0=mds01.ceph04.cvdhsx=up:replay} >> 2 up:standby >> Error ENOENT: problem getting command descriptions from >> mds.mds01.ceph06.wcfdom >> >> ############################ >> >> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph07.omdisd status >> 2023-01-16T09:00:02.802+0000 7fb7affff700 0 client.61250454 >> ms_handle_reset on v2:192.168.23.67:6800/942898192 >> <http://192.168.23.67:6800/942898192> >> 2023-01-16T09:00:02.831+0000 7fb7affff700 0 client.61299751 >> ms_handle_reset on v2:192.168.23.67:6800/942898192 >> <http://192.168.23.67:6800/942898192> >> { >> "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", >> "whoami": -1, >> "id": 60984161, >> "want_state": "up:standby", >> "state": "up:standby", >> "mdsmap_epoch": 97687, >> "osdmap_epoch": 0, >> "osdmap_epoch_barrier": 0, >> "uptime": 150508.29091721401 >> } >> >> The error message from ceph06 is new to me. That didn't happen the >> last >> times. >> >> [ceph: root@ceph06 /]# ceph fs dump >> e143850 >> enable_multiple, ever_enabled_multiple: 1,1 >> default compat: compat={},rocompat={},incompat={1=base >> v0.20,2=client >> writeable ranges,3=default file layouts on dirs,4=dir inode in >> separate >> object,5=mds uses versioned encoding,6=dirfrag is stored in >> omap,8=no >> anchor table,9=file layout v2,10=snaprealm v2} >> legacy client fscid: 2 >> >> Filesystem 'cephfs' (2) >> fs_name cephfs >> epoch 143850 >> flags 12 joinable allow_snaps allow_multimds_snaps >> created 2023-01-14T14:30:05.723421+0000 >> modified 2023-01-16T09:00:53.663007+0000 >> tableserver 0 >> root 0 >> session_timeout 60 >> session_autoclose 300 >> max_file_size 1099511627776 >> required_client_features {} >> last_failure 0 >> last_failure_osd_epoch 12321 >> compat compat={},rocompat={},incompat={1=base v0.20,2=client >> writeable >> ranges,3=default file layouts on dirs,4=dir inode in separate >> object,5=mds uses versioned encoding,6=dirfrag is stored in >> omap,7=mds >> uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2} >> max_mds 1 >> in 0 >> up {0=60984167} >> failed >> damaged >> stopped >> data_pools [4] >> metadata_pool 5 >> inline_data disabled >> balancer >> standby_count_wanted 1 >> [mds.mds01.ceph05.pqxmvt{0:60984167} state up:replay seq 37637 addr >> [v2:192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694 >> >> <http://192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694>] >> compat {c=[1],r=[1],i=[7ff]}] >> >> >> Filesystem 'cephfs_insecure' (3) >> fs_name cephfs_insecure >> epoch 143849 >> flags 12 joinable allow_snaps allow_multimds_snaps >> created 2023-01-14T14:22:46.360062+0000 >> modified 2023-01-16T09:00:52.632163+0000 >> tableserver 0 >> root 0 >> session_timeout 60 >> session_autoclose 300 >> max_file_size 1099511627776 >> required_client_features {} >> last_failure 0 >> last_failure_osd_epoch 12319 >> compat compat={},rocompat={},incompat={1=base v0.20,2=client >> writeable >> ranges,3=default file layouts on dirs,4=dir inode in separate >> object,5=mds uses versioned encoding,6=dirfrag is stored in >> omap,7=mds >> uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2} >> max_mds 1 >> in 0 >> up {0=60984134} >> failed >> damaged >> stopped >> data_pools [7] >> metadata_pool 6 >> inline_data disabled >> balancer >> standby_count_wanted 1 >> [mds.mds01.ceph04.cvdhsx{0:60984134} state up:replay seq 37639 addr >> [v2:192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515 >> >> <http://192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515>] >> compat {c=[1],r=[1],i=[7ff]}] >> >> >> Standby daemons: >> >> [mds.mds01.ceph07.omdisd{-1:60984161} state up:standby seq 2 addr >> [v2:192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192 >> >> <http://192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192>] >> compat >> {c=[1],r=[1],i=[7ff]}] >> [mds.mds01.ceph06.hsuhqd{-1:60984828} state up:standby seq 1 addr >> [v2:192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518 >> >> <http://192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518>] >> compat {c=[1],r=[1],i=[7ff]}] >> dumped fsmap epoch 143850 >> >> ############################# >> >> [ceph: root@ceph06 /]# ceph fs status >> >> (doesn't come back) >> >> ############################# >> >> All MDS show log lines similar to this one: >> >> Jan 16 10:05:00 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143927 from mon.1 >> Jan 16 10:05:05 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143929 from mon.1 >> Jan 16 10:05:09 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143930 from mon.1 >> Jan 16 10:05:13 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143931 from mon.1 >> Jan 16 10:05:20 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143933 from mon.1 >> Jan 16 10:05:24 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143935 from mon.1 >> Jan 16 10:05:29 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143936 from mon.1 >> Jan 16 10:05:33 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143937 from mon.1 >> Jan 16 10:05:40 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143939 from mon.1 >> Jan 16 10:05:44 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143941 from mon.1 >> Jan 16 10:05:49 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >> Updating >> MDS map to version 143942 from mon.1 >> >> Anything else, I can provide? >> >> Cheers and thanks again! >> Thomas >> >> On 16.01.23 06:01, Kotresh Hiremath Ravishankar wrote: >> > Hi Thomas, >> > >> > As the documentation says, the MDS enters up:resolve from >> |up:replay| if >> > the Ceph file system has multiple ranks (including this one), >> i.e. it’s >> > not a single active MDS cluster. >> > The MDS is resolving any uncommitted inter-MDS operations. All >> ranks in >> > the file system must be in this state or later for progress to be >> made, >> > i.e. no rank can be failed/damaged or |up:replay|. >> > >> > So please check the status of the other active mds if it's >> failed. >> > >> > Also please share the mds logs and the output of 'ceph fs dump' >> and >> > 'ceph fs status' >> > >> > Thanks, >> > Kotresh H R >> > >> > On Sat, Jan 14, 2023 at 9:07 PM Thomas Widhalm >> > <thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> >> <mailto:thomas.widhalm@netways.de >> <mailto:thomas.widhalm@netways.de>>> wrote: >> > >> > Hi, >> > >> > I'm really lost with my Ceph system. I built a small cluster >> for home >> > usage which has two uses for me: I want to replace an old NAS >> and I want >> > to learn about Ceph so that I have hands-on experience. We're >> using it >> > in our company but I need some real-life experience without >> risking any >> > company or customers data. That's my preferred way of >> learning. >> > >> > The cluster consists of 3 Raspberry Pis plus a few VMs >> running on >> > Proxmox. I'm not using Proxmox' built in Ceph because I want >> to focus on >> > Ceph and not just use it as a preconfigured tool. >> > >> > All hosts are running Fedora (x86_64 and arm64) and during an >> Upgrade >> > from F36 to F37 my cluster suddenly showed all PGs as >> unavailable. I >> > worked nearly a week to get it back online and I learned a >> lot about >> > Ceph management and recovery. The cluster is back but I still >> can't >> > access my data. Maybe you can help me? >> > >> > Here are my versions: >> > >> > [ceph: root@ceph04 /]# ceph versions >> > { >> > "mon": { >> > "ceph version 17.2.5 >> > (98318ae89f1a893a6ded3a640405cdbb33e08757) >> > quincy (stable)": 3 >> > }, >> > "mgr": { >> > "ceph version 17.2.5 >> > (98318ae89f1a893a6ded3a640405cdbb33e08757) >> > quincy (stable)": 3 >> > }, >> > "osd": { >> > "ceph version 17.2.5 >> > (98318ae89f1a893a6ded3a640405cdbb33e08757) >> > quincy (stable)": 5 >> > }, >> > "mds": { >> > "ceph version 17.2.5 >> > (98318ae89f1a893a6ded3a640405cdbb33e08757) >> > quincy (stable)": 4 >> > }, >> > "overall": { >> > "ceph version 17.2.5 >> > (98318ae89f1a893a6ded3a640405cdbb33e08757) >> > quincy (stable)": 15 >> > } >> > } >> > >> > >> > Here's MDS status output of one MDS: >> > [ceph: root@ceph04 /]# ceph tell mds.mds01.ceph05.pqxmvt >> status >> > 2023-01-14T15:30:28.607+0000 7fb9e17fa700 0 client.60986454 >> > ms_handle_reset on v2:192.168.23.65:6800/2680651694 >> <http://192.168.23.65:6800/2680651694> >> > <http://192.168.23.65:6800/2680651694 >> <http://192.168.23.65:6800/2680651694>> >> > 2023-01-14T15:30:28.640+0000 7fb9e17fa700 0 client.60986460 >> > ms_handle_reset on v2:192.168.23.65:6800/2680651694 >> <http://192.168.23.65:6800/2680651694> >> > <http://192.168.23.65:6800/2680651694 >> <http://192.168.23.65:6800/2680651694>> >> > { >> > "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", >> > "whoami": 0, >> > "id": 60984167, >> > "want_state": "up:replay", >> > "state": "up:replay", >> > "fs_name": "cephfs", >> > "replay_status": { >> > "journal_read_pos": 0, >> > "journal_write_pos": 0, >> > "journal_expire_pos": 0, >> > "num_events": 0, >> > "num_segments": 0 >> > }, >> > "rank_uptime": 1127.54018615, >> > "mdsmap_epoch": 98056, >> > "osdmap_epoch": 12362, >> > "osdmap_epoch_barrier": 0, >> > "uptime": 1127.957307273 >> > } >> > >> > It's staying like that for days now. If there was a counter >> moving, I >> > just would wait but it doesn't change anything and alle stats >> says, the >> > MDS aren't working at all. >> > >> > The symptom I have is that Dashboard and all other tools I >> use say, it's >> > more or less ok. (Some old messages about failed daemons and >> scrubbing >> > aside). But I can't mount anything. When I try to start a VM >> that's on >> > RDS I just get a timeout. And when I try to mount a CephFS, >> mount just >> > hangs forever. >> > >> > Whatever command I give MDS or journal, it just hangs. The >> only thing I >> > could do, was take all CephFS offline, kill the MDS's and do >> a "ceph fs >> > reset <fs name> --yes-i-really-mean-it". After that I >> rebooted all >> > nodes, just to be sure but I still have no access to data. >> > >> > Could you please help me? I'm kinda desperate. If you need >> any more >> > information, just let me know. >> > >> > Cheers, >> > Thomas >> > >> > -- >> > Thomas Widhalm >> > Lead Systems Engineer >> > >> > NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | >> > D-90429 Nuernberg >> > Tel: +49 911 92885-0 | Fax: +49 911 92885-77 >> > CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 >> > https://www.netways.de <https://www.netways.de> >> <https://www.netways.de <https://www.netways.de>> | >> > thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> >> <mailto:thomas.widhalm@netways.de >> <mailto:thomas.widhalm@netways.de>> >> > >> > ** stackconf 2023 - September - https://stackconf.eu >> <https://stackconf.eu> >> > <https://stackconf.eu <https://stackconf.eu>> ** >> > ** OSMC 2023 - November - https://osmc.de <https://osmc.de> >> <https://osmc.de <https://osmc.de>> ** >> > ** New at NWS: Managed Database - >> > https://nws.netways.de/managed-database >> <https://nws.netways.de/managed-database> >> > <https://nws.netways.de/managed-database >> <https://nws.netways.de/managed-database>> ** >> > ** NETWAYS Web Services - https://nws.netways.de >> <https://nws.netways.de> >> > <https://nws.netways.de <https://nws.netways.de>> ** >> > _______________________________________________ >> > ceph-users mailing list -- ceph-users(a)ceph.io >> <mailto:ceph-users@ceph.io> >> > <mailto:ceph-users@ceph.io <mailto:ceph-users@ceph.io>> >> > To unsubscribe send an email to ceph-users-leave(a)ceph.io >> <mailto:ceph-users-leave@ceph.io> >> > <mailto:ceph-users-leave@ceph.io >> <mailto:ceph-users-leave@ceph.io>> >> > >> >> -- >> Thomas Widhalm >> Lead Systems Engineer >> >> NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | >> D-90429 Nuernberg >> Tel: +49 911 92885-0 | Fax: +49 911 92885-77 >> CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 >> https://www.netways.de <https://www.netways.de> | >> thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> >> >> ** stackconf 2023 - September - https://stackconf.eu >> <https://stackconf.eu> ** >> ** OSMC 2023 - November - https://osmc.de <https://osmc.de> ** >> ** New at NWS: Managed Database - >> https://nws.netways.de/managed-database >> <https://nws.netways.de/managed-database> ** >> ** NETWAYS Web Services - https://nws.netways.de >> <https://nws.netways.de> ** >> > > -- > Thomas Widhalm > Lead Systems Engineer > > NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 > Nuernberg > Tel: +49 911 92885-0 | Fax: +49 911 92885-77 > CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 > https://www.netways.de | thomas.widhalm(a)netways.de > > ** stackconf 2023 - September - https://stackconf.eu ** > ** OSMC 2023 - November - https://osmc.de ** > ** New at NWS: Managed Database - > https://nws.netways.de/managed-database ** > ** NETWAYS Web Services - https://nws.netways.de ** > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io -- Thomas Widhalm Lead Systems Engineer NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg Tel: +49 911 92885-0 | Fax: +49 911 92885-77 CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 https://www.netways.de | thomas.widhalm(a)netways.de ** stackconf 2023 - September - https://stackconf.eu ** ** OSMC 2023 - November - https://osmc.de ** ** New at NWS: Managed Database - https://nws.netways.de/managed-database ** ** NETWAYS Web Services - https://nws.netways.de ** _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

-- Cheers, Venky

adjurdjevic＠gmail.com

23 Jan 23 Jan

5:50 a.m.

Hello Thomas, I have same issue with mds like you describe, and ceph version is a same. Did state up:replay ever finish in your case? Thx Aleksandar

Venky Shankar

11:06 p.m.

On Tue, Jan 24, 2023 at 1:34 AM <adjurdjevic(a)gmail.com> wrote:

...

Hello Thomas, I have same issue with mds like you describe, and ceph version is a same. Did state up:replay ever finish in your case?

There is probably much going on with Thomas's cluster which is blocking the mds to make progress. Could you upload logs here? - https://tracker.ceph.com/issues/58489

...

Thx Aleksandar _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

-- Cheers, Venky

Thomas Widhalm

25 Jan 25 Jan

12:33 p.m.

Hi, Sorry for the delay. As I told Venky directly, there seems to be a problem with DMARC handling of the Ceph users list. So it was blocked by the company I work for. So I'm writing from my personal e-mail address, now. Did I miss something? Venky, you said, that, as soon as the underlying issue is solved, my filesystems should come up again. Is there anything I can do to help with solving? Or do I need to wait for the bug to be solved and then upgrade my Ceph while CephFS is still broken? I'm still seeing both MDS counting up seq numbers for days now. That really puzzles me because at least one of them hasn't seen changes for weeks before the crash. Cheers, Thomas On 20.01.23 04:37, Venky Shankar wrote:

...

Hi Thomas, On Thu, Jan 19, 2023 at 7:15 PM Thomas Widhalm <thomas.widhalm(a)netways.de> wrote:

The journal flush did work which is an indication that the mds made progress.

Hmmm.. the MDSs are going through the replay state which is expected. Do you see the MDSs crashing again?

On 19.01.23 14:01, Venky Shankar wrote:

Hi Thomas, On Tue, Jan 17, 2023 at 5:34 PM Thomas Widhalm <thomas.widhalm(a)netways.de> wrote:

and *** Caught signal (Aborted) ** in thread 7fccc7153700 thread_name:md_log_replay ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable) 1: /lib64/libpthread.so.0(+0x12cf0) [0x7fccd6593cf0] 2: gsignal() 3: abort() 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7fccd7599499] 5: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605] 6: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c) [0x55fb2b98e89c] 7: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0] 8: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443] 9: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31] 10: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca] 11: clone() NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. Is what I found in the logs. Since it's referring to log replaying, could this be related to my issue? On 17.01.23 10:54, Thomas Widhalm wrote: > Hi again, > > Another thing I found: Out of pure desperation, I started MDS on all > nodes. I had them configured in the past so I was hoping, they could > help with bringing in missing data even when they were down for quite a > while now. I didn't see any changes in the logs but the CPU on the hosts > that usually don't run MDS just spiked. So high I had to kill the MDS > again because otherwise they kept killing OSD containers. So I don't > really have any new information, but maybe that could be a hint of some > kind? > > Cheers, > Thomas > > On 17.01.23 10:13, Thomas Widhalm wrote: >> Hi, >> >> Thanks again. :-) >> >> Ok, that seems like an error to me. I never configured an extra rank for >> MDS. Maybe that's where my knowledge failed me but I guess, MDS is >> waiting for something that was never there. >> >> Yes, there are two filesystems. Due to "budget restrictions" (it's my >> personal system at home, I configured a second CephFS with only one >> replica for data that could be easily restored. >> >> Here's what I got when turning up the debug level: >> >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting >> interval 1.000000000s >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt >> Sending beacon up:replay seq 11107 >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt >> sender thread waiting interval 4s >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt >> received beacon reply up:replay seq 11107 rtt 0.00200002 >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167 >> schedule_update_timer_task >> Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total >> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, >> 0 caps, 0 caps per inode >> Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for >> trimming >> Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting >> interval 1.000000000s >> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total >> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, >> 0 caps, 0 caps per inode >> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for >> trimming >> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting >> interval 1.000000000s >> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status >> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.158167 >> schedule_update_timer_task >> Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total >> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, >> 0 caps, 0 caps per inode >> Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for >> trimming >> Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting >> interval 1.000000000s >> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total >> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, >> 0 caps, 0 caps per inode >> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for >> trimming >> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting >> interval 1.000000000s >> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt >> Sending beacon up:replay seq 11108 >> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt >> sender thread waiting interval 4s >> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt >> received beacon reply up:replay seq 11108 rtt 0.00200002 >> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status >> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.158167 >> schedule_update_timer_task >> Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total >> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, >> 0 caps, 0 caps per inode >> Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for >> trimming >> Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting >> interval 1.000000000s >> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total >> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, >> 0 caps, 0 caps per inode >> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for >> trimming >> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting >> interval 1.000000000s >> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status >> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.158167 >> schedule_update_timer_task >> Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total >> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, >> 0 caps, 0 caps per inode >> Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for >> trimming >> Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting >> interval 1.000000000s >> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total >> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, >> 0 caps, 0 caps per inode >> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for >> trimming >> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting >> interval 1.000000000s >> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt >> Sending beacon up:replay seq 11109 >> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt >> sender thread waiting interval 4s >> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt >> received beacon reply up:replay seq 11109 rtt 0.00600006 >> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status >> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.158167 >> schedule_update_timer_task >> Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total >> 372640, rss 57344, heap 207124, baseline 182548, 0 / 3 inodes have caps, >> 0 caps, 0 caps per inode >> Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for >> trimming >> Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache releasing free memory >> Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting >> interval 1.000000000s >> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total >> 372640, rss 57272, heap 207124, baseline 182548, 0 / 3 inodes have caps, >> 0 caps, 0 caps per inode >> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for >> trimming >> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting >> interval 1.000000000s >> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status >> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.158167 >> schedule_update_timer_task >> Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total >> 372640, rss 57040, heap 207124, baseline 182548, 0 / 3 inodes have caps, >> 0 caps, 0 caps per inode >> Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for >> trimming >> Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting >> interval 1.000000000s >> >> >> The only thing that gives me hope here is that the line >> mds.beacon.mds01.ceph05.pqxmvt Sending beacon up:replay seq 11109 is >> chaning its sequence number. >> >> Anything else I can provide? >> >> Cheers, >> Thomas >> >> On 17.01.23 06:27, Kotresh Hiremath Ravishankar wrote: >>> Hi Thomas, >>> >>> Sorry, I misread the mds state to be stuck in 'up:resolve' state. The >>> mds is stuck in 'up:replay' which means the MDS taking over a failed >>> rank. >>> This state represents that the MDS is recovering its journal and other >>> metadata. >>> >>> I notice that there are two filesystems 'cephfs' and 'cephfs_insecure' >>> and the active mds for both filesystems are stuck in 'up:replay'. The >>> mds >>> logs shared are not providing much information to infer anything. >>> >>> Could you please enable the debug logs and pass on the mds logs ? >>> >>> Thanks, >>> Kotresh H R >>> >>> On Mon, Jan 16, 2023 at 2:38 PM Thomas Widhalm >>> <thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de>> wrote: >>> >>> Hi Kotresh, >>> >>> Thanks for your reply! >>> >>> I only have one rank. Here's the output of all MDS I have: >>> >>> ################### >>> >>> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph05.pqxmvt status >>> 2023-01-16T08:55:26.055+0000 7f3412ffd700 0 client.61249926 >>> ms_handle_reset on v2:192.168.23.65:6800/2680651694 >>> <http://192.168.23.65:6800/2680651694> >>> 2023-01-16T08:55:26.084+0000 7f3412ffd700 0 client.61299199 >>> ms_handle_reset on v2:192.168.23.65:6800/2680651694 >>> <http://192.168.23.65:6800/2680651694> >>> { >>> "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", >>> "whoami": 0, >>> "id": 60984167, >>> "want_state": "up:replay", >>> "state": "up:replay", >>> "fs_name": "cephfs", >>> "replay_status": { >>> "journal_read_pos": 0, >>> "journal_write_pos": 0, >>> "journal_expire_pos": 0, >>> "num_events": 0, >>> "num_segments": 0 >>> }, >>> "rank_uptime": 150224.982558844, >>> "mdsmap_epoch": 143757, >>> "osdmap_epoch": 12395, >>> "osdmap_epoch_barrier": 0, >>> "uptime": 150225.39968057699 >>> } >>> >>> ######################## >>> >>> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph04.cvdhsx status >>> 2023-01-16T08:59:05.434+0000 7fdb82ff5700 0 client.61299598 >>> ms_handle_reset on v2:192.168.23.64:6800/3930607515 >>> <http://192.168.23.64:6800/3930607515> >>> 2023-01-16T08:59:05.466+0000 7fdb82ff5700 0 client.61299604 >>> ms_handle_reset on v2:192.168.23.64:6800/3930607515 >>> <http://192.168.23.64:6800/3930607515> >>> { >>> "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", >>> "whoami": 0, >>> "id": 60984134, >>> "want_state": "up:replay", >>> "state": "up:replay", >>> "fs_name": "cephfs_insecure", >>> "replay_status": { >>> "journal_read_pos": 0, >>> "journal_write_pos": 0, >>> "journal_expire_pos": 0, >>> "num_events": 0, >>> "num_segments": 0 >>> }, >>> "rank_uptime": 150450.96934037199, >>> "mdsmap_epoch": 143815, >>> "osdmap_epoch": 12395, >>> "osdmap_epoch_barrier": 0, >>> "uptime": 150451.93533502301 >>> } >>> >>> ########################### >>> >>> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph06.wcfdom status >>> 2023-01-16T08:59:28.572+0000 7f16538c0b80 -1 client.61250376 >>> resolve_mds: no MDS daemons found by name `mds01.ceph06.wcfdom' >>> 2023-01-16T08:59:28.583+0000 7f16538c0b80 -1 client.61250376 FSMap: >>> cephfs:1/1 cephfs_insecure:1/1 >>> >>> {cephfs:0=mds01.ceph05.pqxmvt=up:replay,cephfs_insecure:0=mds01.ceph04.cvdhsx=up:replay} >>> 2 up:standby >>> Error ENOENT: problem getting command descriptions from >>> mds.mds01.ceph06.wcfdom >>> >>> ############################ >>> >>> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph07.omdisd status >>> 2023-01-16T09:00:02.802+0000 7fb7affff700 0 client.61250454 >>> ms_handle_reset on v2:192.168.23.67:6800/942898192 >>> <http://192.168.23.67:6800/942898192> >>> 2023-01-16T09:00:02.831+0000 7fb7affff700 0 client.61299751 >>> ms_handle_reset on v2:192.168.23.67:6800/942898192 >>> <http://192.168.23.67:6800/942898192> >>> { >>> "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", >>> "whoami": -1, >>> "id": 60984161, >>> "want_state": "up:standby", >>> "state": "up:standby", >>> "mdsmap_epoch": 97687, >>> "osdmap_epoch": 0, >>> "osdmap_epoch_barrier": 0, >>> "uptime": 150508.29091721401 >>> } >>> >>> The error message from ceph06 is new to me. That didn't happen the >>> last >>> times. >>> >>> [ceph: root@ceph06 /]# ceph fs dump >>> e143850 >>> enable_multiple, ever_enabled_multiple: 1,1 >>> default compat: compat={},rocompat={},incompat={1=base >>> v0.20,2=client >>> writeable ranges,3=default file layouts on dirs,4=dir inode in >>> separate >>> object,5=mds uses versioned encoding,6=dirfrag is stored in >>> omap,8=no >>> anchor table,9=file layout v2,10=snaprealm v2} >>> legacy client fscid: 2 >>> >>> Filesystem 'cephfs' (2) >>> fs_name cephfs >>> epoch 143850 >>> flags 12 joinable allow_snaps allow_multimds_snaps >>> created 2023-01-14T14:30:05.723421+0000 >>> modified 2023-01-16T09:00:53.663007+0000 >>> tableserver 0 >>> root 0 >>> session_timeout 60 >>> session_autoclose 300 >>> max_file_size 1099511627776 >>> required_client_features {} >>> last_failure 0 >>> last_failure_osd_epoch 12321 >>> compat compat={},rocompat={},incompat={1=base v0.20,2=client >>> writeable >>> ranges,3=default file layouts on dirs,4=dir inode in separate >>> object,5=mds uses versioned encoding,6=dirfrag is stored in >>> omap,7=mds >>> uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2} >>> max_mds 1 >>> in 0 >>> up {0=60984167} >>> failed >>> damaged >>> stopped >>> data_pools [4] >>> metadata_pool 5 >>> inline_data disabled >>> balancer >>> standby_count_wanted 1 >>> [mds.mds01.ceph05.pqxmvt{0:60984167} state up:replay seq 37637 addr >>> [v2:192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694 >>> >>> <http://192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694>] >>> compat {c=[1],r=[1],i=[7ff]}] >>> >>> >>> Filesystem 'cephfs_insecure' (3) >>> fs_name cephfs_insecure >>> epoch 143849 >>> flags 12 joinable allow_snaps allow_multimds_snaps >>> created 2023-01-14T14:22:46.360062+0000 >>> modified 2023-01-16T09:00:52.632163+0000 >>> tableserver 0 >>> root 0 >>> session_timeout 60 >>> session_autoclose 300 >>> max_file_size 1099511627776 >>> required_client_features {} >>> last_failure 0 >>> last_failure_osd_epoch 12319 >>> compat compat={},rocompat={},incompat={1=base v0.20,2=client >>> writeable >>> ranges,3=default file layouts on dirs,4=dir inode in separate >>> object,5=mds uses versioned encoding,6=dirfrag is stored in >>> omap,7=mds >>> uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2} >>> max_mds 1 >>> in 0 >>> up {0=60984134} >>> failed >>> damaged >>> stopped >>> data_pools [7] >>> metadata_pool 6 >>> inline_data disabled >>> balancer >>> standby_count_wanted 1 >>> [mds.mds01.ceph04.cvdhsx{0:60984134} state up:replay seq 37639 addr >>> [v2:192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515 >>> >>> <http://192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515>] >>> compat {c=[1],r=[1],i=[7ff]}] >>> >>> >>> Standby daemons: >>> >>> [mds.mds01.ceph07.omdisd{-1:60984161} state up:standby seq 2 addr >>> [v2:192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192 >>> >>> <http://192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192>] >>> compat >>> {c=[1],r=[1],i=[7ff]}] >>> [mds.mds01.ceph06.hsuhqd{-1:60984828} state up:standby seq 1 addr >>> [v2:192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518 >>> >>> <http://192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518>] >>> compat {c=[1],r=[1],i=[7ff]}] >>> dumped fsmap epoch 143850 >>> >>> ############################# >>> >>> [ceph: root@ceph06 /]# ceph fs status >>> >>> (doesn't come back) >>> >>> ############################# >>> >>> All MDS show log lines similar to this one: >>> >>> Jan 16 10:05:00 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >>> Updating >>> MDS map to version 143927 from mon.1 >>> Jan 16 10:05:05 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >>> Updating >>> MDS map to version 143929 from mon.1 >>> Jan 16 10:05:09 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >>> Updating >>> MDS map to version 143930 from mon.1 >>> Jan 16 10:05:13 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >>> Updating >>> MDS map to version 143931 from mon.1 >>> Jan 16 10:05:20 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >>> Updating >>> MDS map to version 143933 from mon.1 >>> Jan 16 10:05:24 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >>> Updating >>> MDS map to version 143935 from mon.1 >>> Jan 16 10:05:29 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >>> Updating >>> MDS map to version 143936 from mon.1 >>> Jan 16 10:05:33 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >>> Updating >>> MDS map to version 143937 from mon.1 >>> Jan 16 10:05:40 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >>> Updating >>> MDS map to version 143939 from mon.1 >>> Jan 16 10:05:44 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >>> Updating >>> MDS map to version 143941 from mon.1 >>> Jan 16 10:05:49 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx >>> Updating >>> MDS map to version 143942 from mon.1 >>> >>> Anything else, I can provide? >>> >>> Cheers and thanks again! >>> Thomas >>> >>> On 16.01.23 06:01, Kotresh Hiremath Ravishankar wrote: >>> > Hi Thomas, >>> > >>> > As the documentation says, the MDS enters up:resolve from >>> |up:replay| if >>> > the Ceph file system has multiple ranks (including this one), >>> i.e. it’s >>> > not a single active MDS cluster. >>> > The MDS is resolving any uncommitted inter-MDS operations. All >>> ranks in >>> > the file system must be in this state or later for progress to be >>> made, >>> > i.e. no rank can be failed/damaged or |up:replay|. >>> > >>> > So please check the status of the other active mds if it's >>> failed. >>> > >>> > Also please share the mds logs and the output of 'ceph fs dump' >>> and >>> > 'ceph fs status' >>> > >>> > Thanks, >>> > Kotresh H R >>> > >>> > On Sat, Jan 14, 2023 at 9:07 PM Thomas Widhalm >>> > <thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> >>> <mailto:thomas.widhalm@netways.de >>> <mailto:thomas.widhalm@netways.de>>> wrote: >>> > >>> > Hi, >>> > >>> > I'm really lost with my Ceph system. I built a small cluster >>> for home >>> > usage which has two uses for me: I want to replace an old NAS >>> and I want >>> > to learn about Ceph so that I have hands-on experience. We're >>> using it >>> > in our company but I need some real-life experience without >>> risking any >>> > company or customers data. That's my preferred way of >>> learning. >>> > >>> > The cluster consists of 3 Raspberry Pis plus a few VMs >>> running on >>> > Proxmox. I'm not using Proxmox' built in Ceph because I want >>> to focus on >>> > Ceph and not just use it as a preconfigured tool. >>> > >>> > All hosts are running Fedora (x86_64 and arm64) and during an >>> Upgrade >>> > from F36 to F37 my cluster suddenly showed all PGs as >>> unavailable. I >>> > worked nearly a week to get it back online and I learned a >>> lot about >>> > Ceph management and recovery. The cluster is back but I still >>> can't >>> > access my data. Maybe you can help me? >>> > >>> > Here are my versions: >>> > >>> > [ceph: root@ceph04 /]# ceph versions >>> > { >>> > "mon": { >>> > "ceph version 17.2.5 >>> > (98318ae89f1a893a6ded3a640405cdbb33e08757) >>> > quincy (stable)": 3 >>> > }, >>> > "mgr": { >>> > "ceph version 17.2.5 >>> > (98318ae89f1a893a6ded3a640405cdbb33e08757) >>> > quincy (stable)": 3 >>> > }, >>> > "osd": { >>> > "ceph version 17.2.5 >>> > (98318ae89f1a893a6ded3a640405cdbb33e08757) >>> > quincy (stable)": 5 >>> > }, >>> > "mds": { >>> > "ceph version 17.2.5 >>> > (98318ae89f1a893a6ded3a640405cdbb33e08757) >>> > quincy (stable)": 4 >>> > }, >>> > "overall": { >>> > "ceph version 17.2.5 >>> > (98318ae89f1a893a6ded3a640405cdbb33e08757) >>> > quincy (stable)": 15 >>> > } >>> > } >>> > >>> > >>> > Here's MDS status output of one MDS: >>> > [ceph: root@ceph04 /]# ceph tell mds.mds01.ceph05.pqxmvt >>> status >>> > 2023-01-14T15:30:28.607+0000 7fb9e17fa700 0 client.60986454 >>> > ms_handle_reset on v2:192.168.23.65:6800/2680651694 >>> <http://192.168.23.65:6800/2680651694> >>> > <http://192.168.23.65:6800/2680651694 >>> <http://192.168.23.65:6800/2680651694>> >>> > 2023-01-14T15:30:28.640+0000 7fb9e17fa700 0 client.60986460 >>> > ms_handle_reset on v2:192.168.23.65:6800/2680651694 >>> <http://192.168.23.65:6800/2680651694> >>> > <http://192.168.23.65:6800/2680651694 >>> <http://192.168.23.65:6800/2680651694>> >>> > { >>> > "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4", >>> > "whoami": 0, >>> > "id": 60984167, >>> > "want_state": "up:replay", >>> > "state": "up:replay", >>> > "fs_name": "cephfs", >>> > "replay_status": { >>> > "journal_read_pos": 0, >>> > "journal_write_pos": 0, >>> > "journal_expire_pos": 0, >>> > "num_events": 0, >>> > "num_segments": 0 >>> > }, >>> > "rank_uptime": 1127.54018615, >>> > "mdsmap_epoch": 98056, >>> > "osdmap_epoch": 12362, >>> > "osdmap_epoch_barrier": 0, >>> > "uptime": 1127.957307273 >>> > } >>> > >>> > It's staying like that for days now. If there was a counter >>> moving, I >>> > just would wait but it doesn't change anything and alle stats >>> says, the >>> > MDS aren't working at all. >>> > >>> > The symptom I have is that Dashboard and all other tools I >>> use say, it's >>> > more or less ok. (Some old messages about failed daemons and >>> scrubbing >>> > aside). But I can't mount anything. When I try to start a VM >>> that's on >>> > RDS I just get a timeout. And when I try to mount a CephFS, >>> mount just >>> > hangs forever. >>> > >>> > Whatever command I give MDS or journal, it just hangs. The >>> only thing I >>> > could do, was take all CephFS offline, kill the MDS's and do >>> a "ceph fs >>> > reset <fs name> --yes-i-really-mean-it". After that I >>> rebooted all >>> > nodes, just to be sure but I still have no access to data. >>> > >>> > Could you please help me? I'm kinda desperate. If you need >>> any more >>> > information, just let me know. >>> > >>> > Cheers, >>> > Thomas >>> > >>> > -- >>> > Thomas Widhalm >>> > Lead Systems Engineer >>> > >>> > NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | >>> > D-90429 Nuernberg >>> > Tel: +49 911 92885-0 | Fax: +49 911 92885-77 >>> > CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 >>> > https://www.netways.de <https://www.netways.de> >>> <https://www.netways.de <https://www.netways.de>> | >>> > thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> >>> <mailto:thomas.widhalm@netways.de >>> <mailto:thomas.widhalm@netways.de>> >>> > >>> > ** stackconf 2023 - September - https://stackconf.eu >>> <https://stackconf.eu> >>> > <https://stackconf.eu <https://stackconf.eu>> ** >>> > ** OSMC 2023 - November - https://osmc.de <https://osmc.de> >>> <https://osmc.de <https://osmc.de>> ** >>> > ** New at NWS: Managed Database - >>> > https://nws.netways.de/managed-database >>> <https://nws.netways.de/managed-database> >>> > <https://nws.netways.de/managed-database >>> <https://nws.netways.de/managed-database>> ** >>> > ** NETWAYS Web Services - https://nws.netways.de >>> <https://nws.netways.de> >>> > <https://nws.netways.de <https://nws.netways.de>> ** >>> > _______________________________________________ >>> > ceph-users mailing list -- ceph-users(a)ceph.io >>> <mailto:ceph-users@ceph.io> >>> > <mailto:ceph-users@ceph.io <mailto:ceph-users@ceph.io>> >>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io >>> <mailto:ceph-users-leave@ceph.io> >>> > <mailto:ceph-users-leave@ceph.io >>> <mailto:ceph-users-leave@ceph.io>> >>> > >>> >>> -- >>> Thomas Widhalm >>> Lead Systems Engineer >>> >>> NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | >>> D-90429 Nuernberg >>> Tel: +49 911 92885-0 | Fax: +49 911 92885-77 >>> CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 >>> https://www.netways.de <https://www.netways.de> | >>> thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de> >>> >>> ** stackconf 2023 - September - https://stackconf.eu >>> <https://stackconf.eu> ** >>> ** OSMC 2023 - November - https://osmc.de <https://osmc.de> ** >>> ** New at NWS: Managed Database - >>> https://nws.netways.de/managed-database >>> <https://nws.netways.de/managed-database> ** >>> ** NETWAYS Web Services - https://nws.netways.de >>> <https://nws.netways.de> ** >>> >> >> -- >> Thomas Widhalm >> Lead Systems Engineer >> >> NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 >> Nuernberg >> Tel: +49 911 92885-0 | Fax: +49 911 92885-77 >> CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 >> https://www.netways.de | thomas.widhalm(a)netways.de >> >> ** stackconf 2023 - September - https://stackconf.eu ** >> ** OSMC 2023 - November - https://osmc.de ** >> ** New at NWS: Managed Database - >> https://nws.netways.de/managed-database ** >> ** NETWAYS Web Services - https://nws.netways.de ** >> _______________________________________________ >> ceph-users mailing list -- ceph-users(a)ceph.io >> To unsubscribe send an email to ceph-users-leave(a)ceph.io > > -- > Thomas Widhalm > Lead Systems Engineer > > NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 > Nuernberg > Tel: +49 911 92885-0 | Fax: +49 911 92885-77 > CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 > https://www.netways.de | thomas.widhalm(a)netways.de > > ** stackconf 2023 - September - https://stackconf.eu ** > ** OSMC 2023 - November - https://osmc.de ** > ** New at NWS: Managed Database - > https://nws.netways.de/managed-database ** > ** NETWAYS Web Services - https://nws.netways.de ** > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io -- Thomas Widhalm Lead Systems Engineer NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg Tel: +49 911 92885-0 | Fax: +49 911 92885-77 CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510 https://www.netways.de | thomas.widhalm(a)netways.de ** stackconf 2023 - September - https://stackconf.eu ** ** OSMC 2023 - November - https://osmc.de ** ** New at NWS: Managed Database - https://nws.netways.de/managed-database ** ** NETWAYS Web Services - https://nws.netways.de ** _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Patrick Donnelly

22 Feb 22 Feb

8:58 a.m.

On Wed, Jan 25, 2023 at 3:36 PM Thomas Widhalm <widhalmt(a)widhalm.or.at> wrote:

...

It is likely that the MDS is not able to communicate with the OSDs if it's stuck in up:replay. Use: ceph config set mds debug_ms 5 ceph config set mds debug_mds 10 and ceph fs fail X ceph fs set X joinable true to get fresh logs from the MDS to see what's going with the messages to the OSDs. -- Patrick Donnelly, Ph.D. He / Him / His Red Hat Partner Engineer IBM, Inc. GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

Thomas Widhalm

9:07 a.m.

Hi, Thanks for the idea! I tried it immediately but still, MDS are in up:replay mode. So far they haven't crashed but this usually takes a few minutes. So no effect so far. :-( Cheers, Thomas On 22.02.23 17:58, Patrick Donnelly wrote:

...

On Wed, Jan 25, 2023 at 3:36 PM Thomas Widhalm <widhalmt(a)widhalm.or.at> wrote:

Patrick Donnelly

9:36 a.m.

On Wed, Feb 22, 2023 at 12:10 PM Thomas Widhalm <widhalmt(a)widhalm.or.at> wrote:

...

Hi, Thanks for the idea! I tried it immediately but still, MDS are in up:replay mode. So far they haven't crashed but this usually takes a few minutes. So no effect so far. :-(

The commands I gave were for producing hopefully useful debug logs. Not intended to fix the problem for you. -- Patrick Donnelly, Ph.D. He / Him / His Red Hat Partner Engineer IBM, Inc. GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

Thomas Widhalm

1:56 p.m.

...

On Wed, Feb 22, 2023 at 12:10 PM Thomas Widhalm <widhalmt(a)widhalm.or.at> wrote:

Hi, Thanks for the idea! I tried it immediately but still, MDS are in up:replay mode. So far they haven't crashed but this usually takes a few minutes. So no effect so far. :-(

The commands I gave were for producing hopefully useful debug logs. Not intended to fix the problem for you.

Xiubo Li

4:34 p.m.

On 23/02/2023 05:56, Thomas Widhalm wrote:

...

Ah, sorry. My bad. The MDS crashed and I restarted them. And I'm waiting for them to crash again. There's a tracker for this or a related issue: https://tracker.ceph.com/issues/58489

Is the call trace the same with this tracker ? Thanks,

...

Is there any place I can upload you anything from the logs? I'm still a bit new to Ceph but I guess, you'd like to have the crash logs? Thank you in advance. Any help is really appreciated. My filesystems are still completely down. Cheers, Thomas On 22.02.23 18:36, Patrick Donnelly wrote:

On Wed, Feb 22, 2023 at 12:10 PM Thomas Widhalm <widhalmt(a)widhalm.or.at> wrote:

Hi, Thanks for the idea! I tried it immediately but still, MDS are in up:replay mode. So far they haven't crashed but this usually takes a few minutes. So no effect so far. :-(

The commands I gave were for producing hopefully useful debug logs. Not intended to fix the problem for you.

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

-- Best Regards, Xiubo Li (李秀波) Email: xiubli@redhat.com/xiubli@ibm.com Slack: @Xiubo Li

Thomas Widhalm

23 Feb 23 Feb

3:53 a.m.

Yes, it's still: ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f6bf079e43f] 2: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7f6bf079e605] 3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x56474700ece5] 4: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x59cd) [0x56474731140d] 5: (EUpdate::replay(MDSRank*)+0x40) [0x5647473125a0] 6: (MDLog::_replay_thread()+0x9b3) [0x564747298443] 7: (MDLog::ReplayThread::entry()+0x11) [0x564746f54e31] 8: /lib64/libpthread.so.0(+0x81ca) [0x7f6bef78e1ca] 9: clone() 0> 2023-02-22T17:07:28.647+0000 7f6be0358700 -1 *** Caught signal (Aborted) ** in thread 7f6be0358700 thread_name:md_log_replay ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable) 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f6bef798cf0] 2: gsignal() 3: abort() 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7f6bf079e499] 5: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7f6bf079e605] 6: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x56474700ece5] 7: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x59cd) [0x56474731140d] 8: (EUpdate::replay(MDSRank*)+0x40) [0x5647473125a0] 9: (MDLog::_replay_thread()+0x9b3) [0x564747298443] 10: (MDLog::ReplayThread::entry()+0x11) [0x564746f54e31] 11: /lib64/libpthread.so.0(+0x81ca) [0x7f6bef78e1ca] 12: clone() NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. If you need more, just let me know, please. On 23.02.23 01:34, Xiubo Li wrote:

...

On 23/02/2023 05:56, Thomas Widhalm wrote:

Ah, sorry. My bad. The MDS crashed and I restarted them. And I'm waiting for them to crash again. There's a tracker for this or a related issue: https://tracker.ceph.com/issues/58489

Is the call trace the same with this tracker ? Thanks,

On Wed, Feb 22, 2023 at 12:10 PM Thomas Widhalm <widhalmt(a)widhalm.or.at> wrote:

Hi, Thanks for the idea! I tried it immediately but still, MDS are in up:replay mode. So far they haven't crashed but this usually takes a few minutes. So no effect so far. :-(

The commands I gave were for producing hopefully useful debug logs. Not intended to fix the problem for you.

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Patrick Donnelly

12:03 p.m.

Please use: https://docs.ceph.com/en/quincy/man/8/ceph-post-file/ to share debug logs from the MDS. On Wed, Feb 22, 2023 at 4:56 PM Thomas Widhalm <widhalmt(a)widhalm.or.at> wrote:

...

Ah, sorry. My bad. The MDS crashed and I restarted them. And I'm waiting for them to crash again. There's a tracker for this or a related issue: https://tracker.ceph.com/issues/58489 Is there any place I can upload you anything from the logs? I'm still a bit new to Ceph but I guess, you'd like to have the crash logs? Thank you in advance. Any help is really appreciated. My filesystems are still completely down. Cheers, Thomas On 22.02.23 18:36, Patrick Donnelly wrote: > On Wed, Feb 22, 2023 at 12:10 PM Thomas Widhalm <widhalmt(a)widhalm.or.at> wrote: >> >> Hi, >> >> Thanks for the idea! >> >> I tried it immediately but still, MDS are in up:replay mode. So far they >> haven't crashed but this usually takes a few minutes. >> >> So no effect so far. :-( > > The commands I gave were for producing hopefully useful debug logs. > Not intended to fix the problem for you. >

-- Patrick Donnelly, Ph.D. He / Him / His Red Hat Partner Engineer IBM, Inc. GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

428

days inactive

468

days old

ceph-users@ceph.io

Manage subscription

23 comments

7 participants

tags (0)

participants (7)

adjurdjevic＠gmail.com
Kotresh Hiremath Ravishankar
Patrick Donnelly
Thomas Widhalm
Thomas Widhalm
Venky Shankar
Xiubo Li