Thank you. I'm setting the debug level now and waiting for authorization
on the Tracker. I'll upload the logs as soon as I can collect them.

Thank you so much for your help!
On 18.01.23 12:26, Kotresh Hiremath Ravishankar wrote:
Hi Thomas,
This looks like it requires more investigation than I expected. What's
the current status? Did the crashed MDS come back and become active?

Increase the debug log level to 20 and share the mds logs. I will create
a tracker and share it here. You can upload the mds logs there.
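
For example, something along these lines should do it (a sketch; adjust
the daemon name to your cluster):

    ceph config set mds debug_mds 20
    ceph config set mds debug_ms 1

or, at runtime for a single daemon:

    ceph tell mds.mds01.ceph05.pqxmvt config set debug_mds 20
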
Thanks,
Kotresh H R
On Tue, Jan 17, 2023 at 5:34 PM Thomas Widhalm
<thomas.widhalm@netways.de> wrote:
Another new thing that just happened:
One of the MDS just crashed out of nowhere.
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc: In function 'void EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)' thread 7fccc7153700 time 2023-01-17T10:05:15.420191+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc: 1625: FAILED ceph_assert(g_conf()->mds_wipe_sessions)
ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7fccd759943f]
2: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605]
3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c) [0x55fb2b98e89c]
4: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0]
5: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443]
6: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31]
7: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca]
8: clone()
and
*** Caught signal (Aborted) **
in thread 7fccc7153700 thread_name:md_log_replay
ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
1: /lib64/libpthread.so.0(+0x12cf0) [0x7fccd6593cf0]
2: gsignal()
3: abort()
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7fccd7599499]
5: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605]
6: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c) [0x55fb2b98e89c]
7: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0]
8: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443]
9: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31]
10: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca]
11: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
That's what I found in the logs. Since it's referring to log replay,
could this be related to my issue?
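
If it would help, I can also try to inspect the journal offline with
cephfs-journal-tool, e.g. (a sketch based on the docs, for rank 0 of
the "cephfs" filesystem):

    cephfs-journal-tool --rank=cephfs:0 journal inspect
    cephfs-journal-tool --rank=cephfs:0 header get
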
On 17.01.23 10:54, Thomas Widhalm wrote:
Hi again,
Another thing I found: out of pure desperation, I started MDS on all
nodes. I had them configured in the past, so I was hoping they could
help bring in missing data even though they had been down for quite a
while now. I didn't see any changes in the logs, but the CPU on the
hosts that usually don't run MDS just spiked. So high that I had to kill
the MDS again, because otherwise they kept killing OSD containers. So I
don't really have any new information, but maybe that could be a hint of
some kind?
Cheers,
Thomas
On 17.01.23 10:13, Thomas Widhalm wrote:
> Hi,
>
> Thanks again. :-)
>
> Ok, that seems like an error to me. I never configured an extra rank
> for MDS. Maybe that's where my knowledge failed me, but I guess the
> MDS is waiting for something that was never there.
>
> Yes, there are two filesystems. Due to "budget restrictions" (it's my
> personal system at home), I configured a second CephFS with only one
> replica for data that could be easily restored.
>
> Here's what I got when turning up the debug level:
>
> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt Sending beacon up:replay seq 11107
> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt sender thread waiting interval 4s
> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt received beacon reply up:replay seq 11107 rtt 0.00200002
> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167 schedule_update_timer_task
> Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
> Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming
> Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming
> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.158167 schedule_update_timer_task
> Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
> Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming
> Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming
> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt Sending beacon up:replay seq 11108
> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt sender thread waiting interval 4s
> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt received beacon reply up:replay seq 11108 rtt 0.00200002
> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.158167 schedule_update_timer_task
> Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
> Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming
> Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming
> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.158167 schedule_update_timer_task
> Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
> Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming
> Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming
> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt Sending beacon up:replay seq 11109
> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt sender thread waiting interval 4s
> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt received beacon reply up:replay seq 11109 rtt 0.00600006
> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.158167 schedule_update_timer_task
> Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57344, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
> Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming
> Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache releasing free memory
> Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57272, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming
> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.158167 schedule_update_timer_task
> Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total 372640, rss 57040, heap 207124, baseline 182548, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
> Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for trimming
> Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting interval 1.000000000s
>
>
> The only thing that gives me hope here is that the line
> "mds.beacon.mds01.ceph05.pqxmvt Sending beacon up:replay seq 11109" is
> changing its sequence number.
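>
> For reference, I'm watching this roughly like so (just a sketch, with
> the daemon name from above):
>
>     watch -n 5 "ceph tell mds.mds01.ceph05.pqxmvt status"
>
> The seq keeps incrementing while journal_read_pos stays at 0.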
>
> Anything else I can provide?
>
> Cheers,
> Thomas
>
> On 17.01.23 06:27, Kotresh Hiremath Ravishankar wrote:
>> Hi Thomas,
>>
>> Sorry, I misread the mds state to be stuck in 'up:resolve'. The mds
>> is actually stuck in 'up:replay', which means the MDS is taking over
>> a failed rank. This state represents that the MDS is recovering its
>> journal and other metadata.
>>
>> I notice that there are two filesystems, 'cephfs' and
>> 'cephfs_insecure', and the active mds for both filesystems is stuck
>> in 'up:replay'. The mds logs shared are not providing much
>> information to infer anything.
>>
>> Could you please enable the debug logs and pass on the mds logs?
>>
>> Thanks,
>> Kotresh H R
>>
>> On Mon, Jan 16, 2023 at 2:38 PM Thomas Widhalm
>> <thomas.widhalm@netways.de> wrote:
>>
>> Hi Kotresh,
>>
>> Thanks for your reply!
>>
>> I only have one rank. Here's the output of all MDS I have:
>>
>> ###################
>>
>> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph05.pqxmvt status
>> 2023-01-16T08:55:26.055+0000 7f3412ffd700 0 client.61249926 ms_handle_reset on v2:192.168.23.65:6800/2680651694
>> 2023-01-16T08:55:26.084+0000 7f3412ffd700 0 client.61299199 ms_handle_reset on v2:192.168.23.65:6800/2680651694
>> {
>> "cluster_fsid":
"ff6e50de-ed72-11ec-881c-dca6325c2cc4",
>> "whoami": 0,
>> "id": 60984167,
>> "want_state": "up:replay",
>> "state": "up:replay",
>> "fs_name": "cephfs",
>> "replay_status": {
>> "journal_read_pos": 0,
>> "journal_write_pos": 0,
>> "journal_expire_pos": 0,
>> "num_events": 0,
>> "num_segments": 0
>> },
>> "rank_uptime": 150224.982558844,
>> "mdsmap_epoch": 143757,
>> "osdmap_epoch": 12395,
>> "osdmap_epoch_barrier": 0,
>> "uptime": 150225.39968057699
>> }
>>
>> ########################
>>
>> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph04.cvdhsx status
>> 2023-01-16T08:59:05.434+0000 7fdb82ff5700 0 client.61299598 ms_handle_reset on v2:192.168.23.64:6800/3930607515
>> 2023-01-16T08:59:05.466+0000 7fdb82ff5700 0 client.61299604 ms_handle_reset on v2:192.168.23.64:6800/3930607515
>> {
>> "cluster_fsid":
"ff6e50de-ed72-11ec-881c-dca6325c2cc4",
>> "whoami": 0,
>> "id": 60984134,
>> "want_state": "up:replay",
>> "state": "up:replay",
>> "fs_name": "cephfs_insecure",
>> "replay_status": {
>> "journal_read_pos": 0,
>> "journal_write_pos": 0,
>> "journal_expire_pos": 0,
>> "num_events": 0,
>> "num_segments": 0
>> },
>> "rank_uptime": 150450.96934037199,
>> "mdsmap_epoch": 143815,
>> "osdmap_epoch": 12395,
>> "osdmap_epoch_barrier": 0,
>> "uptime": 150451.93533502301
>> }
>>
>> ###########################
>>
>> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph06.wcfdom status
>> 2023-01-16T08:59:28.572+0000 7f16538c0b80 -1 client.61250376 resolve_mds: no MDS daemons found by name `mds01.ceph06.wcfdom'
>> 2023-01-16T08:59:28.583+0000 7f16538c0b80 -1 client.61250376 FSMap: cephfs:1/1 cephfs_insecure:1/1 {cephfs:0=mds01.ceph05.pqxmvt=up:replay,cephfs_insecure:0=mds01.ceph04.cvdhsx=up:replay} 2 up:standby
>> Error ENOENT: problem getting command descriptions from mds.mds01.ceph06.wcfdom
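>>
>> To double-check which MDS daemon names actually exist, something like
>> this should list them (a sketch, assuming a cephadm deployment):
>>
>>     ceph orch ps | grep mds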
>>
>> ############################
>>
>> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph07.omdisd status
>> 2023-01-16T09:00:02.802+0000 7fb7affff700 0 client.61250454 ms_handle_reset on v2:192.168.23.67:6800/942898192
>> 2023-01-16T09:00:02.831+0000 7fb7affff700 0 client.61299751 ms_handle_reset on v2:192.168.23.67:6800/942898192
>> {
>> "cluster_fsid":
"ff6e50de-ed72-11ec-881c-dca6325c2cc4",
>> "whoami": -1,
>> "id": 60984161,
>> "want_state": "up:standby",
>> "state": "up:standby",
>> "mdsmap_epoch": 97687,
>> "osdmap_epoch": 0,
>> "osdmap_epoch_barrier": 0,
>> "uptime": 150508.29091721401
>> }
>>
>> The error message from ceph06 is new to me. That didn't happen the
>> previous times.
>>
>> [ceph: root@ceph06 /]# ceph fs dump
>> e143850
>> enable_multiple, ever_enabled_multiple: 1,1
>> default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
>> legacy client fscid: 2
>>
>> Filesystem 'cephfs' (2)
>> fs_name cephfs
>> epoch 143850
>> flags 12 joinable allow_snaps allow_multimds_snaps
>> created 2023-01-14T14:30:05.723421+0000
>> modified 2023-01-16T09:00:53.663007+0000
>> tableserver 0
>> root 0
>> session_timeout 60
>> session_autoclose 300
>> max_file_size 1099511627776
>> required_client_features {}
>> last_failure 0
>> last_failure_osd_epoch 12321
>> compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
>> max_mds 1
>> in 0
>> up {0=60984167}
>> failed
>> damaged
>> stopped
>> data_pools [4]
>> metadata_pool 5
>> inline_data disabled
>> balancer
>> standby_count_wanted 1
>> [mds.mds01.ceph05.pqxmvt{0:60984167} state up:replay seq 37637 addr [v2:192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694] compat {c=[1],r=[1],i=[7ff]}]
>>
>>
>> Filesystem 'cephfs_insecure' (3)
>> fs_name cephfs_insecure
>> epoch 143849
>> flags 12 joinable allow_snaps allow_multimds_snaps
>> created 2023-01-14T14:22:46.360062+0000
>> modified 2023-01-16T09:00:52.632163+0000
>> tableserver 0
>> root 0
>> session_timeout 60
>> session_autoclose 300
>> max_file_size 1099511627776
>> required_client_features {}
>> last_failure 0
>> last_failure_osd_epoch 12319
>> compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
>> max_mds 1
>> in 0
>> up {0=60984134}
>> failed
>> damaged
>> stopped
>> data_pools [7]
>> metadata_pool 6
>> inline_data disabled
>> balancer
>> standby_count_wanted 1
>> [mds.mds01.ceph04.cvdhsx{0:60984134} state up:replay seq 37639 addr [v2:192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515] compat {c=[1],r=[1],i=[7ff]}]
>>
>>
>> Standby daemons:
>>
>> [mds.mds01.ceph07.omdisd{-1:60984161} state up:standby seq 2 addr [v2:192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192] compat {c=[1],r=[1],i=[7ff]}]
>> [mds.mds01.ceph06.hsuhqd{-1:60984828} state up:standby seq 1 addr [v2:192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518] compat {c=[1],r=[1],i=[7ff]}]
>> dumped fsmap epoch 143850
>>
>>
>> #############################
>>
>> [ceph: root@ceph06 /]# ceph fs status
>>
>> (doesn't come back)
>>
>> #############################
>>
>> All MDS show log lines similar to this one:
>>
>> Jan 16 10:05:00 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143927 from mon.1
>> Jan 16 10:05:05 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143929 from mon.1
>> Jan 16 10:05:09 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143930 from mon.1
>> Jan 16 10:05:13 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143931 from mon.1
>> Jan 16 10:05:20 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143933 from mon.1
>> Jan 16 10:05:24 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143935 from mon.1
>> Jan 16 10:05:29 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143936 from mon.1
>> Jan 16 10:05:33 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143937 from mon.1
>> Jan 16 10:05:40 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143939 from mon.1
>> Jan 16 10:05:44 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143941 from mon.1
>> Jan 16 10:05:49 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating MDS map to version 143942 from mon.1
>>
>> Anything else I can provide?
>>
>> Cheers and thanks again!
>> Thomas
>>
>> On 16.01.23 06:01, Kotresh Hiremath Ravishankar wrote:
>> > Hi Thomas,
>> >
>> > As the documentation says, the MDS enters up:resolve from up:replay
>> > if the Ceph file system has multiple ranks (including this one),
>> > i.e. it's not a single active MDS cluster. The MDS is resolving any
>> > uncommitted inter-MDS operations. All ranks in the file system must
>> > be in this state or later for progress to be made, i.e. no rank can
>> > be failed/damaged or up:replay.
>> >
>> > So please check the status of the other active mds if it's failed.
>> >
>> > Also please share the mds logs and the output of 'ceph fs dump' and
>> > 'ceph fs status'.
>> >
>> > Thanks,
>> > Kotresh H R
>> >
>> > On Sat, Jan 14, 2023 at 9:07 PM Thomas Widhalm
>> > <thomas.widhalm@netways.de> wrote:
>> >
>> > Hi,
>> >
>> > I'm really lost with my Ceph system. I built a small cluster for
>> > home usage which has two uses for me: I want to replace an old NAS,
>> > and I want to learn about Ceph so that I have hands-on experience.
>> > We're using it in our company, but I need some real-life experience
>> > without risking any company or customers' data. That's my preferred
>> > way of learning.
>> >
>> > The cluster consists of 3 Raspberry Pis plus a few VMs running on
>> > Proxmox. I'm not using Proxmox's built-in Ceph because I want to
>> > focus on Ceph and not just use it as a preconfigured tool.
>> >
>> > All hosts are running Fedora (x86_64 and arm64), and during an
>> > upgrade from F36 to F37 my cluster suddenly showed all PGs as
>> > unavailable. I worked nearly a week to get it back online, and I
>> > learned a lot about Ceph management and recovery. The cluster is
>> > back, but I still can't access my data. Maybe you can help me?
>> >
>> > Here are my versions:
>> >
>> > [ceph: root@ceph04 /]# ceph versions
>> > {
>> > "mon": {
>> > "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
>> > },
>> > "mgr": {
>> > "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
>> > },
>> > "osd": {
>> > "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 5
>> > },
>> > "mds": {
>> > "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 4
>> > },
>> > "overall": {
>> > "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 15
>> > }
>> > }
>> >
>> >
>> > Here's the MDS status output of one MDS:
>> > [ceph: root@ceph04 /]# ceph tell mds.mds01.ceph05.pqxmvt status
>> > 2023-01-14T15:30:28.607+0000 7fb9e17fa700 0 client.60986454 ms_handle_reset on v2:192.168.23.65:6800/2680651694
>> > 2023-01-14T15:30:28.640+0000 7fb9e17fa700 0 client.60986460 ms_handle_reset on v2:192.168.23.65:6800/2680651694
>> > {
>> > "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4",
>> > "whoami": 0,
>> > "id": 60984167,
>> > "want_state": "up:replay",
>> > "state": "up:replay",
>> > "fs_name": "cephfs",
>> > "replay_status": {
>> > "journal_read_pos": 0,
>> > "journal_write_pos": 0,
>> > "journal_expire_pos": 0,
>> > "num_events": 0,
>> > "num_segments": 0
>> > },
>> > "rank_uptime": 1127.54018615,
>> > "mdsmap_epoch": 98056,
>> > "osdmap_epoch": 12362,
>> > "osdmap_epoch_barrier": 0,
>> > "uptime": 1127.957307273
>> > }
>> >
>> > It's been staying like that for days now. If there were a counter
>> > moving, I would just wait, but it doesn't change anything, and all
>> > the stats say the MDS aren't working at all.
>> >
>> > The symptom I have is that the Dashboard and all other tools I use
>> > say it's more or less ok (some old messages about failed daemons
>> > and scrubbing aside). But I can't mount anything. When I try to
>> > start a VM that's on RBD, I just get a timeout. And when I try to
>> > mount a CephFS, mount just hangs forever.
>> >
>> > Whatever command I give the MDS or the journal, it just hangs. The
>> > only thing I could do was take all CephFS offline, kill the MDSs
>> > and do a "ceph fs reset <fs name> --yes-i-really-mean-it". After
>> > that I rebooted all nodes, just to be sure, but I still have no
>> > access to my data.
>> >
>> > Could you please help me? I'm kinda desperate. If you need any more
>> > information, just let me know.
>> >
>> > Cheers,
>> > Thomas
>> >
--
Thomas Widhalm
Lead Systems Engineer
NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg
Tel: +49 911 92885-0 | Fax: +49 911 92885-77
CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510