[ceph-users] Re: MDS stuck in "up:replay"

19 Jan 2023

Hi,

Unfortunately the workaround didn't work out:

[ceph: root@ceph05 /]# ceph config show mds.mds01.ceph06.hsuhqd | grep
mds_wipe
mds_wipe_sessions     true

                             mon
[ceph: root@ceph05 /]# ceph config show mds.mds01.ceph04.cvdhsx | grep
mds_wipe
mds_wipe_sessions     true

                             mon
[ceph: root@ceph05 /]# ceph config show mds.mds01.ceph05.pqxmvt | grep
mds_wipe
mds_wipe_sessions     true

                             mon
[ceph: root@ceph05 /]# ceph tell mds.mds01.ceph05.pqxmvt flush journal
2023-01-19T13:38:07.403+0000 7ff94e7fc700  0 client.61855055
ms_handle_reset on v2:192.168.23.65:6800/957802673
2023-01-19T13:38:07.427+0000 7ff94e7fc700  0 client.61855061
ms_handle_reset on v2:192.168.23.65:6800/957802673
Error ENOSYS:
[ceph: root@ceph05 /]# ceph tell mds.mds01.ceph06.hsuhqd flush journal
2023-01-19T13:38:34.694+0000 7f789effd700  0 client.61855142
ms_handle_reset on v2:192.168.23.66:6810/2868317045
2023-01-19T13:38:34.728+0000 7f789effd700  0 client.61855148
ms_handle_reset on v2:192.168.23.66:6810/2868317045
{
     "message": "",
     "return_code": 0
}
[ceph: root@ceph05 /]# ceph tell mds.mds01.ceph04.cvdhsx flush journal
2023-01-19T13:38:46.402+0000 7fdee77fe700  0 client.61855172
ms_handle_reset on v2:192.168.23.64:6800/1605877585
2023-01-19T13:38:46.435+0000 7fdee77fe700  0 client.61855178
ms_handle_reset on v2:192.168.23.64:6800/1605877585
{
     "message": "",
     "return_code": 0
}
[ceph: root@ceph05 /]# ceph fs dump
e198622
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client
writeable ranges,3=default file layouts on dirs,4=dir inode in separate
object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 2

Filesystem 'cephfs' (2)
fs_name cephfs
epoch   198622
flags   12 joinable allow_snaps allow_multimds_snaps
created 2023-01-14T14:30:05.723421+0000
modified        2023-01-19T13:39:25.239395+0000
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
required_client_features        {}
last_failure    0
last_failure_osd_epoch  13541
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate
object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds
uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in      0
up      {0=61834171}
failed
damaged
stopped
data_pools      [4]
metadata_pool   5
inline_data     disabled
balancer
standby_count_wanted    1
[mds.mds01.ceph04.cvdhsx{0:61834171} state up:replay seq 240 addr
[v2:192.168.23.64:6800/1605877585,v1:192.168.23.64:6801/1605877585]
compat {c=[1],r=[1],i=[7ff]}]

Filesystem 'cephfs_insecure' (3)
fs_name cephfs_insecure
epoch   198621
flags   12 joinable allow_snaps allow_multimds_snaps
created 2023-01-14T14:22:46.360062+0000
modified        2023-01-19T13:39:22.799446+0000
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
required_client_features        {}
last_failure    0
last_failure_osd_epoch  13539
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate
object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds
uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in      0
up      {0=61834120}
failed
damaged
stopped
data_pools      [7]
metadata_pool   6
inline_data     disabled
balancer
standby_count_wanted    1
[mds.mds01.ceph06.hsuhqd{0:61834120} state up:replay seq 241 addr
[v2:192.168.23.66:6810/2868317045,v1:192.168.23.66:6811/2868317045]
compat {c=[1],r=[1],i=[7ff]}]

Standby daemons:

[mds.mds01.ceph05.pqxmvt{-1:61834887} state up:standby seq 1 addr
[v2:192.168.23.65:6800/957802673,v1:192.168.23.65:6801/957802673] compat
{c=[1],r=[1],i=[7ff]}]
dumped fsmap epoch 198622

On 19.01.23 14:01, Venky Shankar wrote:
...
  Hi Thomas,

 On Tue, Jan 17, 2023 at 5:34 PM Thomas Widhalm
 &lt;thomas.widhalm(a)netways.de&gt; wrote:

 Another new thing that just happened:

 One of the MDS just crashed out of nowhere.

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc:
 In function 'void EMetaBlob::replay(MDSRank*, LogSegment*,
 MDPeerUpdate*)' thread 7fccc7153700 time 2023-01-17T10:05:15.420191+0000

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc:
 1625: FAILED ceph_assert(g_conf()->mds_wipe_sessions)

    ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy
 (stable)
    1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
 const*)+0x135) [0x7fccd759943f]
    2: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605]
    3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c)
 [0x55fb2b98e89c]
    4: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0]
    5: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443]
    6: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31]
    7: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca]
    8: clone() 
 To workaround this (for now) till the bug is fixed, set

           mds_wipe_sessions = true

 in ceph.conf, allow the MDS to transition to `active` state. Once
 done, flush the journal:

          ceph tell mds.<> flush journal

 then you can safely remove the config.

 and

 *** Caught signal (Aborted) **
    in thread 7fccc7153700 thread_name:md_log_replay

    ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy
 (stable)
    1: /lib64/libpthread.so.0(+0x12cf0) [0x7fccd6593cf0]
    2: gsignal()
    3: abort()
    4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
 const*)+0x18f) [0x7fccd7599499]
    5: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605]
    6: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c)
 [0x55fb2b98e89c]
    7: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0]
    8: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443]
    9: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31]
    10: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca]
    11: clone()
    NOTE: a copy of the executable, or `objdump -rdS <executable>` is
 needed to interpret this.

 Is what I found in the logs. Since it's referring to log replaying,
 could this be related to my issue?

 On 17.01.23 10:54, Thomas Widhalm wrote:
  Hi again,

 Another thing I found: Out of pure desperation, I started MDS on all
 nodes. I had them configured in the past so I was hoping, they could
 help with bringing in missing data even when they were down for quite a
 while now. I didn't see any changes in the logs but the CPU on the hosts
 that usually don't run MDS just spiked. So high I had to kill the MDS
 again because otherwise they kept killing OSD containers. So I don't
 really have any new information, but maybe that could be a hint of some
 kind?

 Cheers,
 Thomas

 On 17.01.23 10:13, Thomas Widhalm wrote:
  Hi,

 Thanks again. :-)

 Ok, that seems like an error to me. I never configured an extra rank for
 MDS. Maybe that's where my knowledge failed me but I guess, MDS is
 waiting for something that was never there.

 Yes, there are two filesystems. Due to "budget restrictions" (it's my
 personal system at home, I configured a second CephFS with only one
 replica for data that could be easily restored.

 Here's what I got when turning up the debug level:

 Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
 interval 1.000000000s
 Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
 Sending beacon up:replay seq 11107
 Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
 sender thread waiting interval 4s
 Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
 received beacon reply up:replay seq 11107 rtt 0.00200002
 Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
 Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167
 schedule_update_timer_task
 Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache Memory usage:  total
 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps,
 0 caps, 0 caps per inode
 Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
 trimming
 Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
 interval 1.000000000s
 Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache Memory usage:  total
 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps,
 0 caps, 0 caps per inode
 Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
 trimming
 Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
 interval 1.000000000s
 Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
 Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.158167
 schedule_update_timer_task
 Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache Memory usage:  total
 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps,
 0 caps, 0 caps per inode
 Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
 trimming
 Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
 interval 1.000000000s
 Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache Memory usage:  total
 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps,
 0 caps, 0 caps per inode
 Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
 trimming
 Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
 interval 1.000000000s
 Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
 Sending beacon up:replay seq 11108
 Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
 sender thread waiting interval 4s
 Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
 received beacon reply up:replay seq 11108 rtt 0.00200002
 Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
 Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.158167
 schedule_update_timer_task
 Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache Memory usage:  total
 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps,
 0 caps, 0 caps per inode
 Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
 trimming
 Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
 interval 1.000000000s
 Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache Memory usage:  total
 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps,
 0 caps, 0 caps per inode
 Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
 trimming
 Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
 interval 1.000000000s
 Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
 Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.158167
 schedule_update_timer_task
 Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache Memory usage:  total
 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps,
 0 caps, 0 caps per inode
 Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
 trimming
 Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
 interval 1.000000000s
 Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache Memory usage:  total
 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps,
 0 caps, 0 caps per inode
 Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
 trimming
 Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
 interval 1.000000000s
 Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
 Sending beacon up:replay seq 11109
 Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
 sender thread waiting interval 4s
 Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
 received beacon reply up:replay seq 11109 rtt 0.00600006
 Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
 Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.158167
 schedule_update_timer_task
 Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache Memory usage:  total
 372640, rss 57344, heap 207124, baseline 182548, 0 / 3 inodes have caps,
 0 caps, 0 caps per inode
 Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
 trimming
 Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache releasing free memory
 Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
 interval 1.000000000s
 Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache Memory usage:  total
 372640, rss 57272, heap 207124, baseline 182548, 0 / 3 inodes have caps,
 0 caps, 0 caps per inode
 Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
 trimming
 Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
 interval 1.000000000s
 Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
 Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.158167
 schedule_update_timer_task
 Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache Memory usage:  total
 372640, rss 57040, heap 207124, baseline 182548, 0 / 3 inodes have caps,
 0 caps, 0 caps per inode
 Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
 trimming
 Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
 interval 1.000000000s

 The only thing that gives me hope here is that the line
 mds.beacon.mds01.ceph05.pqxmvt Sending beacon up:replay seq 11109 is
 chaning its sequence number.

 Anything else I can provide?

 Cheers,
 Thomas

 On 17.01.23 06:27, Kotresh Hiremath Ravishankar wrote:
> Hi Thomas,
>
> Sorry, I misread the mds state to be stuck in 'up:resolve' state. The
> mds is stuck in 'up:replay' which means the MDS taking over a failed
> rank.
> This state represents that the MDS is recovering its journal and other
> metadata.
>
> I notice that there are two filesystems 'cephfs' and
'cephfs_insecure'
> and the active mds for both filesystems are stuck in 'up:replay'. The
> mds
> logs shared are not providing much information to infer anything.
>
> Could you please enable the debug logs and pass on the mds logs ?
>
> Thanks,
> Kotresh H R
>
> On Mon, Jan 16, 2023 at 2:38 PM Thomas Widhalm
> &lt;thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de>> wrote:
>
>      Hi Kotresh,
>
>      Thanks for your reply!
>
>      I only have one rank. Here's the output of all MDS I have:
>
>      ###################
>
>      [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph05.pqxmvt status
>      2023-01-16T08:55:26.055+0000 7f3412ffd700  0 client.61249926
>      ms_handle_reset on v2:192.168.23.65:6800/2680651694
>      <http://192.168.23.65:6800/2680651694>
>      2023-01-16T08:55:26.084+0000 7f3412ffd700  0 client.61299199
>      ms_handle_reset on v2:192.168.23.65:6800/2680651694
>      <http://192.168.23.65:6800/2680651694>
>      {
>            "cluster_fsid":
"ff6e50de-ed72-11ec-881c-dca6325c2cc4",
>            "whoami": 0,
>            "id": 60984167,
>            "want_state": "up:replay",
>            "state": "up:replay",
>            "fs_name": "cephfs",
>            "replay_status": {
>                "journal_read_pos": 0,
>                "journal_write_pos": 0,
>                "journal_expire_pos": 0,
>                "num_events": 0,
>                "num_segments": 0
>            },
>            "rank_uptime": 150224.982558844,
>            "mdsmap_epoch": 143757,
>            "osdmap_epoch": 12395,
>            "osdmap_epoch_barrier": 0,
>            "uptime": 150225.39968057699
>      }
>
>      ########################
>
>      [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph04.cvdhsx status
>      2023-01-16T08:59:05.434+0000 7fdb82ff5700  0 client.61299598
>      ms_handle_reset on v2:192.168.23.64:6800/3930607515
>      <http://192.168.23.64:6800/3930607515>
>      2023-01-16T08:59:05.466+0000 7fdb82ff5700  0 client.61299604
>      ms_handle_reset on v2:192.168.23.64:6800/3930607515
>      <http://192.168.23.64:6800/3930607515>
>      {
>            "cluster_fsid":
"ff6e50de-ed72-11ec-881c-dca6325c2cc4",
>            "whoami": 0,
>            "id": 60984134,
>            "want_state": "up:replay",
>            "state": "up:replay",
>            "fs_name": "cephfs_insecure",
>            "replay_status": {
>                "journal_read_pos": 0,
>                "journal_write_pos": 0,
>                "journal_expire_pos": 0,
>                "num_events": 0,
>                "num_segments": 0
>            },
>            "rank_uptime": 150450.96934037199,
>            "mdsmap_epoch": 143815,
>            "osdmap_epoch": 12395,
>            "osdmap_epoch_barrier": 0,
>            "uptime": 150451.93533502301
>      }
>
>      ###########################
>
>      [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph06.wcfdom status
>      2023-01-16T08:59:28.572+0000 7f16538c0b80 -1 client.61250376
>      resolve_mds: no MDS daemons found by name `mds01.ceph06.wcfdom'
>      2023-01-16T08:59:28.583+0000 7f16538c0b80 -1 client.61250376 FSMap:
>      cephfs:1/1 cephfs_insecure:1/1
>
>
{cephfs:0=mds01.ceph05.pqxmvt=up:replay,cephfs_insecure:0=mds01.ceph04.cvdhsx=up:replay}
>      2 up:standby
>      Error ENOENT: problem getting command descriptions from
>      mds.mds01.ceph06.wcfdom
>
>      ############################
>
>      [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph07.omdisd status
>      2023-01-16T09:00:02.802+0000 7fb7affff700  0 client.61250454
>      ms_handle_reset on v2:192.168.23.67:6800/942898192
>      <http://192.168.23.67:6800/942898192>
>      2023-01-16T09:00:02.831+0000 7fb7affff700  0 client.61299751
>      ms_handle_reset on v2:192.168.23.67:6800/942898192
>      <http://192.168.23.67:6800/942898192>
>      {
>            "cluster_fsid":
"ff6e50de-ed72-11ec-881c-dca6325c2cc4",
>            "whoami": -1,
>            "id": 60984161,
>            "want_state": "up:standby",
>            "state": "up:standby",
>            "mdsmap_epoch": 97687,
>            "osdmap_epoch": 0,
>            "osdmap_epoch_barrier": 0,
>            "uptime": 150508.29091721401
>      }
>
>      The error message from ceph06 is new to me. That didn't happen the
> last
>      times.
>
>      [ceph: root@ceph06 /]# ceph fs dump
>      e143850
>      enable_multiple, ever_enabled_multiple: 1,1
>      default compat: compat={},rocompat={},incompat={1=base
> v0.20,2=client
>      writeable ranges,3=default file layouts on dirs,4=dir inode in
> separate
>      object,5=mds uses versioned encoding,6=dirfrag is stored in
> omap,8=no
>      anchor table,9=file layout v2,10=snaprealm v2}
>      legacy client fscid: 2
>
>      Filesystem 'cephfs' (2)
>      fs_name cephfs
>      epoch   143850
>      flags   12 joinable allow_snaps allow_multimds_snaps
>      created 2023-01-14T14:30:05.723421+0000
>      modified        2023-01-16T09:00:53.663007+0000
>      tableserver     0
>      root    0
>      session_timeout 60
>      session_autoclose       300
>      max_file_size   1099511627776
>      required_client_features        {}
>      last_failure    0
>      last_failure_osd_epoch  12321
>      compat  compat={},rocompat={},incompat={1=base v0.20,2=client
> writeable
>      ranges,3=default file layouts on dirs,4=dir inode in separate
>      object,5=mds uses versioned encoding,6=dirfrag is stored in
> omap,7=mds
>      uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
>      max_mds 1
>      in      0
>      up      {0=60984167}
>      failed
>      damaged
>      stopped
>      data_pools      [4]
>      metadata_pool   5
>      inline_data     disabled
>      balancer
>      standby_count_wanted    1
>      [mds.mds01.ceph05.pqxmvt{0:60984167} state up:replay seq 37637 addr
>      [v2:192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694
>
> <http://192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694>]
>      compat {c=[1],r=[1],i=[7ff]}]
>
>
>      Filesystem 'cephfs_insecure' (3)
>      fs_name cephfs_insecure
>      epoch   143849
>      flags   12 joinable allow_snaps allow_multimds_snaps
>      created 2023-01-14T14:22:46.360062+0000
>      modified        2023-01-16T09:00:52.632163+0000
>      tableserver     0
>      root    0
>      session_timeout 60
>      session_autoclose       300
>      max_file_size   1099511627776
>      required_client_features        {}
>      last_failure    0
>      last_failure_osd_epoch  12319
>      compat  compat={},rocompat={},incompat={1=base v0.20,2=client
> writeable
>      ranges,3=default file layouts on dirs,4=dir inode in separate
>      object,5=mds uses versioned encoding,6=dirfrag is stored in
> omap,7=mds
>      uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
>      max_mds 1
>      in      0
>      up      {0=60984134}
>      failed
>      damaged
>      stopped
>      data_pools      [7]
>      metadata_pool   6
>      inline_data     disabled
>      balancer
>      standby_count_wanted    1
>      [mds.mds01.ceph04.cvdhsx{0:60984134} state up:replay seq 37639 addr
>      [v2:192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515
>
> <http://192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515>]
>      compat {c=[1],r=[1],i=[7ff]}]
>
>
>      Standby daemons:
>
>      [mds.mds01.ceph07.omdisd{-1:60984161} state up:standby seq 2 addr
>      [v2:192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192
>
> <http://192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192>]
> compat
>      {c=[1],r=[1],i=[7ff]}]
>      [mds.mds01.ceph06.hsuhqd{-1:60984828} state up:standby seq 1 addr
>      [v2:192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518
>
> <http://192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518>]
>      compat {c=[1],r=[1],i=[7ff]}]
>      dumped fsmap epoch 143850
>
>      #############################
>
>      [ceph: root@ceph06 /]# ceph fs status
>
>      (doesn't come back)
>
>      #############################
>
>      All MDS show log lines similar to this one:
>
>      Jan 16 10:05:00 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx
> Updating
>      MDS map to version 143927 from mon.1
>      Jan 16 10:05:05 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx
> Updating
>      MDS map to version 143929 from mon.1
>      Jan 16 10:05:09 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx
> Updating
>      MDS map to version 143930 from mon.1
>      Jan 16 10:05:13 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx
> Updating
>      MDS map to version 143931 from mon.1
>      Jan 16 10:05:20 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx
> Updating
>      MDS map to version 143933 from mon.1
>      Jan 16 10:05:24 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx
> Updating
>      MDS map to version 143935 from mon.1
>      Jan 16 10:05:29 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx
> Updating
>      MDS map to version 143936 from mon.1
>      Jan 16 10:05:33 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx
> Updating
>      MDS map to version 143937 from mon.1
>      Jan 16 10:05:40 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx
> Updating
>      MDS map to version 143939 from mon.1
>      Jan 16 10:05:44 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx
> Updating
>      MDS map to version 143941 from mon.1
>      Jan 16 10:05:49 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx
> Updating
>      MDS map to version 143942 from mon.1
>
>      Anything else, I can provide?
>
>      Cheers and thanks again!
>      Thomas
>
>      On 16.01.23 06:01, Kotresh Hiremath Ravishankar wrote:
>       > Hi Thomas,
>       >
>       > As the documentation says, the MDS enters up:resolve from
>      |up:replay| if
>       > the Ceph file system has multiple ranks (including this one),
>      i.e. it’s
>       > not a single active MDS cluster.
>       > The MDS is resolving any uncommitted inter-MDS operations. All
>      ranks in
>       > the file system must be in this state or later for progress to be
>      made,
>       > i.e. no rank can be failed/damaged or |up:replay|.
>       >
>       > So please check the status of the other active mds if it's
> failed.
>       >
>       > Also please share the mds logs and the output of 'ceph fs dump'
> and
>       > 'ceph fs status'
>       >
>       > Thanks,
>       > Kotresh H R
>       >
>       > On Sat, Jan 14, 2023 at 9:07 PM Thomas Widhalm
>       > &lt;thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de>
>      <mailto:thomas.widhalm@netways.de
>      <mailto:thomas.widhalm@netways.de>>> wrote:
>       >
>       >     Hi,
>       >
>       >     I'm really lost with my Ceph system. I built a small cluster
>      for home
>       >     usage which has two uses for me: I want to replace an old NAS
>      and I want
>       >     to learn about Ceph so that I have hands-on experience. We're
>      using it
>       >     in our company but I need some real-life experience without
>      risking any
>       >     company or customers data. That's my preferred way of
> learning.
>       >
>       >     The cluster consists of 3 Raspberry Pis plus a few VMs
> running on
>       >     Proxmox. I'm not using Proxmox' built in Ceph because I want
>      to focus on
>       >     Ceph and not just use it as a preconfigured tool.
>       >
>       >     All hosts are running Fedora (x86_64 and arm64) and during an
>      Upgrade
>       >     from F36 to F37 my cluster suddenly showed all PGs as
>      unavailable. I
>       >     worked nearly a week to get it back online and I learned a
>      lot about
>       >     Ceph management and recovery. The cluster is back but I still
>      can't
>       >     access my data. Maybe you can help me?
>       >
>       >     Here are my versions:
>       >
>       >     [ceph: root@ceph04 /]# ceph versions
>       >     {
>       >           "mon": {
>       >               "ceph version 17.2.5
>       >     (98318ae89f1a893a6ded3a640405cdbb33e08757)
>       >     quincy (stable)": 3
>       >           },
>       >           "mgr": {
>       >               "ceph version 17.2.5
>       >     (98318ae89f1a893a6ded3a640405cdbb33e08757)
>       >     quincy (stable)": 3
>       >           },
>       >           "osd": {
>       >               "ceph version 17.2.5
>       >     (98318ae89f1a893a6ded3a640405cdbb33e08757)
>       >     quincy (stable)": 5
>       >           },
>       >           "mds": {
>       >               "ceph version 17.2.5
>       >     (98318ae89f1a893a6ded3a640405cdbb33e08757)
>       >     quincy (stable)": 4
>       >           },
>       >           "overall": {
>       >               "ceph version 17.2.5
>       >     (98318ae89f1a893a6ded3a640405cdbb33e08757)
>       >     quincy (stable)": 15
>       >           }
>       >     }
>       >
>       >
>       >     Here's MDS status output of one MDS:
>       >     [ceph: root@ceph04 /]# ceph tell mds.mds01.ceph05.pqxmvt
> status
>       >     2023-01-14T15:30:28.607+0000 7fb9e17fa700  0 client.60986454
>       >     ms_handle_reset on v2:192.168.23.65:6800/2680651694
>      <http://192.168.23.65:6800/2680651694>
>       >     <http://192.168.23.65:6800/2680651694
>      <http://192.168.23.65:6800/2680651694>>
>       >     2023-01-14T15:30:28.640+0000 7fb9e17fa700  0 client.60986460
>       >     ms_handle_reset on v2:192.168.23.65:6800/2680651694
>      <http://192.168.23.65:6800/2680651694>
>       >     <http://192.168.23.65:6800/2680651694
>      <http://192.168.23.65:6800/2680651694>>
>       >     {
>       >           "cluster_fsid":
"ff6e50de-ed72-11ec-881c-dca6325c2cc4",
>       >           "whoami": 0,
>       >           "id": 60984167,
>       >           "want_state": "up:replay",
>       >           "state": "up:replay",
>       >           "fs_name": "cephfs",
>       >           "replay_status": {
>       >               "journal_read_pos": 0,
>       >               "journal_write_pos": 0,
>       >               "journal_expire_pos": 0,
>       >               "num_events": 0,
>       >               "num_segments": 0
>       >           },
>       >           "rank_uptime": 1127.54018615,
>       >           "mdsmap_epoch": 98056,
>       >           "osdmap_epoch": 12362,
>       >           "osdmap_epoch_barrier": 0,
>       >           "uptime": 1127.957307273
>       >     }
>       >
>       >     It's staying like that for days now. If there was a counter
>      moving, I
>       >     just would wait but it doesn't change anything and alle stats
>      says, the
>       >     MDS aren't working at all.
>       >
>       >     The symptom I have is that Dashboard and all other tools I
>      use say, it's
>       >     more or less ok. (Some old messages about failed daemons and
>      scrubbing
>       >     aside). But I can't mount anything. When I try to start a VM
>      that's on
>       >     RDS I just get a timeout. And when I try to mount a CephFS,
>      mount just
>       >     hangs forever.
>       >
>       >     Whatever command I give MDS or journal, it just hangs. The
>      only thing I
>       >     could do, was take all CephFS offline, kill the MDS's and do
>      a "ceph fs
>       >     reset <fs name> --yes-i-really-mean-it". After that I
>      rebooted all
>       >     nodes, just to be sure but I still have no access to data.
>       >
>       >     Could you please help me? I'm kinda desperate. If you need
>      any more
>       >     information, just let me know.
>       >
>       >     Cheers,
>       >     Thomas
>       >
>       >     --
>       >     Thomas Widhalm
>       >     Lead Systems Engineer
>       >
>       >     NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 |
>       >     D-90429 Nuernberg
>       >     Tel: +49 911 92885-0 | Fax: +49 911 92885-77
>       >     CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510
>       > https://www.netways.de <https://www.netways.de>
>      <https://www.netways.de <https://www.netways.de>> |
>       > thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de>
>      <mailto:thomas.widhalm@netways.de
> <mailto:thomas.widhalm@netways.de>>
>       >
>       >     ** stackconf 2023 - September - https://stackconf.eu
>      <https://stackconf.eu>
>       >     <https://stackconf.eu <https://stackconf.eu>> **
>       >     ** OSMC 2023 - November - https://osmc.de <https://osmc.de>
>      <https://osmc.de <https://osmc.de>> **
>       >     ** New at NWS: Managed Database -
>       > https://nws.netways.de/managed-database
>      <https://nws.netways.de/managed-database>
>       >     <https://nws.netways.de/managed-database
>      <https://nws.netways.de/managed-database>> **
>       >     ** NETWAYS Web Services - https://nws.netways.de
>      <https://nws.netways.de>
>       >     <https://nws.netways.de <https://nws.netways.de>> **
>       >     _______________________________________________
>       >     ceph-users mailing list -- ceph-users(a)ceph.io
>      <mailto:ceph-users@ceph.io>
>       >     <mailto:ceph-users@ceph.io <mailto:ceph-users@ceph.io>>
>       >     To unsubscribe send an email to ceph-users-leave(a)ceph.io
>      <mailto:ceph-users-leave@ceph.io>
>       >     <mailto:ceph-users-leave@ceph.io
>      <mailto:ceph-users-leave@ceph.io>>
>       >
>
>      --
>      Thomas Widhalm
>      Lead Systems Engineer
>
>      NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 |
>      D-90429 Nuernberg
>      Tel: +49 911 92885-0 | Fax: +49 911 92885-77
>      CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510
>      https://www.netways.de <https://www.netways.de> |
>      thomas.widhalm(a)netways.de <mailto:thomas.widhalm@netways.de>
>
>      ** stackconf 2023 - September - https://stackconf.eu
>      <https://stackconf.eu> **
>      ** OSMC 2023 - November - https://osmc.de <https://osmc.de> **
>      ** New at NWS: Managed Database -
>      https://nws.netways.de/managed-database
>      <https://nws.netways.de/managed-database> **
>      ** NETWAYS Web Services - https://nws.netways.de
>      <https://nws.netways.de> **
>

 --
 Thomas Widhalm
 Lead Systems Engineer

 NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429
 Nuernberg
 Tel: +49 911 92885-0 | Fax: +49 911 92885-77
 CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510
 https://www.netways.de | thomas.widhalm(a)netways.de

 ** stackconf 2023 - September - https://stackconf.eu **
 ** OSMC 2023 - November - https://osmc.de **
 ** New at NWS: Managed Database -
 https://nws.netways.de/managed-database **
 ** NETWAYS Web Services - https://nws.netways.de **
 _______________________________________________
 ceph-users mailing list -- ceph-users(a)ceph.io
 To unsubscribe send an email to ceph-users-leave(a)ceph.io 
 --
 Thomas Widhalm
 Lead Systems Engineer

 NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429
 Nuernberg
 Tel: +49 911 92885-0 | Fax: +49 911 92885-77
 CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510
 https://www.netways.de | thomas.widhalm(a)netways.de

 ** stackconf 2023 - September - https://stackconf.eu **
 ** OSMC 2023 - November - https://osmc.de **
 ** New at NWS: Managed Database -
 https://nws.netways.de/managed-database **
 ** NETWAYS Web Services - https://nws.netways.de **
 _______________________________________________
 ceph-users mailing list -- ceph-users(a)ceph.io
 To unsubscribe send an email to ceph-users-leave(a)ceph.io 
 --
 Thomas Widhalm
 Lead Systems Engineer

 NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg
 Tel: +49 911 92885-0 | Fax: +49 911 92885-77
 CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510
 https://www.netways.de | thomas.widhalm(a)netways.de

 ** stackconf 2023 - September - https://stackconf.eu **
 ** OSMC 2023 - November - https://osmc.de **
 ** New at NWS: Managed Database - https://nws.netways.de/managed-database **
 ** NETWAYS Web Services - https://nws.netways.de **
 _______________________________________________
 ceph-users mailing list -- ceph-users(a)ceph.io
 To unsubscribe send an email to ceph-users-leave(a)ceph.io 

-- 
Thomas Widhalm
Lead Systems Engineer

NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg
Tel: +49 911 92885-0 | Fax: +49 911 92885-77
CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510
https://www.netways.de | thomas.widhalm(a)netways.de

** stackconf 2023 - September - https://stackconf.eu **
** OSMC 2023 - November - https://osmc.de **
** New at NWS: Managed Database - https://nws.netways.de/managed-database **
** NETWAYS Web Services - https://nws.netways.de **

2024

2023

2022

2021

2020

2019

[ceph-users] Re: MDS stuck in "up:replay"