[ceph-users] Re: MDS crash

22 Apr 2024

Ok, we will create the ticket.

Eugen Block - the ceph tell command needs to communicate with a running MDS daemon, but
ours has crashed. So the only thing I can report is that the command cannot reach the
daemon:

ceph tell mds.0 damage ls
Error ENOENT: problem getting command descriptions from mds.0

---
Best regards,

Alexey Gerasimov
System Manager

www.opencascade.com
www.capgemini.com

-----Original Message-----
From: Xiubo Li <xiubli(a)redhat.com>
Sent: Monday, April 22, 2024 2:21 AM
To: Alexey GERASIMOV <alexey.gerasimov(a)opencascade.com>; ceph-users(a)ceph.io
Subject: Re: [ceph-users] MDS crash

Hi Alexey,

This looks like a new issue to me. Please create a tracker for it and provide the detailed
call trace there.

Thanks

- Xiubo

On 4/19/24 05:42, alexey.gerasimov(a)opencascade.com wrote:
...
  Dear colleagues, we hope that somebody can help us.

 The starting point: a Ceph cluster v15.2 (installed and managed by Proxmox) with 3
nodes based on physical servers rented from a cloud provider. CephFS is also installed.

 Yesterday we discovered that some of our applications had stopped working. During the
investigation we found that the problem was with Ceph, more precisely with CephFS: the
MDS daemons had suddenly crashed. We tried to restart them and found that they crashed
again immediately after starting. The crash information:
 2024-04-17T17:47:42.841+0000 7f959ced9700  1 mds.0.29134 recovery_done -- successful recovery!
 2024-04-17T17:47:42.853+0000 7f959ced9700  1 mds.0.29134 active_start
 2024-04-17T17:47:42.881+0000 7f959ced9700  1 mds.0.29134 cluster recovered.
 2024-04-17T17:47:43.825+0000 7f959aed5700 -1 ./src/mds/OpenFileTable.cc: In function 'void OpenFileTable::commit(MDSContext*, uint64_t, int)' thread 7f959aed5700 time 2024-04-17T17:47:43.831243+0000
 ./src/mds/OpenFileTable.cc: 549: FAILED ceph_assert(count > 0)
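
 For the tracker ticket it can help to pull the source file, line, function, and failed
condition out of the crash output. The following is only a minimal Python sketch: the
function name parse_assert and the regular expressions are assumptions made for this
example, and it relies on nothing beyond the log text quoted above.

```python
import re

# Sample copied from the crash output above.
LOG = """\
2024-04-17T17:47:43.825+0000 7f959aed5700 -1
./src/mds/OpenFileTable.cc: In function 'void
OpenFileTable::commit(MDSContext*, uint64_t, int)' thread 7f959aed5700
time 2024-04-17T17:47:43.831243+0000
./src/mds/OpenFileTable.cc: 549: FAILED ceph_assert(count > 0)
"""

# Hypothetical patterns, written only against the sample above.
FUNC_RE = re.compile(r"In function '(?P<func>[^']+)' thread (?P<thread>[0-9a-f]+)")
ASSERT_RE = re.compile(
    r"(?P<file>\S+): (?P<line>\d+): FAILED ceph_assert\((?P<cond>.*)\)\s*$"
)


def parse_assert(log):
    """Return file, line, condition, function and thread of a failed ceph_assert."""
    result = {}
    # The "In function" message may be wrapped, so search a whitespace-collapsed copy.
    flat = " ".join(log.split())
    func = FUNC_RE.search(flat)
    if func:
        result["function"] = func["func"]
        result["thread"] = func["thread"]
    # The assert itself sits on a single line, so scan line by line.
    for line in log.splitlines():
        match = ASSERT_RE.search(line.strip())
        if match:
            result["file"] = match["file"]
            result["line"] = int(match["line"])
            result["condition"] = match["cond"]
            break
    return result


info = parse_assert(LOG)
print(info["file"], info["line"], info["condition"])
```

 The resulting dictionary can be pasted into the ticket next to the full call trace.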

 Over the next hours we read tons of articles, studied the documentation, and checked the
overall state of the Ceph cluster with various diagnostic commands, but didn't find
anything wrong. In the evening we decided to upgrade to v16, and finally to v17.2.7.
Unfortunately, this didn't solve the problem: the MDS continues to crash with the same
error. The only difference we found is "1 MDSs report damaged metadata" in the output of
ceph -s (see below).

 I supposed that it might be a known bug, but couldn't find a matching
 one on https://tracker.ceph.com - there are several bugs
 associated with the file OpenFileTable.cc, but none related to
 ceph_assert(count > 0).

 We also looked at the source code of OpenFileTable.cc. Here is a fragment of it, from the
function OpenFileTable::_journal_finish:
        int omap_idx = anchor.omap_idx;
        unsigned& count = omap_num_items.at(omap_idx);
        ceph_assert(count > 0);
 So we guess that the object map is empty for some object in Ceph, which
 is unexpected behavior. But again, we found nothing wrong in our
 cluster…
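
 The guess above can be illustrated with a toy model. This is only a sketch under
assumptions: the class OmapItemCounter and its methods are hypothetical and do not
reproduce the MDS code. It merely shows how a per-object item counter that is already
zero trips a check shaped like ceph_assert(count > 0) when one more removal is requested.

```python
class OmapItemCounter:
    """Toy stand-in for per-omap-object item counts (hypothetical, not Ceph code)."""

    def __init__(self, num_objects):
        # One counter per omap object, indexed like omap_idx in the fragment above.
        self.num_items = [0] * num_objects

    def add(self, omap_idx):
        self.num_items[omap_idx] += 1

    def remove(self, omap_idx):
        count = self.num_items[omap_idx]
        # Same shape as ceph_assert(count > 0): removing from an empty object
        # means the bookkeeping and the requested operation disagree.
        if count <= 0:
            raise AssertionError(f"count > 0 failed for omap object {omap_idx}")
        self.num_items[omap_idx] = count - 1


counter = OmapItemCounter(num_objects=2)
counter.add(0)
counter.remove(0)      # fine: the counter goes from 1 back to 0
try:
    counter.remove(0)  # the counter is already 0, so the check fails
except AssertionError as err:
    print("assertion failed:", err)
```

 In this model the failure says nothing about the stored data itself, only that the
counter and the requested removal are out of sync.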

 Next, we turned to the
 https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/ article: we tried to
reset the journal (even though it had reported OK the whole time) and to wipe the sessions
using the cephfs-table-tool all reset session command. No result… Now I have decided to
continue following this article and run the cephfs-data-scan scan_extents command; it is
running right now. But I doubt that it will solve the issue, because there is no problem
with our objects in Ceph.

 Is this a new bug, or something else? Any ideas are welcome!

 The important outputs:

 ----- ceph -s
    cluster:
      id:     4cd1c477-c8d0-4855-a1f1-cb71d89427ed
      health: HEALTH_ERR
              1 MDSs report damaged metadata
              insufficient standby MDS daemons available
              83 daemons have recently crashed
              3 mgr modules have recently crashed

    services:
      mon: 3 daemons, quorum asrv-dev-stor-2,asrv-dev-stor-3,asrv-dev-stor-1 (age 22h)
      mgr: asrv-dev-stor-2(active, since 22h), standbys: asrv-dev-stor-1
      mds: 1/1 daemons up
      osd: 18 osds: 18 up (since 22h), 18 in (since 29h)

    data:
      volumes: 1/1 healthy
      pools:   5 pools, 289 pgs
      objects: 29.72M objects, 5.6 TiB
      usage:   21 TiB used, 47 TiB / 68 TiB avail
      pgs:     287 active+clean
               2   active+clean+scrubbing+deep

    io:
      client:   2.5 KiB/s rd, 172 KiB/s wr, 261 op/s rd, 195 op/s wr

 -----ceph fs dump
 e29480
 enable_multiple, ever_enabled_multiple: 0,1
 default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
 legacy client fscid: 1

 Filesystem 'cephfs' (1)
 fs_name cephfs
 epoch   29480
 flags   12 joinable allow_snaps allow_multimds_snaps
 created 2022-11-25T15:56:08.507407+0000
 modified        2024-04-18T16:52:29.970504+0000
 tableserver     0
 root    0
 session_timeout 60
 session_autoclose       300
 max_file_size   1099511627776
 required_client_features        {}
 last_failure    0
 last_failure_osd_epoch  14728
 compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
 max_mds 1
 in      0
 up      {0=156636152}
 failed
 damaged
 stopped
 data_pools      [5]
 metadata_pool   6
 inline_data     disabled
 balancer
 standby_count_wanted    1
 [mds.asrv-dev-stor-1{0:156636152} state up:active seq 6 laggy since 
 2024-04-18T16:52:29.970479+0000 addr 
 [v2:172.22.2.91:6800/2487054023,v1:172.22.2.91:6801/2487054023] compat 
 {c=[1],r=[1],i=[7ff]}]

 -----cephfs-journal-tool --rank=cephfs:0 journal inspect
 Overall journal integrity: OK

 -----ceph pg dump summary
 version 41137
 stamp 2024-04-18T21:17:59.133536+0000
 last_osdmap_epoch 0
 last_pg_scan 0
 PG_STAT  OBJECTS   MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES          OMAP_BYTES*  OMAP_KEYS*  LOG      DISK_LOG
 sum      29717605                   0         0          0        0  6112544251872  13374192956    28493480  1806575   1806575
 OSD_STAT  USED    AVAIL   USED_RAW  TOTAL
 sum       21 TiB  47 TiB    21 TiB  68 TiB

 -----ceph pg dump pools
 POOLID  OBJECTS   MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES          OMAP_BYTES*  OMAP_KEYS*  LOG     DISK_LOG
 8          31771                   0         0          0        0   131337887503         2482         140  401246    401246
 7         839707                   0         0          0        0  3519034650971          736          61  399328    399328
 6        1319576                   0         0          0        0      421044421  13374189738    28493279  206749    206749
 5       27526539                   0         0          0        0  2461702171417            0           0  792165    792165
 2             12                   0         0          0        0       48497560            0           0    6991      6991
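
 To cross-check the two tables, the per-pool rows can be parsed and summed. This is a
minimal Python sketch: the function name parse_pool_stats is an assumption for this
example, and the sample rows are the values above with wrapped lines re-joined.

```python
# Column names and rows taken from the "ceph pg dump pools" output above.
HEADER = ("POOLID OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND "
          "BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG")

ROWS = """\
8 31771 0 0 0 0 131337887503 2482 140 401246 401246
7 839707 0 0 0 0 3519034650971 736 61 399328 399328
6 1319576 0 0 0 0 421044421 13374189738 28493279 206749 206749
5 27526539 0 0 0 0 2461702171417 0 0 792165 792165
2 12 0 0 0 0 48497560 0 0 6991 6991
"""


def parse_pool_stats(header, rows):
    """Turn whitespace-separated pool rows into a list of dicts keyed by column."""
    names = [name.rstrip("*") for name in header.split()]
    pools = []
    for line in rows.splitlines():
        fields = line.split()
        if len(fields) != len(names):
            continue  # skip blank or truncated lines
        pools.append(dict(zip(names, map(int, fields))))
    return pools


pools = parse_pool_stats(HEADER, ROWS)
busiest = max(pools, key=lambda p: p["OMAP_KEYS"])
total_keys = sum(p["OMAP_KEYS"] for p in pools)
total_objects = sum(p["OBJECTS"] for p in pools)
print(busiest["POOLID"], busiest["OMAP_KEYS"], total_keys, total_objects)
```

 With these values, pool 6 (the metadata_pool in the ceph fs dump output) holds 28493279
of the 28493480 omap keys, and the per-pool object counts add up to the 29717605 objects
in the summary, so the two tables agree with each other.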
 _______________________________________________
 ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an 
 email to ceph-users-leave(a)ceph.io
