Right, I just figured from the health output that you would have a
couple of seconds or so to query the daemon:
> mds: 1/1 daemons up
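Something like this (untested) polling loop might catch that short
window right after "recovery_done" while the MDS is still active; the
mds.0 target is just taken from your earlier "ceph tell" attempt:

  while ! ceph tell mds.0 damage ls 2>/dev/null; do
      # retry until the MDS briefly reaches up:active and the
      # command succeeds once before the next crash
      sleep 0.5
  done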
Quoting Alexey GERASIMOV <alexey.gerasimov(a)opencascade.com>:
> Ok, we will create the ticket.
>
> Eugen Block - ceph tell command needs to communicate with the MDS
> daemon running, but it is crashed. So, I just have the information
> about the impossibility to receive the information from daemon:
>
> ceph tell mds.0 damage ls
> Error ENOENT: problem getting command descriptions from mds.0
>
> ---
> Best regards,
>
> Alexey Gerasimov
> System Manager
>
> www.opencascade.com
> www.capgemini.com
>
> -----Original Message-----
> From: Xiubo Li <xiubli(a)redhat.com>
> Sent: Monday, April 22, 2024 2:21 AM
> To: Alexey GERASIMOV <alexey.gerasimov(a)opencascade.com>; ceph-users(a)ceph.io
> Subject: Re: [ceph-users] MDS crash
>
> Hi Alexey,
>
> This looks like a new issue to me. Please create a tracker for it and
> provide the detailed call trace there.
>
> Thanks
>
> - Xiubo
>
> On 4/19/24 05:42, alexey.gerasimov(a)opencascade.com wrote:
>> Dear colleagues, I hope somebody can help us.
>>
>> The starting point: a Ceph cluster v15.2 (installed and managed by
>> Proxmox) with 3 nodes based on physical servers rented from a
>> cloud provider. CephFS is also installed.
>>
>> Yesterday we discovered that some of our applications had stopped
>> working. During the investigation we realized that we had a
>> problem with Ceph, more precisely with CephFS - the MDS daemons
>> had suddenly crashed. We tried to restart them and found that they
>> crashed again immediately after starting. The crash information:
>> 2024-04-17T17:47:42.841+0000 7f959ced9700 1 mds.0.29134
>> recovery_done -- successful recovery!
>> 2024-04-17T17:47:42.853+0000 7f959ced9700 1 mds.0.29134 active_start
>> 2024-04-17T17:47:42.881+0000 7f959ced9700 1 mds.0.29134 cluster recovered.
>> 2024-04-17T17:47:43.825+0000 7f959aed5700 -1
>> ./src/mds/OpenFileTable.cc: In function 'void
>> OpenFileTable::commit(MDSContext*, uint64_t, int)' thread 7f959aed5700
>> time 2024-04-17T17:47:43.831243+0000
>> ./src/mds/OpenFileTable.cc: 549: FAILED ceph_assert(count > 0)
>>
>> Over the next hours we read tons of articles, studied the
>> documentation, and checked the overall state of the Ceph cluster
>> with various diagnostic commands - but didn't find anything wrong.
>> In the evening we decided to upgrade to v16, and finally to v17.2.7.
>> Unfortunately, that didn't solve the problem; the MDS continues to
>> crash with the same error. The only difference we found is "1 MDSs
>> report damaged metadata" in the output of ceph -s - see below.
>>
>> I supposed it might be a well-known bug, but couldn't find a
>> matching one on https://tracker.ceph.com - there are several bugs
>> associated with the file OpenFileTable.cc, but none related to
>> ceph_assert(count > 0).
>>
>> We also checked the source code of OpenFileTable.cc; here is a
>> fragment of it, from the function OpenFileTable::_journal_finish:
>>
>> int omap_idx = anchor.omap_idx;                // omap object holding this entry
>> unsigned& count = omap_num_items.at(omap_idx); // in-memory item count for that omap object
>> ceph_assert(count > 0);                        // fires if the count is already zero
>>
>> So we guess that the item count is already zero for some object map
>> in Ceph when the MDS updates it, which is unexpected behavior. But
>> again, we found nothing wrong in our cluster…
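>>
>> To sanity-check that, one could count the omap keys of the
>> OpenFileTable objects directly. A minimal sketch, assuming the
>> objects are named mds0_openfiles.N and that the metadata pool is
>> called cephfs_metadata (substitute the real pool name):
>>
>> for obj in $(rados -p cephfs_metadata ls | grep '^mds0_openfiles'); do
>>     # print the number of omap keys per OpenFileTable object
>>     echo "$obj: $(rados -p cephfs_metadata listomapkeys "$obj" | wc -l)"
>> done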
>>
>> Next, we turned to the
>> https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/
>> article - we tried to reset the journal (even though it was OK the
>> whole time) and wiped the sessions using the cephfs-table-tool all
>> reset session command. No result… Now I have decided to continue
>> following this article and have started the cephfs-data-scan
>> scan_extents command, which is running right now (the remaining
>> steps from the article are sketched below). But I doubt it will
>> solve the issue, since there seems to be nothing wrong with our
>> objects in Ceph.
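>>
>> For reference, the remaining steps from that article after
>> scan_extents are, as far as I understand them (the data pool name
>> is a placeholder):
>>
>> cephfs-data-scan scan_extents <data pool>   # running now
>> cephfs-data-scan scan_inodes <data pool>
>> cephfs-data-scan scan_links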
>>
>> Is it a new bug, or something else? Any idea is welcome!
>>
>> The important outputs:
>>
>> ----- ceph -s
>> cluster:
>> id: 4cd1c477-c8d0-4855-a1f1-cb71d89427ed
>> health: HEALTH_ERR
>> 1 MDSs report damaged metadata
>> insufficient standby MDS daemons available
>> 83 daemons have recently crashed
>> 3 mgr modules have recently crashed
>>
>> services:
>> mon: 3 daemons, quorum asrv-dev-stor-2,asrv-dev-stor-3,asrv-dev-stor-1 (age 22h)
>> mgr: asrv-dev-stor-2(active, since 22h), standbys: asrv-dev-stor-1
>> mds: 1/1 daemons up
>> osd: 18 osds: 18 up (since 22h), 18 in (since 29h)
>>
>> data:
>> volumes: 1/1 healthy
>> pools: 5 pools, 289 pgs
>> objects: 29.72M objects, 5.6 TiB
>> usage: 21 TiB used, 47 TiB / 68 TiB avail
>> pgs: 287 active+clean
>> 2 active+clean+scrubbing+deep
>>
>> io:
>> client: 2.5 KiB/s rd, 172 KiB/s wr, 261 op/s rd, 195 op/s wr
>>
>> -----ceph fs dump
>> e29480
>> enable_multiple, ever_enabled_multiple: 0,1
>> default compat: compat={},rocompat={},incompat={1=base v0.20,2=client
>> writeable ranges,3=default file layouts on dirs,4=dir inode in separate
>> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds
>> uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
>> legacy client fscid: 1
>>
>> Filesystem 'cephfs' (1)
>> fs_name cephfs
>> epoch 29480
>> flags 12 joinable allow_snaps allow_multimds_snaps
>> created 2022-11-25T15:56:08.507407+0000
>> modified 2024-04-18T16:52:29.970504+0000
>> tableserver 0
>> root 0
>> session_timeout 60
>> session_autoclose 300
>> max_file_size 1099511627776
>> required_client_features {}
>> last_failure 0
>> last_failure_osd_epoch 14728
>> compat compat={},rocompat={},incompat={1=base v0.20,2=client
>> writeable ranges,3=default file layouts on dirs,4=dir inode in
>> separate object,5=mds uses versioned encoding,6=dirfrag is stored
>> in omap,7=mds uses inline data,8=no anchor table,9=file layout
>> v2,10=snaprealm v2}
>> max_mds 1
>> in 0
>> up {0=156636152}
>> failed
>> damaged
>> stopped
>> data_pools [5]
>> metadata_pool 6
>> inline_data disabled
>> balancer
>> standby_count_wanted 1
>> [mds.asrv-dev-stor-1{0:156636152} state up:active seq 6 laggy since
>> 2024-04-18T16:52:29.970479+0000 addr
>> [v2:172.22.2.91:6800/2487054023,v1:172.22.2.91:6801/2487054023] compat
>> {c=[1],r=[1],i=[7ff]}]
>>
>> -----cephfs-journal-tool --rank=cephfs:0 journal inspect
>> Overall journal integrity: OK
>>
>> -----ceph pg dump summary
>> version 41137
>> stamp 2024-04-18T21:17:59.133536+0000
>> last_osdmap_epoch 0
>> last_pg_scan 0
>> PG_STAT  OBJECTS   MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES          OMAP_BYTES*  OMAP_KEYS*  LOG      DISK_LOG
>> sum      29717605  0                   0         0          0        6112544251872  13374192956  28493480    1806575  1806575
>> OSD_STAT  USED    AVAIL   USED_RAW  TOTAL
>> sum       21 TiB  47 TiB  21 TiB    68 TiB
>>
>> -----ceph pg dump pools
>> POOLID  OBJECTS   MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES          OMAP_BYTES*  OMAP_KEYS*  LOG     DISK_LOG
>> 8       31771     0                   0         0          0        131337887503   2482         140         401246  401246
>> 7       839707    0                   0         0          0        3519034650971  736          61          399328  399328
>> 6       1319576   0                   0         0          0        421044421      13374189738  28493279    206749  206749
>> 5       27526539  0                   0         0          0        2461702171417  0            0           792165  792165
>> 2       12        0                   0         0          0        48497560       0            0           6991    6991
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io