Right, I just figured from the health output that you would have a
couple of seconds or so to query the daemon:
> mds: 1/1 daemons up
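Something like this (untested) polling loop might catch that short
window right after "recovery_done" while the MDS is still active; the
mds.0 target is just taken from your earlier "ceph tell" attempt:

  while ! ceph tell mds.0 damage ls 2>/dev/null; do
      # retry until the MDS briefly reaches up:active and the
      # command succeeds once before the next crash
      sleep 0.5
  done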
Quoting Alexey GERASIMOV <alexey.gerasimov(a)opencascade.com>:
> Ok, we will create the ticket.
>
> Eugen Block - ceph tell command needs to communicate with the MDS
> daemon running, but it is crashed. So, I just have the information
> about the impossibility to receive the information from daemon:
>
> ceph tell mds.0 damage ls
> Error ENOENT: problem getting command descriptions from mds.0
>
> ---
> Best regards,
>
> Alexey Gerasimov
> System Manager
>
> www.opencascade.com
> www.capgemini.com
>
> -----Original Message-----
> From: Xiubo Li <xiubli(a)redhat.com>
> Sent: Monday, April 22, 2024 2:21 AM
> To: Alexey GERASIMOV <alexey.gerasimov(a)opencascade.com>; ceph-users(a)ceph.io
> Subject: Re: [ceph-users] MDS crash
>
> Hi Alexey,
>
> This looks like a new issue to me. Please create a tracker for it and
> provide the detailed call trace there.
>
> Thanks
>
> - Xiubo
>
> On 4/19/24 05:42, alexey.gerasimov(a)opencascade.com wrote:
>> Dear colleagues, I hope somebody can help us.
>>
>> The starting point: a Ceph cluster v15.2 (installed and managed by
>> Proxmox) with 3 nodes based on physical servers rented from a
>> cloud provider. CephFS is also installed.
>>
>> Yesterday we discovered that some of our applications had stopped
>> working. During the investigation we realized that we had a
>> problem with Ceph, more precisely with CephFS - the MDS daemons
>> had suddenly crashed. We tried to restart them and found that they
>> crashed again immediately after starting. The crash information:
>> 2024-04-17T17:47:42.841+0000 7f959ced9700 1 mds.0.29134
>> recovery_done -- successful recovery!
>> 2024-04-17T17:47:42.853+0000 7f959ced9700 1 mds.0.29134 active_start
>> 2024-04-17T17:47:42.881+0000 7f959ced9700 1 mds.0.29134 cluster recovered.
>> 2024-04-17T17:47:43.825+0000 7f959aed5700 -1
>> ./src/mds/OpenFileTable.cc: In function 'void
>> OpenFileTable::commit(MDSContext*, uint64_t, int)' thread 7f959aed5700
>> time 2024-04-17T17:47:43.831243+0000
>> ./src/mds/OpenFileTable.cc: 549: FAILED ceph_assert(count > 0)
>>
>> Over the next hours we read tons of articles, studied the
>> documentation, and checked the overall state of the Ceph cluster
>> with various diagnostic commands - but didn't find anything wrong.
>> In the evening we decided to upgrade to v16, and finally to v17.2.7.
>> Unfortunately, that didn't solve the problem; the MDS continues to
>> crash with the same error. The only difference we found is "1 MDSs
>> report damaged metadata" in the output of ceph -s - see below.
>>
>> I supposed it might be a well-known bug, but couldn't find a
>> matching one on https://tracker.ceph.com - there are several bugs
>> associated with the file OpenFileTable.cc, but none related to
>> ceph_assert(count > 0).
>>
>> We also checked the source code of OpenFileTable.cc; here is a
>> fragment of it, from the function OpenFileTable::_journal_finish:
>>
>> int omap_idx = anchor.omap_idx;                // omap object holding this entry
>> unsigned& count = omap_num_items.at(omap_idx); // in-memory item count for that omap object
>> ceph_assert(count > 0);                        // fires if the count is already zero
>>
>> So we guess that the item count is already zero for some object map
>> in Ceph when the MDS updates it, which is unexpected behavior. But
>> again, we found nothing wrong in our cluster…
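>>
>> To sanity-check that, one could count the omap keys of the
>> OpenFileTable objects directly. A minimal sketch, assuming the
>> objects are named mds0_openfiles.N and that the metadata pool is
>> called cephfs_metadata (substitute the real pool name):
>>
>> for obj in $(rados -p cephfs_metadata ls | grep '^mds0_openfiles'); do
>>     # print the number of omap keys per OpenFileTable object
>>     echo "$obj: $(rados -p cephfs_metadata listomapkeys "$obj" | wc -l)"
>> done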
>>
>> Next, we turned to the
>> https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/
>> article - we tried to reset the journal (even though it was OK the
>> whole time) and wiped the sessions using the cephfs-table-tool all
>> reset session command. No result… Now I have decided to continue
>> following this article and have started the cephfs-data-scan
>> scan_extents command, which is running right now (the remaining
>> steps from the article are sketched below). But I doubt it will
>> solve the issue, since there seems to be nothing wrong with our
>> objects in Ceph.
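>>
>> For reference, the remaining steps from that article after
>> scan_extents are, as far as I understand them (the data pool name
>> is a placeholder):
>>
>> cephfs-data-scan scan_extents <data pool>   # running now
>> cephfs-data-scan scan_inodes <data pool>
>> cephfs-data-scan scan_links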
>>
>> Is it a new bug, or something else? Any idea is welcome!
>>
>> The important outputs:
>>
>> ----- ceph -s
>> cluster:
>> id: 4cd1c477-c8d0-4855-a1f1-cb71d89427ed
>> health: HEALTH_ERR
>> 1 MDSs report damaged metadata
>> insufficient standby MDS daemons available
>> 83 daemons have recently crashed
>> 3 mgr modules have recently crashed
>>
>> services:
>> mon: 3 daemons, quorum asrv-dev-stor-2,asrv-dev-stor-3,asrv-dev-stor-1 (age 22h)
>> mgr: asrv-dev-stor-2(active, since 22h), standbys: asrv-dev-stor-1
>> mds: 1/1 daemons up
>> osd: 18 osds: 18 up (since 22h), 18 in (since 29h)
>>
>> data:
>> volumes: 1/1 healthy
>> pools: 5 pools, 289 pgs
>> objects: 29.72M objects, 5.6 TiB
>> usage: 21 TiB used, 47 TiB / 68 TiB avail
>> pgs: 287 active+clean
>> 2 active+clean+scrubbing+deep
>>
>> io:
>> client: 2.5 KiB/s rd, 172 KiB/s wr, 261 op/s rd, 195 op/s wr
>>
>> -----ceph fs dump
>> e29480
>> enable_multiple, ever_enabled_multiple: 0,1
>> default compat: compat={},rocompat={},incompat={1=base v0.20,2=client
>> writeable ranges,3=default file layouts on dirs,4=dir inode in separate
>> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds
>> uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
>> legacy client fscid: 1
>>
>> Filesystem 'cephfs' (1)
>> fs_name cephfs
>> epoch 29480
>> flags 12 joinable allow_snaps allow_multimds_snaps
>> created 2022-11-25T15:56:08.507407+0000
>> modified 2024-04-18T16:52:29.970504+0000
>> tableserver 0
>> root 0
>> session_timeout 60
>> session_autoclose 300
>> max_file_size 1099511627776
>> required_client_features {}
>> last_failure 0
>> last_failure_osd_epoch 14728
>> compat compat={},rocompat={},incompat={1=base v0.20,2=client
>> writeable ranges,3=default file layouts on dirs,4=dir inode in
>> separate object,5=mds uses versioned encoding,6=dirfrag is stored
>> in omap,7=mds uses inline data,8=no anchor table,9=file layout
>> v2,10=snaprealm v2}
>> max_mds 1
>> in 0
>> up {0=156636152}
>> failed
>> damaged
>> stopped
>> data_pools [5]
>> metadata_pool 6
>> inline_data disabled
>> balancer
>> standby_count_wanted 1
>> [mds.asrv-dev-stor-1{0:156636152} state up:active seq 6 laggy since
>> 2024-04-18T16:52:29.970479+0000 addr
>> [v2:172.22.2.91:6800/2487054023,v1:172.22.2.91:6801/2487054023] compat
>> {c=[1],r=[1],i=[7ff]}]
>>
>> -----cephfs-journal-tool --rank=cephfs:0 journal inspect
>> Overall journal integrity: OK
>>
>> -----ceph pg dump summary
>> version 41137
>> stamp 2024-04-18T21:17:59.133536+0000
>> last_osdmap_epoch 0
>> last_pg_scan 0
>> PG_STAT  OBJECTS   MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES          OMAP_BYTES*  OMAP_KEYS*  LOG      DISK_LOG
>> sum      29717605  0                   0         0          0        6112544251872  13374192956  28493480    1806575  1806575
>> OSD_STAT  USED    AVAIL   USED_RAW  TOTAL
>> sum       21 TiB  47 TiB  21 TiB    68 TiB
>>
>> -----ceph pg dump pools
>> POOLID  OBJECTS   MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES          OMAP_BYTES*  OMAP_KEYS*  LOG     DISK_LOG
>> 8       31771     0                   0         0          0        131337887503   2482         140         401246  401246
>> 7       839707    0                   0         0          0        3519034650971  736          61          399328  399328
>> 6       1319576   0                   0         0          0        421044421      13374189738  28493279    206749  206749
>> 5       27526539  0                   0         0          0        2461702171417  0            0           792165  792165
>> 2       12        0                   0         0          0        48497560       0            0           6991    6991
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io