Thank you very much! This solution helped:
Stop all MDS, then:
# rados -p cephfs_metadata_pool rm mds0_openfiles.0
then start one MDS.
We are back online. Amazing!!! :)
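For the archives, the full sequence we ran was roughly the following. This is a sketch, not an exact transcript: the pool name comes from this thread, while the systemd unit names and the single-rank layout are assumptions from our setup and may differ on other clusters.

```shell
# Sketch of the recovery steps, assuming one active MDS (rank 0) and a
# metadata pool named cephfs_metadata_pool.

# 1. Stop every MDS daemon (run on each MDS host; unit name may differ):
systemctl stop ceph-mds.target

# 2. Remove the open-files object for rank 0 from the metadata pool.
#    (With more active ranks there would also be mds1_openfiles.0, etc.)
rados -p cephfs_metadata_pool rm mds0_openfiles.0

# 3. Start exactly one MDS and watch it move through replay/rejoin/active:
systemctl start ceph-mds@mds1
ceph fs status
```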
On 04.12.2020 12:20, Dan van der Ster wrote:
> Please also make sure mds_beacon_grace is set high on the mons.
>
> It doesn't matter which MDS you select to be the running one.
>
> Is the process getting killed and restarted?
> If you're confident that the mds is getting OOM killed during rejoin
> step, then you might find this useful:
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028964.html
>
> Stop all MDS, then:
> # rados -p cephfs_metadata_pool rm mds0_openfiles.0
> then start one MDS.
>
> -- Dan
>
> On Fri, Dec 4, 2020 at 11:05 AM Anton Aleksandrov <anton(a)aleksandrov.eu> wrote:
>> Yes, the MDS eats all memory+swap, stays like that for a moment, and
>> then frees the memory.
>>
>> mds_beacon_grace was already set to 1800
>>
>> The other MDS also logs this message: "Map has assigned me to become a
>> standby."
>>
>> Does it matter, which MDS we stop and which we leave running?
>>
>> Anton
>>
>>
>> On 04.12.2020 11:53, Dan van der Ster wrote:
>>> How many active MDS's did you have? (max_mds == 1, right?)
>>>
>>> Stop the other two MDS's so you can focus on getting exactly one running.
>>> Tail the log file and see what it is reporting.
>>> Increase mds_beacon_grace to 600 so that the mon doesn't fail this MDS
>>> while it is rejoining.
>>>
>>> Is that single MDS running out of memory during the rejoin phase?
>>>
>>> -- dan
>>>
>>> On Fri, Dec 4, 2020 at 10:49 AM Anton Aleksandrov <anton(a)aleksandrov.eu> wrote:
>>>> Hello community,
>>>>
>>>> we are on Ceph 13.2.8. Today something happened with one MDS, and ceph
>>>> status reports that the filesystem is degraded. It won't mount either. I
>>>> have taken the server with the non-working MDS down. There are 2 more MDS
>>>> servers, but they stay in the "rejoin" state. Also, only 1 is shown in
>>>> "services", even though there are 2.
>>>>
>>>> Both running MDS servers have these lines in their logs:
>>>>
>>>> heartbeat_map is_healthy 'MDSRank' had timed out after 15
>>>> mds.beacon.mds2 Skipping beacon heartbeat to monitors (last acked
>>>> 28.8979s ago); MDS internal heartbeat is not healthy!
>>>>
>>>> On one of the MDS nodes I enabled more detailed debugging, so there I
>>>> also see:
>>>>
>>>> mds.beacon.mds3 Sending beacon up:standby seq 178
>>>> mds.beacon.mds3 received beacon reply up:standby seq 178 rtt 0.000999968
>>>>
>>>> This makes no sense to me and is causing a lot of stress... Could
>>>> anyone help, please?
>>>>
>>>> Anton.
>>>> _______________________________________________
>>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io