Yes, the MDS consumes all memory and swap, stays like that for a while, and
then frees the memory.
mds_beacon_grace was already set to 1800
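For reference, a hedged sketch of how the grace could be verified and set on a
Mimic (13.2.x) cluster; the daemon name mds.mds2 is only an example and should
be replaced with the actual MDS name:

```shell
# Check the value a running daemon is actually using (run on the MDS host)
ceph daemon mds.mds2 config get mds_beacon_grace

# Set it cluster-wide via the monitors' config store (available since Mimic)
ceph config set mds mds_beacon_grace 1800

# Or inject it into a running daemon without a restart
ceph tell mds.mds2 injectargs '--mds_beacon_grace=1800'
```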
Another MDS also logs this message: "Map has assigned me to become a
standby."
Does it matter which MDS we stop and which we leave running?
Anton
On 04.12.2020 11:53, Dan van der Ster wrote:
> How many active MDS's did you have? (max_mds == 1, right?)
>
> Stop the other two MDS's so you can focus on getting exactly one running.
> Tail the log file and see what it is reporting.
> Increase mds_beacon_grace to 600 so that the mon doesn't fail this MDS
> while it is rejoining.
>
> Is that single MDS running out of memory during the rejoin phase?
>
> -- dan
>
> On Fri, Dec 4, 2020 at 10:49 AM Anton Aleksandrov <anton(a)aleksandrov.eu> wrote:
>> Hello community,
>>
>> we are on ceph 13.2.8 - today something happened to one MDS, and ceph
>> status reports that the filesystem is degraded. It won't mount either. I
>> have taken the server with the non-working MDS down. There are 2 more MDS
>> servers, but they stay in the "rejoin" state. Also, only 1 is shown in
>> "services", even though there are 2.
>>
>> Both running MDS servers have these lines in their logs:
>>
>> heartbeat_map is_healthy 'MDSRank' had timed out after 15
>> mds.beacon.mds2 Skipping beacon heartbeat to monitors (last acked
>> 28.8979s ago); MDS internal heartbeat is not healthy!
>>
>> On one of the MDS nodes I enabled more detailed debug logging, so I am
>> also getting:
>>
>> mds.beacon.mds3 Sending beacon up:standby seq 178
>> mds.beacon.mds3 received beacon reply up:standby seq 178 rtt 0.000999968
>>
>> This makes no sense to me, and it's causing too much stress... Could anyone help, please?
>>
>> Anton.
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io
>> To unsubscribe send an email to ceph-users-leave(a)ceph.io