> (Only one of our test clusters saw this happen so far, during mimic
> days, and this provoked us to move all MDSs to 64GB VMs, with
> mds cache memory limit = 4GB, so there is a large amount of RAM
> available in case it's needed.)
Ours are running on machines with 128GB RAM. I tried limits between 4
and 40GB, but the higher the limit, the harder the fall after a crash.
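
For reference, the setting in question is mds_cache_memory_limit (a
value in bytes); on releases with the central config database,
something like this should set a 4GB limit for all MDS daemons:

# ceph config set mds mds_cache_memory_limit 4294967296
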
We used to have three MDSs; now I am testing out one to see if that's
more stable. At the moment it runs fine, but we have also outsourced
all the heavy lifting to S3.
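
Dropping to a single active MDS is just the following, with the fs
name as a placeholder (on recent releases the surplus ranks then stop
automatically):

# ceph fs set <fs_name> max_mds 1
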
> What do the crashes look like with the TF training? Do
> you have a tracker?
At some point the MDS becomes laggy and is killed, and not even the hot
standby is able to resume. There is nothing special going on. You only
notice that the FS is suddenly degraded and the MDS daemons are playing
Russian roulette until systemd pulls the plug due to too many daemon
failures. At that point I have to fail the remaining ones, run systemctl
reset-failed, delete the openfiles objects and restart the daemons.
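
Roughly this sequence, sketched with placeholder daemon names and the
pool/object names that appear further down this thread:

# ceph mds fail <name>                     # for each remaining MDS
# systemctl reset-failed ceph-mds@<id>.service
# rados -p cephfs_metadata_pool rm mds0_openfiles.0  # plus any other mdsX_openfiles.Y
# systemctl start ceph-mds@<id>.service
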
> How many client sessions do you need to crash an MDS?
Depends. Surprisingly, it can be as little as one big node with 1.5TB of
RAM and a few hungry GPUs.
> -- Dan
>
>
>>> I guess we're pretty lucky with our CephFS's because we have more
>>> than 1k clients and it is pretty solid (though the last upgrade had
>>> a hiccup when decreasing to a single active MDS).
>>>
>>> -- Dan
>>>
>>>
>>>
>>> On Fri, Dec 4, 2020 at 8:20 PM Janek Bevendorff
>>> <janek.bevendorff(a)uni-weimar.de> wrote:
>>>> This is a very common issue. Deleting mdsX_openfiles.Y has become part of
>>>> my standard maintenance repertoire. As soon as you have a few more
>>>> clients and one of them starts opening and closing files in rapid
>>>> succession (or does other metadata-heavy things), it becomes very likely
>>>> that the MDS crashes and is unable to recover.
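>>>>
>>>> (These objects are easy to spot; assuming the metadata pool name
>>>> used elsewhere in this thread, something like:
>>>>
>>>> # rados -p cephfs_metadata_pool ls | grep openfiles
>>>>
>>>> will list them.)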
>>>>
>>>> There have been numerous fixes in the past, which improved the overall
>>>> stability, but it is far from perfect. I am happy to see another patch
>>>> in that direction, but I believe more effort needs to be spent here. It
>>>> is way too easy to DoS the MDS from a single client. Our 78-node CephFS
>>>> beats our old NFS RAID server in terms of throughput, but latency and
>>>> stability are way behind.
>>>>
>>>> Janek
>>>>
>>>> On 04/12/2020 11:39, Dan van der Ster wrote:
>>>>> Excellent!
>>>>>
>>>>> For the record, this PR is the plan to fix this:
>>>>> https://github.com/ceph/ceph/pull/36089
>>>>> (nautilus, octopus PRs here:
>>>>> https://github.com/ceph/ceph/pull/37382
>>>>> https://github.com/ceph/ceph/pull/37383)
>>>>>
>>>>> Cheers, Dan
>>>>>
>>>>> On Fri, Dec 4, 2020 at 11:35 AM Anton Aleksandrov
>>>>> <anton(a)aleksandrov.eu> wrote:
>>>>>> Thank you very much! This solution helped:
>>>>>>
>>>>>> Stop all MDS, then:
>>>>>> # rados -p cephfs_metadata_pool rm mds0_openfiles.0
>>>>>> then start one MDS.
>>>>>>
>>>>>> We are back online. Amazing!!! :)
>>>>>>
>>>>>>
>>>>>> On 04.12.2020 12:20, Dan van der Ster wrote:
>>>>>>> Please also make sure the mds_beacon_grace is high on the
>>>>>>> mons too.
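>>>>>>>
>>>>>>> One way to do that at runtime, without restarting the mons,
>>>>>>> should be something like:
>>>>>>>
>>>>>>> # ceph tell mon.\* injectargs '--mds_beacon_grace 600'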
>>>>>>>
>>>>>>> It doesn't matter which MDS you select to be the running
>>>>>>> one.
>>>>>>>
>>>>>>> Is the process getting killed, restarted?
>>>>>>> If you're confident that the MDS is getting OOM killed during the
>>>>>>> rejoin step, then you might find this useful:
>>>>>>>
>>>>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028964.html
>>>>>>>
>>>>>>> Stop all MDS, then:
>>>>>>> # rados -p cephfs_metadata_pool rm mds0_openfiles.0
>>>>>>> then start one MDS.
>>>>>>>
>>>>>>> -- Dan
>>>>>>>
>>>>>>> On Fri, Dec 4, 2020 at 11:05 AM Anton Aleksandrov
>>>>>>> <anton(a)aleksandrov.eu> wrote:
>>>>>>>> Yes, the MDS eats all memory+swap, stays like this for a
>>>>>>>> moment and then frees memory.
>>>>>>>>
>>>>>>>> mds_beacon_grace was already set to 1800
>>>>>>>>
>>>>>>>> Also, on the other one this message is seen: Map has assigned
>>>>>>>> me to become a standby.
>>>>>>>>
>>>>>>>> Does it matter which MDS we stop and which we leave
>>>>>>>> running?
>>>>>>>>
>>>>>>>> Anton
>>>>>>>>
>>>>>>>>
>>>>>>>> On 04.12.2020 11:53, Dan van der Ster wrote:
>>>>>>>>> How many active MDS's did you have? (max_mds == 1, right?)
>>>>>>>>>
>>>>>>>>> Stop the other two MDS's so you can focus on getting exactly
>>>>>>>>> one running.
>>>>>>>>> Tail the log file and see what it is reporting.
>>>>>>>>> Increase mds_beacon_grace to 600 so that the mon doesn't fail
>>>>>>>>> this MDS while it is rejoining.
>>>>>>>>>
>>>>>>>>> Is that single MDS running out of memory during the rejoin
>>>>>>>>> phase?
>>>>>>>>>
>>>>>>>>> -- dan
>>>>>>>>>
>>>>>>>>> On Fri, Dec 4, 2020 at 10:49 AM Anton Aleksandrov
>>>>>>>>> <anton(a)aleksandrov.eu> wrote:
>>>>>>>>>> Hello community,
>>>>>>>>>>
>>>>>>>>>> we are on ceph 13.2.8 - today something happened with one MDS,
>>>>>>>>>> and ceph status says that the filesystem is degraded. It won't
>>>>>>>>>> mount either. I have taken the server with the non-working MDS
>>>>>>>>>> down. There are 2 more MDS servers, but they stay in "rejoin"
>>>>>>>>>> state. Also, only 1 is shown in "services", even though there
>>>>>>>>>> are 2.
>>>>>>>>>>
>>>>>>>>>> Both running MDS servers have these lines in their logs:
>>>>>>>>>>
>>>>>>>>>> heartbeat_map is_healthy 'MDSRank' had timed out after 15
>>>>>>>>>> mds.beacon.mds2 Skipping beacon heartbeat to monitors (last acked
>>>>>>>>>> 28.8979s ago); MDS internal heartbeat is not healthy!
>>>>>>>>>>
>>>>>>>>>> On one of the MDS nodes I enabled more detailed debug
>>>>>>>>>> logging, so there I am also getting:
>>>>>>>>>>
>>>>>>>>>> mds.beacon.mds3 Sending beacon up:standby seq 178
>>>>>>>>>> mds.beacon.mds3 received beacon reply up:standby seq 178 rtt 0.000999968
>>>>>>>>>>
>>>>>>>>>> Makes no sense and too much stress in my head...
>>>>>>>>>> Could anyone help, please?
>>>>>>>>>>
>>>>>>>>>> Anton.