[ceph-users] Re: MDS lost, Filesystem degraded and wont mount

7 Dec 2020

On Sat, Dec 5, 2020 at 5:41 AM Janek Bevendorff
<janek.bevendorff(a)uni-weimar.de> wrote:
...

> On 05/12/2020 09:26, Dan van der Ster wrote:
>> Hi Janek,
>>
>> I'd love to hear your standard maintenance procedures. Are you
>> cleaning up those open files outside of "rejoin" OOMs?
>
> No, of course not. But those rejoin problems happen more often than I'd
> like them to. It has become much better with recent releases, but if one
> of the clients trains a Tensorflow model from files in the CephFS or
> when our Hadoop cluster starts reading from it, the MDS will almost
> certainly crash or at least degrade massively in performance. S3 doesn't
> have these problems at all, obviously.

This sounds like there is one or a few clients acquiring too many
caps. Have you checked this? Are there any messages about the OOM
killer? What config changes for the MDS have you made?
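
One way to check for clients holding too many caps is to rank the MDS
sessions by cap count. The following is only a minimal sketch, not an
official tool: it assumes the session list was saved as JSON (for
example from "ceph tell mds.<name> session ls") and that each entry
carries "id", "num_caps" and "client_metadata.hostname" fields. Those
field names, the helper top_cap_holders and the sample data are
assumptions for illustration; verify them against the output of your
release.

```python
import json


def top_cap_holders(session_ls_json, limit=5):
    """Return (client id, hostname, num_caps) tuples, largest cap count first.

    session_ls_json is the JSON text of an MDS session listing. The field
    names used here are assumed; adjust them if your output differs.
    """
    sessions = json.loads(session_ls_json)
    rows = []
    for session in sessions:
        metadata = session.get("client_metadata") or {}
        rows.append(
            (
                session.get("id"),
                metadata.get("hostname", "unknown"),
                int(session.get("num_caps", 0)),
            )
        )
    # Sort by cap count, descending, so the heaviest clients come first.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]


# Made-up sample data, not taken from a real cluster.
SAMPLE = json.dumps(
    [
        {"id": 4301, "num_caps": 1200, "client_metadata": {"hostname": "login-01"}},
        {"id": 4305, "num_caps": 950000, "client_metadata": {"hostname": "hadoop-01"}},
        {"id": 4310, "num_caps": 120000, "client_metadata": {"hostname": "gpu-node"}},
    ]
)

for client_id, hostname, num_caps in top_cap_holders(SAMPLE, limit=2):
    print(f"client {client_id} on {hostname}: {num_caps} caps")
```

A client whose cap count is far above the rest is the first candidate
to look at when the MDS runs out of memory.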

I'm hopeful your problems will be addressed by:
https://tracker.ceph.com/issues/47307

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
