On Sat, Dec 5, 2020 at 5:41 AM Janek Bevendorff
<janek.bevendorff(a)uni-weimar.de> wrote:
> On 05/12/2020 09:26, Dan van der Ster wrote:
>> Hi Janek,
>> I'd love to hear your standard maintenance procedures. Are you
>> cleaning up those open files outside of "rejoin" OOMs ?
>
> No, of course not. But those rejoin problems happen more often than I'd
> like them to. It has become much better with recent releases, but if one
> of the clients trains a Tensorflow model from files in the CephFS or
> when our Hadoop cluster starts reading from it, the MDS will almost
> certainly crash or at least degrade massively in performance. S3 doesn't
> have these problems at all, obviously.

This sounds like one or a few clients are acquiring too many
caps. Have you checked this? Are there any messages about the OOM
killer? What config changes have you made for the MDS?
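
One way to check the per-client cap counts is to rank the sessions
reported by `ceph tell mds.<name> session ls`. The following is only a
minimal sketch: it assumes the command prints a JSON list in which each
session has an `id` and a `num_caps` field, which you should verify
against the output of your release. The helper names
(`top_cap_holders`, `fetch_sessions`) are made up for illustration.

```python
import json
import subprocess


def top_cap_holders(sessions, limit=5):
    """Return (session id, num_caps) pairs, largest cap count first.

    `sessions` is the parsed JSON list from `session ls`. The field
    names "id" and "num_caps" are assumptions about that output.
    """
    ranked = sorted(sessions, key=lambda s: s.get("num_caps", 0), reverse=True)
    return [(s.get("id"), s.get("num_caps", 0)) for s in ranked[:limit]]


def fetch_sessions(mds_name):
    """Run `ceph tell mds.<mds_name> session ls` and parse its JSON output."""
    result = subprocess.run(
        ["ceph", "tell", f"mds.{mds_name}", "session", "ls"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Example use on a node with admin access (the MDS name "a" is hypothetical):
#   for client_id, caps in top_cap_holders(fetch_sessions("a")):
#       print(client_id, caps)
```

If one or two sessions hold far more caps than the rest, those are the
clients to look at first.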
I'm hopeful your problems will be addressed by:
https://tracker.ceph.com/issues/47307
--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D