This sounds like there is one or a few clients acquiring too many
caps. Have you checked this? Are there any messages about the OOM
killer? What config changes for the MDS have you made?
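One way to check which clients hold the most caps is to rank the sessions reported by the MDS. The following is a minimal sketch, assuming the JSON from `ceph tell mds.<name> session ls` is a list of session objects with `id`, `num_caps` and `client_metadata.hostname` fields; the sample data and the function name `top_cap_holders` are made up for illustration.

```python
import json

# Hypothetical sample shaped like the assumed `session ls` JSON output.
SAMPLE = """
[
  {"id": 4301, "num_caps": 1200,   "client_metadata": {"hostname": "node-a"}},
  {"id": 4305, "num_caps": 900000, "client_metadata": {"hostname": "node-b"}},
  {"id": 4310, "num_caps": 35000,  "client_metadata": {"hostname": "node-c"}}
]
"""


def top_cap_holders(sessions, limit=5):
    """Return (client id, hostname, num_caps) tuples, largest cap count first."""
    rows = []
    for session in sessions:
        meta = session.get("client_metadata", {})
        rows.append(
            (session.get("id"), meta.get("hostname", "?"), session.get("num_caps", 0))
        )
    # Sort by the cap count (third field), descending.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]


for client_id, hostname, caps in top_cap_holders(json.loads(SAMPLE), limit=2):
    print(f"{client_id}\t{hostname}\t{caps}")
```

On a real cluster, the saved output of the session listing would replace `SAMPLE`; field names may differ between releases, so check them against your own output first.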
Yes, it's individual clients acquiring too many caps. I first ran the
adjusted recall settings you suggested after we had gone through several
bugs. Right now I am trying distributed ephemeral pinning with 3 MDS and
Dan's suggestion of 6x the default values for recall from the MDS
documentation thread. So far, it's working quite well.
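For reference, the "6x the defaults" idea can be sketched as a small script that generates the corresponding `ceph config set` commands. This is only an illustration: the option list and the default values below are assumptions, not taken from this thread, and defaults differ between releases, so verify each one with `ceph config help <option>` before applying anything.

```python
# ASSUMED defaults for illustration only; verify them on your own release.
ASSUMED_DEFAULTS = {
    "mds_recall_max_caps": 5000,
    "mds_recall_max_decay_threshold": 16384,
    "mds_recall_global_max_decay_threshold": 65536,
    "mds_recall_warning_threshold": 32768,
}


def scaled_recall_commands(defaults, factor=6):
    """Build `ceph config set mds ...` commands with each value scaled by factor."""
    return [
        f"ceph config set mds {name} {int(value * factor)}"
        for name, value in defaults.items()
    ]


# Print the commands for review; nothing is executed against the cluster.
for command in scaled_recall_commands(ASSUMED_DEFAULTS):
    print(command)
```

The script only prints commands, so they can be reviewed and run by hand one at a time.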
I'm hopeful your problems will be addressed by:
https://tracker.ceph.com/issues/47307
That does indeed sound a bit like it might fix these kinds of issues.