On Mon, Dec 7, 2020 at 9:42 AM Janek Bevendorff wrote:
> > > I have played with many thresholds, including the decay rates. It is
> > > indeed very difficult to assess their effects, since our workloads
> > > differ widely depending on what people are working on at the moment. I
> > > would need to develop a proper benchmarking suite to simulate the
> > > different heavy workloads we have.
> > We currently run with all those options scaled up 6x the defaults, and
> > we almost never have caps recall warnings these days, with a couple
> > thousand CephFS clients.
> Under normal operation, we don't either. We had issues in the past with
> Ganesha and still do sometimes, but that's a bug in Ganesha and we don't
> really use it for anything but legacy clients anyway. Usually, recall
> works flawlessly, unless some client suddenly starts doing crazy shit.

+1, we've seen that Ganesha issue; it simply won't release caps ever,
even with the latest fixes in this area.
> We have just a few clients who regularly keep tens of thousands of caps
> open, and had I not limited the number, it would be hundreds of
> thousands. Recalling them without threatening stability is not trivial,
> and at the least it degrades the performance for everybody else. Any
> pointers on better handling this situation are greatly appreciated.
> I will definitely try your config recommendations.
> > 2. A user running VSCodium, keeping 15k caps open... the opportunistic
> > caps recall eventually starts recalling those, but the (el7 kernel)
> > client won't release them. Stopping Codium seems to be the only way to
> > get them released.
> As I said, 15k is not much for us. The limits right now are 64k per
> client and a few hit that limit quite regularly. One of those clients is
What exactly do you set to 64k?
We used to set mds_max_caps_per_client to 50000, but once we started
using the tuned caps recall config, we reverted that back to the
default 1M without issue.
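(For concreteness, reverting that amounts to either of these; 1048576 is
the 1M default:

    # Set the per-client cap limit back to the default of 1M...
    ceph config set mds mds_max_caps_per_client 1048576
    # ...or simply drop the override entirely:
    ceph config rm mds mds_max_caps_per_client

)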
This 15k caps client I mentioned is not related to the max caps per
client config. In recent nautilus, the MDS will proactively recall
caps from idle clients -- so a client with even just a few caps like
this can provoke the caps recall warnings (if it is buggy, like in
this case). The client doesn't cause any real problems, just the warnings.
So what I'm looking for now is a way to disable proactively recalling
if the num caps is below some threshold -- `min_caps_per_client` might
do this but I haven't tested yet.
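If it pans out, the experiment would look something like this. The full
option name is mds_min_caps_per_client; 16384 is an arbitrary example
value (just above the 15k this client holds), not a recommendation:

    # Untested idea: raise the floor below which the MDS won't
    # proactively recall caps from idle clients.
    ceph config set mds mds_min_caps_per_client 16384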
> our VPN gateway, which, technically, is not a single client, but to
> CephFS it looks like one due to source NAT. This is certainly something
> I want to tune further, so that clients are routed directly via their
> private IP instead of being NAT'ed. The other ones are our GPU deep
> learning servers (just three of them, but they can generate astounding
> numbers of iops) and the 135-node Hadoop cluster (which is hard to
> sustain for any single machine, so we prefer to use S3 here).
> > Otherwise, a 4GB mds_cache_memory_limit is normally sufficient in our
> > env (3 active MDSs); however, this is highly workload dependent. If
> > several clients are actively taking 100s of thousands of caps, then
> > the 4GB MDS needs to be ultra busy recalling caps and latency
> > increases. We saw this live a couple weeks ago: a few users started
> > doing intensive rsyncs, and some other users noticed an MD latency
> > increase; it was fixed immediately just by increasing the mem limit
> > to 8GB.
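(That live fix is a one-liner; the limit is given in bytes:

    # Double the MDS cache limit from 4 GiB to 8 GiB.
    ceph config set mds mds_cache_memory_limit 8589934592   # 8 GiB in bytes

)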
> So you too have 3 active MDSs? Are you using directory pinning? We have
> a very deep and unbalanced directory structure, so I cannot really pin
> any top-level directory without skewing the load massively. From my
> experience, three MDSs without explicit pinning aren't much better, or
> are even worse, than one. But perhaps you have different observations?
Yes, 3 active today, and lots of pinning thanks to our flat hierarchy.
User dirs are pinned to one of three ranks randomly, as are the manila
shares.
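For reference, pinning a subtree to an MDS rank is just an xattr; picking
the rank from a stable hash of the directory name is one simple way to
spread user dirs across three ranks. A sketch, not our exact tooling (the
/cephfs/users path is made up):

    #!/bin/sh
    # Sketch: pin each top-level user dir to one of 3 MDS ranks,
    # chosen by a stable hash (CRC via cksum) of the dir name.
    for d in /cephfs/users/*/; do
        name=$(basename "$d")
        rank=$(( $(printf '%s' "$name" | cksum | cut -d' ' -f1) % 3 ))
        setfattr -n ceph.dir.pin -v "$rank" "$d"
    done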
Turning the MD balancer on creates a disaster in our env: too much
ping-pong of dirs between the MDSs, too much metadata IO needed to keep
up, not to mention "nice export" bugs in the past that forced us to
disable the balancer to begin with.
We used to have 10 active MDSs, but that is such a pain during
upgrades that we're now trying with just three. Next upgrade we'll
probably leave it at one for a while to see if that suffices.
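(The upgrade dance being the usual shrink-to-one-rank-first routine;
"cephfs" below stands in for whatever your filesystem is named:

    # Reduce to a single active MDS before upgrading...
    ceph fs set cephfs max_mds 1
    # ...upgrade the daemons, then scale back out:
    ceph fs set cephfs max_mds 3

)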
Multi-active + pinning definitely increases the overall MD throughput
(once you can get the relevant inodes cached), because as you know the
MDS is single threaded and CPU bound at the limit.
We could get something like 4-5k handle_client_requests out of a
single MDS, and that really does scale horizontally as you add MDSs.
> > I agree some sort of tuning best practises should all be documented
> > somehow, even though it's complex and rather delicate.