On Mon, Aug 5, 2019 at 12:21 AM Janek Bevendorff wrote:
You can also try increasing the aggressiveness of the MDS recall, but I'm surprised it's still a problem with the settings I gave you:
ceph config set mds mds_recall_max_caps 15000
ceph config set mds mds_recall_max_decay_rate 0.75
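As a quick sanity check, and assuming the overrides were applied centrally with `ceph config set` rather than in ceph.conf, something like the following should show what the running MDS actually sees (substitute the MDS name for <id> and run the `ceph daemon` commands on the MDS host):
ceph config dump | grep recall
ceph daemon mds.<id> config get mds_recall_max_caps
ceph daemon mds.<id> config get mds_recall_max_decay_rate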
I finally had the chance to try the more aggressive recall settings, but
they did not change anything. As soon as the client starts copying files
again, the numbers go up and I get a health message that the client is
failing to respond to cache pressure.
After this week of idle time, the dns/inos numbers (what does dns stand
for, anyway?) settled at around 8000k. That's basically the "idle"
number it goes back to when the client stops copying files. Though,
for some weird reason, this number gets quite a bit higher every time
(last time it was around 960k). Of course, I wouldn't expect it to go
all the way back to zero, because that would mean dropping the entire
cache for no reason, but it's still quite high and stays the same even
after restarting the MDS and all clients, which doesn't make a lot of
sense to me. After resuming the copy job, the number went up to 20M in
just the time it took to write this email. There must be a bug somewhere.
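For what it's worth, one way to watch these counters while the copy job runs, assuming access to the active MDS and substituting its name for <id>, is something along these lines:
ceph fs status                            # dns/inos per MDS rank
ceph daemon mds.<id> perf dump mds_mem    # cached dentries (dn) and inodes (ino)
ceph daemon mds.<id> session ls           # per-client cap counts (num_caps)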
Can you share two captures of `ceph daemon mds.X perf dump` about 1
I attached the requested perf dumps.
Thanks, that helps. Looks like the problem is that the MDS is not
automatically trimming its cache fast enough. Please try bumping:
ceph config set mds mds_cache_trim_threshold 512K
Increase it further if it's not aggressive enough. Please let us know
if that helps.
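Just to spell out "increase it further": the same command takes a larger value; 1M below is only an illustrative next step, not a tuned recommendation, and <id> stands in for the MDS name:
ceph config set mds mds_cache_trim_threshold 1M
ceph daemon mds.<id> config get mds_cache_trim_threshold    # confirm the value the MDS is using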
It shouldn't be necessary to do this, so I'll make a tracker ticket
once we confirm that's the issue.
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA