Hi,
You can also try increasing the aggressiveness of the
MDS recall but
I'm surprised it's still a problem with the settings I gave you:
ceph config set mds mds_recall_max_caps 15000
ceph config set mds mds_recall_max_decay_rate 0.75
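(For reference, the active values can be read back afterwards — a sketch; `mds` here is the config section the settings were applied to:)

```shell
# read back the currently configured recall settings
ceph config get mds mds_recall_max_caps
ceph config get mds mds_recall_max_decay_rate
```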
I finally had the chance to try the more aggressive recall settings, but
they did not change anything. As soon as the client starts copying files
again, the numbers go up and I get a health warning that the client is
failing to respond to cache pressure.
After this week of idle time, the dns/inos numbers (what does dns stand
for, anyway?) settled at around 8000k. That is basically the "idle"
number it returns to when the client stops copying files. Oddly, though,
this number ends up quite a bit higher each time (last time it was
around 960k). Of course, I wouldn't expect it to drop all the way back
to zero, since that would mean discarding the entire cache for no
reason, but it is still quite high, and it stays the same even after
restarting the MDS and all clients, which doesn't make much sense to
me. After resuming the copy job, the number climbed to 20M in just the
time it took to write this email. There must be a bug somewhere.
Can you share two captures of `ceph daemon mds.X perf dump`
about 1 second apart?
I attached the requested perf dumps.
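(A capture sequence along these lines produces the two dumps — a sketch; substitute the real daemon name for `mds.X`:)

```shell
# two perf dumps roughly one second apart, for before/after comparison
ceph daemon mds.X perf dump > perf-dump-1.json
sleep 1
ceph daemon mds.X perf dump > perf-dump-2.json
```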
Thanks!