On Mon, Aug 5, 2019 at 12:21 AM Janek Bevendorff wrote:
You can also try increasing the aggressiveness of the MDS recall, but I'm surprised it's still a problem with the settings I gave you:
ceph config set mds mds_recall_max_caps 15000
ceph config set mds mds_recall_max_decay_rate 0.75
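As a quick sanity check, and assuming the overrides were applied centrally with `ceph config set` rather than in ceph.conf, something like the following should show what the running MDS actually sees (substitute the MDS name for <id> and run the `ceph daemon` commands on the MDS host):
ceph config dump | grep recall
ceph daemon mds.<id> config get mds_recall_max_caps
ceph daemon mds.<id> config get mds_recall_max_decay_rate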
I finally had the chance to try the more aggressive recall settings, but
they did not change anything. As soon as the client starts copying files
again, the numbers go up and I get a health message that the client is
failing to respond to cache pressure.
After this week of idle time, the dns/inos numbers (what does dns stand
for, anyway?) settled at around 8000k. That's basically the "idle"
number it goes back to when the client stops copying files. Though,
for some weird reason, this number gets quite a bit higher every time
(last time it was around 960k). Of course, I wouldn't expect it to go
all the way back to zero, because that would mean dropping the entire
cache for no reason, but it's still quite high and stays the same even
after restarting the MDS and all clients, which doesn't make a lot of
sense to me. After resuming the copy job, the number went up to 20M in
just the time it took to write this email. There must be a bug somewhere.
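For what it's worth, one way to watch these counters while the copy job runs, assuming access to the active MDS and substituting its name for <id>, is something along these lines:
ceph fs status                            # dns/inos per MDS rank
ceph daemon mds.<id> perf dump mds_mem    # cached dentries (dn) and inodes (ino)
ceph daemon mds.<id> session ls           # per-client cap counts (num_caps)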
Can you share two captures of `ceph daemon mds.X perf dump` about 1
I attached the requested perf dumps.
Thanks, that helps. Looks like the problem is that the MDS is not
automatically trimming its cache fast enough. Please try bumping:
ceph config set mds mds_cache_trim_threshold 512K
Increase it further if it's not aggressive enough. Please let us know
if that helps.
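Just to spell out "increase it further": the same command takes a larger value; 1M below is only an illustrative next step, not a tuned recommendation, and <id> stands in for the MDS name:
ceph config set mds mds_cache_trim_threshold 1M
ceph daemon mds.<id> config get mds_cache_trim_threshold    # confirm the value the MDS is using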
It shouldn't be necessary to do this, so I'll make a tracker ticket
once we confirm that's the issue.
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA