[Ceph-users] Re: MDS failing under load with large cache sizes

6 Aug 2019

On Tue, Aug 6, 2019 at 12:48 AM Janek Bevendorff
<janek.bevendorff(a)uni-weimar.de> wrote:
...
> However, now my client processes are basically in constant I/O wait
> state and the CephFS is slow for everybody. After I restarted the copy
> job, I got around 4k reqs/s and then it went down to 100 reqs/s with
> everybody waiting their turn. So yes, it does seem to help, but it
> increases latency by a magnitude.

4k req/s is too fast for a create workload on one MDS. That must
include other operations like getattr.

...
> Addition: I reduced the number to 256K and the cache size started
> inflating instantly (with about 140 reqs/s). So I reset it to 512K and
> the cache size started reducing slowly, though with fewer reqs/s.
>
> So I guess it is solving the problem, but only by trading it off against
> severe latency issues (order of magnitude as we saw).

I wouldn't expect such extreme latency issues. Please share:

ceph config dump
ceph daemon mds.X cache status

and the two perf dumps one second apart again please.
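
If it helps, the whole request can be collected in one go. The following is only a rough sketch: the function name, the output file names, and the CEPH_BIN override are made up for illustration, and "X" stands for your MDS name as above. It assumes it runs on the host where the MDS admin socket is reachable.

```shell
#!/bin/sh
# Sketch: gather the requested diagnostics for one MDS.
# CEPH_BIN is a hypothetical override for the ceph binary; defaults to "ceph".
CEPH_BIN="${CEPH_BIN:-ceph}"

# collect_mds_diagnostics NAME [OUT_DIR]
#   NAME    the MDS name, i.e. the "X" in mds.X
#   OUT_DIR directory for the output files (default: current directory)
collect_mds_diagnostics() {
    mds_name="$1"
    out_dir="${2:-.}"

    # Cluster-wide configuration and the current MDS cache status.
    "$CEPH_BIN" config dump > "${out_dir}/config_dump.txt" || return 1
    "$CEPH_BIN" daemon "mds.${mds_name}" cache status > "${out_dir}/cache_status.json" || return 1

    # Two perf dumps, taken one second apart.
    "$CEPH_BIN" daemon "mds.${mds_name}" perf dump > "${out_dir}/perf_dump_1.json" || return 1
    sleep 1
    "$CEPH_BIN" daemon "mds.${mds_name}" perf dump > "${out_dir}/perf_dump_2.json" || return 1
}
```

Calling `collect_mds_diagnostics X /tmp` would leave four files in /tmp that can be attached to the reply.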

Also, you said you removed the aggressive recall changes. I assume you
didn't reset them to the defaults, right? Just the first suggested
change (10k/1.0)?

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
