Re: Too slow CephFS MDS Restart (Recovery) Performance with Many Sessions and Large Cache Size

1 Apr 2021

On Thu, Apr 1, 2021 at 8:39 PM Dan van der Ster <dan(a)vanderster.com> wrote:

...
 On Wed, Mar 31, 2021 at 6:46 PM Patrick Donnelly <pdonnell(a)redhat.com> wrote:

 Hello Yongseok,

 On Wed, Mar 31, 2021 at 1:13 AM Yongseok Oh <yongseok.oh(a)linecorp.com> wrote:
 > ...
 > A few things I have analyzed
 > - Rejoining process consumes a considerable amount of time. That's a
 >   known issue. (Sometimes respawning MDS happened. Increasing
 >   mds_heartbeat_grace doesn't help.)

 Please turn up logging to:

 debug_mds = 5

 to get an idea what the MDS is doing when respawn occurs. 
 If it helps, here is a log with 2/5 from a recent failover which took
 3.5 minutes: https://termbin.com/b022
 This is 14.2.11 with the optimized recall/cache tuning.

The "Updating MDS map to version" messages keep appearing from init-rejoin
at around 2021-03-18 17:12:52.863 until 2021-03-18 17:22:36.356, while the
rejoin itself finished at 2021-03-18 17:15:58.325.
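
To make the gap concrete, here is a small sketch that turns the three timestamps above into durations. It assumes the first timestamp roughly marks the start of rejoin, which is my reading of the log rather than something the log states outright; the variable names are only for illustration.

```python
from datetime import datetime

# Timestamps quoted in this message.
FMT = "%Y-%m-%d %H:%M:%S.%f"
first_map_update = datetime.strptime("2021-03-18 17:12:52.863", FMT)
last_map_update = datetime.strptime("2021-03-18 17:22:36.356", FMT)
rejoin_finished = datetime.strptime("2021-03-18 17:15:58.325", FMT)

# Assumption: the first map-update timestamp roughly marks the start of rejoin.
rejoin_seconds = (rejoin_finished - first_map_update).total_seconds()
map_update_seconds = (last_map_update - first_map_update).total_seconds()
after_rejoin_seconds = (last_map_update - rejoin_finished).total_seconds()

print(f"rejoin took about        {rejoin_seconds:.3f} s")
print(f"map updates spanned      {map_update_seconds:.3f} s")
print(f"map updates after rejoin {after_rejoin_seconds:.3f} s")
```

So the rejoin is about 3 minutes, but the map updates go on for almost 10 minutes in total, most of that after rejoin had already finished.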

So, if this is related to Paxos (with too many changes in between), maybe
pinning some subtrees/directories to a particular MDS rank would help too.
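
For reference, a minimal sketch of what pinning looks like from a client. CephFS export pins are set through the ceph.dir.pin extended attribute on a directory of a mounted file system; the helper name, the injectable setxattr parameter, and the path in the comment are hypothetical and only there for illustration. Whether pinning actually helps with the delay above is a guess on my part.

```python
import os

# CephFS export-pin extended attribute, set on a directory of a mounted CephFS.
PIN_XATTR = "ceph.dir.pin"


def pin_directory(path, rank, setxattr=None):
    """Pin the subtree rooted at `path` to MDS `rank`; rank -1 removes the pin.

    `setxattr` is injectable so the helper can be exercised without a CephFS
    mount; by default it uses os.setxattr, which is available on Linux only.
    """
    if rank < -1:
        raise ValueError("rank must be >= -1")
    if setxattr is None:
        setxattr = os.setxattr
    setxattr(path, PIN_XATTR, str(rank).encode("ascii"))


# Example (hypothetical mount point, needs a real CephFS mount):
#   pin_directory("/mnt/cephfs/projects", 1)
```

This is the same thing as running setfattr -n ceph.dir.pin -v 1 on the directory.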

 Indeed rejoin is always the longest step -- even with cephfs_metadata
 on SSDs. These MDSs had the cache limit set to 8GB, and you can see
 that the rejoining MDS needed 56GB while booting.

 I haven't had a chance to test the rejoin/openfiletables optimizations
 yet. (https://github.com/ceph/ceph/pull/37383)
 But I had understood that this is intended to decrease that rejoin
 memory usage -- will it also speed things up?

 -- dan
 _______________________________________________
 Dev mailing list -- dev(a)ceph.io
 To unsubscribe send an email to dev-leave(a)ceph.io
