The MDS has its own journal, so I guess journal replay is the cause of the
slow recovery. And maybe the re-election of the authoritative MDS for a
particular inode/subtree through the Paxos service also contributes to the
slowdown.
Have you tried setting up a standby-replay MDS for each active one?
I mean not a plain standby, but a standby-replay MDS, which actively
replays the journal from the active MDS.
That might help mitigate the no-service time.
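In case it helps, standby-replay can be enabled per filesystem on Nautilus;
a minimal sketch, assuming the filesystem is named cephfs (adjust to yours):

# Let spare MDS daemons follow the active ranks as standby-replay
ceph fs set cephfs allow_standby_replay true

# Check that a standby-replay daemon now shadows each active rank
ceph fs status cephfs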
On Wed, Mar 31, 2021 at 4:13 PM Yongseok Oh <yongseok.oh(a)linecorp.com>
wrote:
Hi Ceph users and developers,
[Apologies if this is a duplicate. I posted the same content to ceph-users
two days ago, but it has not appeared there so far.]
The goal of this post is to find a solution to the slow MDS recovery/rejoin
process. Our group at LINE has provided a shared file system service based
on CephFS to K8S and OpenStack users since last year. The performance and
functionality of CephFS are good for us, whereas MDS availability is not
acceptable. In practice, restarting the active MDS(s) is needed for reasons
such as version upgrades, daemon crashes, and parameter changes. In our
experience, this takes from a few minutes to tens of minutes with hundreds
of sessions and two active MDSs in a cluster, where mds_cache_memory_limit
is set to more than 16GB. So we hesitate to restart the MDS, and our
customers are also not satisfied with this situation.
To analyze and reproduce the slow MDS recovery process, some experiments
have been conducted using our test environment as described below.
- CentOS 7.9 kernel 3.10.0-1160.11.1.el7.x86_64
- Nautilus 14.2.16
- mds_cache_memory_limit 16GB
- MDS (two active MDSs, two standby-replay MDSs, one single MDS)
- OSD (20 OSDs, each OSD maintains a 100GB virtual disk)
- 500 sessions using the kernel driver (among them, only 50 sessions
generating workloads are considered active clients, while the other
sessions are just mounted and rarely issue disk stat requests)
- The VDbench tool is employed to generate metadata-intensive workloads
In this experiment, each session has its own subvolume allocated for
testing. VDbench for each session is configured with depth=1, width=10,
files=24576, filesize=4K, and elapsed=10800s. Each VDbench instance on a
session first creates directories and files. Once the predefined numbers of
directories and files have been created, it randomly issues getattr,
create, and unlink operations for some hours.
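For reference, the per-session workload corresponds roughly to a VDbench
file-system parameter file like the sketch below; the anchor path, thread
counts, and exact operation keywords (VDbench spells unlink as delete) are
assumptions and should be checked against the VDbench documentation.

# Hypothetical per-session parameter file; paths and thread counts are illustrative
cat > cephfs_meta.vdb <<'EOF'
fsd=fsd1,anchor=/mnt/cephfs/<subvolume>,depth=1,width=10,files=24576,size=4k
fwd=fwd1,fsd=fsd1,operation=create,threads=8
fwd=fwd2,fsd=fsd1,operation=getattr,threads=8
fwd=fwd3,fsd=fsd1,operation=delete,threads=8
rd=rd1,fwd=fwd*,fwdrate=max,format=yes,elapsed=10800,interval=10
EOF
./vdbench -f cephfs_meta.vdb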
While VDbench is running, our restart test procedure, written in Python,
restarts an active MDS when the average numbers of inodes and caps are
greater than 4,915,200 and 1,000,000, respectively. Recovery times from
stopping an active MDS until it becomes active again were measured as
listed below (a rough shell sketch of this restart-and-measure loop follows
the table). The results show that the numbers vary from a few minutes to
tens of minutes across runs.
recovery_count, recovery_time(s)
1, 1557
2, 1386
3, 846
4, 1012
5, 1119
6, 1272
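For context, the measurement loop amounts to something like the following
minimal shell sketch; the systemd unit name, the polling via ceph mds stat,
and the sleep intervals are assumptions, not our exact Python code.

#!/bin/sh
# Sketch: restart the MDS on this host and time how long until no rank
# is left in a recovery state. Assumes the mon notices the failover
# within the initial sleep.
start=$(date +%s)
systemctl restart ceph-mds@$(hostname -s)
sleep 5   # give the monitors a moment to register the failover
while ceph mds stat | grep -Eq 'up:(reconnect|replay|resolve|rejoin|clientreplay)'; do
    sleep 5
done
echo "recovery_time(s): $(( $(date +%s) - start ))"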
A few things I have analyzed:
- The rejoin phase consumes a considerable amount of time. That's a known
issue. (Sometimes the MDS respawned; increasing mds_heartbeat_grace doesn't
help.)
- Reducing caps and mds_cache_memory_limit has some impact on recovery
performance, but as the inode/dentry caches shrink, reply latencies
increase sharply.
- If all clients use ceph-fuse, the recovery time can be several times
longer than with the kernel driver.
- Even though only one active MDS is restarted, the other MDS sometimes
restarts abruptly as well.
- Dropping the cache (e.g., ceph daemon mds.$(hostname) cache drop) helps
reduce the recovery time, but it takes a few hours under active client
workloads, and latency spikes are inevitable (a command sketch follows this
list).
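For completeness, these are the knobs referred to above; the 8GB value and
the 600-second recall timeout are illustrative assumptions, not our
production settings.

# Lower the MDS cache target at runtime (value in bytes; 8GB is only an example)
ceph config set mds mds_cache_memory_limit 8589934592

# Ask the local MDS to trim its cache; the optional argument is a
# recall timeout in seconds for clients holding caps
ceph daemon mds.$(hostname) cache drop 600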
A few of my questions:
- Why does MDS recovery take so long despite a graceful restart? (What does
the time mainly depend on?)
- Are there any solutions to this problem with many active sessions and a
large MDS cache size? (A recovery time of 1~2 minutes would satisfy us;
that's our target value.)
- How can we make it deterministic in less than a few minutes?
Thanks
Yongseok
_______________________________________________
Dev mailing list -- dev(a)ceph.io
To unsubscribe send an email to dev-leave(a)ceph.io