Two standby-replay daemons have been deployed alongside the active daemons in my test cluster. This can help reduce replay time.
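
For reference, a minimal sketch of how this can be checked (the filesystem name cephfs is an assumption; substitute your own):

    # Each active rank should show a corresponding standby-replay daemon
    # following it and replaying its journal.
    ceph fs status cephfs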

-----Original Message-----
From: "Zizon Qiu"<zzdtsv@gmail.com>
To: "Yongseok Oh"<yongseok.oh@linecorp.com>;
Cc: "dev"<dev@ceph.io>;
Sent: 2021. 3. 31. (Wed) 23:23 (GMT+09:00)
Subject: Re: Too slow CephFS MDS Restart (Recovery) Performance with Many Sessions and Large Cache Size
 

The MDS has its own journal, so I guess that is the cause of the slow recovery.
And maybe the re-election of the authoritative MDS for a particular inode/subtree via the Paxos service also contributes to the slowdown.
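
(As a rough, hedged sketch: one way to see how much journal the MDS carries is to compare the mds_log perf counters with the segment limit; the daemon id "a" below is just a placeholder.)

    # How many log segments is the MDS holding vs. the configured limit?
    ceph daemon mds.a config get mds_log_max_segments
    ceph daemon mds.a perf dump mds_log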
 
Have you tried setting up a standby-replay MDS for each active one?
I mean not a plain standby, but a standby-replay MDS, which actively replays the journal from its active MDS.
That might help mitigate the no-service time?
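
(A minimal sketch of enabling this on Nautilus, assuming the filesystem is called cephfs:)

    # Any available standby daemons will then follow the active ranks
    # and continuously replay their journals.
    ceph fs set cephfs allow_standby_replay true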
 
 

On Wed, Mar 31, 2021 at 4:13 PM Yongseok Oh <yongseok.oh@linecorp.com> wrote: 
Hi Ceph users and developers,

[Apologies if this is duplicated. I posted the same content to ceph-users two days ago, but it has not appeared so far.]

The objective of this post is to find a solution to the slow MDS recovery/rejoin process. Our group at LINE has provided a shared file system service based on CephFS to K8S and OpenStack users since last year. The performance and functionality of CephFS are good for us, whereas MDS availability is not. In reality, restarting active MDS(s) is needed for reasons such as version upgrades, daemon crashes, and parameter changes. In our experience, a restart takes from a few minutes to tens of minutes with hundreds of sessions and two active MDSs in a cluster where mds_cache_memory_limit is set to more than 16GB. So we hesitate to restart the MDS, and our customers are also not satisfied with this situation.

To analyze and reproduce the slow MDS recovery process, some experiments have been conducted using our test environment as described below.
- CentOS 7.9 kernel 3.10.0-1160.11.1.el7.x86_64
- Nautilus 14.2.16
- mds_cache_memory_limit 16GB (see the example commands after this list)
- MDS (two active MDSs, two standby-replay MDSs, one single MDS)
- OSD (20 OSDs, each OSD maintains a 100GB virtual disk)
- 500 sessions using the kernel driver (among them, only 50 sessions generating workloads are considered active clients, while the other sessions are just mounted and rarely issue disk stat requests)
- The VDbench tool is employed to generate metadata-intensive workloads
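
For concreteness, a hedged sketch of the corresponding commands on Nautilus (the filesystem name cephfs is an assumption; 16GB is written out in bytes):

    # 16GB cache limit per MDS (value in bytes)
    ceph config set mds mds_cache_memory_limit 17179869184
    # Two active ranks for the filesystem
    ceph fs set cephfs max_mds 2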

In this experiment, each session has its own subvolume allocated for testing. VDbench for each session is configured with depth=1, width=10, files=24576, filesize=4K, and elapsed=10800s. Each VDbench instance on a session first creates directories and files. Once the predefined numbers of directories and files have been created, it randomly issues getattr, create, and unlink operations for some hours.
While VDbench is running, our restart test procedure written in Python restarts an active MDS once the average numbers of inodes and caps exceed 4,915,200 and 1,000,000, respectively. Recovery times from stopping an active MDS until it becomes active again were measured as listed below (see the measurement sketch after the table). The results show that the numbers vary from a few minutes to tens of minutes across runs.

recovery_count, recovery_time(s)
1, 1557
2, 1386
3, 846
4, 1012
5, 1119
6, 1272
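
For reference, a rough shell sketch of the measurement loop (our actual procedure is written in Python; the rank number and regex below are simplifying assumptions, and failing a rank is only an approximation of a full daemon restart):

    # Fail rank 0 so a standby takes over, then time how long it takes
    # for the rank to report up:active again.
    start=$(date +%s)
    ceph mds fail 0
    until ceph mds stat | grep -q '0=[^,]*=up:active'; do
        sleep 1
    done
    echo "recovery_time(s): $(( $(date +%s) - start ))"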

A few things I have analyzed:
- The rejoin phase consumes a considerable amount of time; that's a known issue. (Sometimes the MDS respawned during recovery. Increasing mds_heartbeat_grace doesn't help.)
- Reducing caps and mds_cache_memory_limit has some impact on recovery performance. However, as the inode/dentry caches are reduced, reply latencies increase sharply.
- If all clients are using ceph-fuse, the recovery time can be several times longer than with the kernel driver.
- Even though only one active MDS is restarted, the other MDS is sometimes also abruptly restarted.
- Dropping the cache (e.g., ceph daemon mds.$(hostname) cache drop) helps reduce the recovery time, but it takes a few hours under active client workloads, and latency spikes are inevitable (see the sketch after this list).
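
A hedged sketch of the pre-restart cache trimming mentioned above (the 600-second timeout is only an example value):

    # Ask the MDS to trim its cache and recall client caps before a
    # planned restart.
    ceph daemon mds.$(hostname) cache drop 600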

A few of my questions:
- Why does MDS recovery take so long despite a graceful restart? (What does the time mainly depend on?)
- Are there solutions to this problem with many active sessions and a large MDS cache size? (A recovery time of 1~2 minutes would be satisfactory; that's our target value.)
- How can we make it complete deterministically in less than a few minutes?

Thanks

Yongseok
_______________________________________________
Dev mailing list -- dev@ceph.io
To unsubscribe send an email to dev-leave@ceph.io