Hi Patrick,
The average cap count per session varies from 50K to 250K, while the total number of caps in the MDS ranges up to roughly 4 million. I will collect MDS logs with debug_mds = 2, but it will take several days because I need to check our security requirements first. Once the logs are available, I will let you know.
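For reference, the per-session cap counts and the MDS-wide total can
be read from the MDS admin socket; roughly something like the
following, assuming it is run on the MDS host:

    # per-session cap counts (see "num_caps" in each session entry)
    ceph daemon mds.$(hostname) session ls | grep -E '"id"|"num_caps"'
    # total caps currently held by this MDS
    ceph daemon mds.$(hostname) perf dump mds_mem | grep '"cap"'
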
I will try to upgrade the Ceph cluster version as you suggested. Thanks
Yongseok

-----Original Message-----
From: "Patrick Donnelly"<pdonnell@redhat.com>
To: "Yongseok Oh"<yongseok.oh@linecorp.com>;
Cc: "dev"<dev@ceph.io>;
Sent: 2021. 4. 1. (Thu) 01:46 (GMT+09:00)
Subject: Re: Too slow CephFS MDS Restart (Recovery) Performance with Many Sessions and Large Cache Size
 

Hello Yongseok,

On Wed, Mar 31, 2021 at 1:13 AM Yongseok Oh <yongseok.oh@linecorp.com> wrote:
>
> Hi Ceph users and developers,
>
> [Apologies if duplicated. I posted the same content to ceph-users two days ago. But it cannot be seen so far.]
>
> The objective of this post is to find a solution to the slow MDS recovery/rejoin process. Our group at LINE has provided a shared file system service based on CephFS to K8S and OpenStack users since last year. The performance and functionality of CephFS are good for us, but MDS availability is not acceptable. In practice, restarting the active MDS(s) is needed for reasons such as version upgrades, daemon crashes, and parameter changes. In our experience, this takes from a few minutes to tens of minutes with hundreds of sessions and two active MDSs in a cluster where mds_cache_memory_limit is set to more than 16GB. So we hesitate to restart the MDS, and our customers are also not satisfied with this situation.
>
> To analyze and reproduce the slow MDS recovery process, some experiments have been conducted using our test environment as described below.
> - CentOS 7.9 kernel 3.10.0-1160.11.1.el7.x86_64
> - Nautilus 14.2.16

Please try upgrading to v14.2.17+. The recall/cache config defaults
were updated, which may help some.
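
To see what you are currently running with, something like this should
dump the effective recall/trim settings on one of your MDS daemons
(option name patterns as of Nautilus; treat it as a sketch):

    # inspect current recall / cache trimming settings on a running MDS
    ceph daemon mds.$(hostname) config show | grep -E 'mds_recall|mds_cache_trim'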

> - mds_cache_memory_limit 16GB
> - MDS (two active MDSs, two standby-replay MDSs, one single MDS)
> - OSD (20 OSDs, each OSD maintains a 100GB virtual disk)
> - 500 sessions using the kernel driver (among them, only the 50 sessions generating workloads are considered active clients, while the other sessions are just mounted and rarely issue disk stat requests)
> - The VDbench tool is used to generate metadata-intensive workloads
>
> In this experiment, each session has its own subvolume allocated for testing. VDbench for each session is configured with depth=1, width=10, files=24576, filesize=4K, and elapsed=10800s. Each VDbench instance first creates its directories and files. Once the predefined numbers of directories and files have been created, it randomly issues getattr, create, and unlink operations for several hours.
> During the VDbench run, our restart test procedure, written in Python, restarts an active MDS when the average numbers of inodes and caps exceed 4,915,200 and 1,000,000, respectively. The recovery time from stopping an active MDS until an MDS becomes active again was measured as listed below. The results show that the times vary from a few minutes to tens of minutes across runs.
>
> recovery_count, recovery_time(s)
> 1, 1557
> 2, 1386
> 3, 846
> 4, 1012
> 5, 1119
> 6, 1272

What is the average cap count per session?

> A few things I have analyzed
> - The rejoin process consumes a considerable amount of time. That's a known issue. (Sometimes the MDS respawns; increasing mds_heartbeat_grace doesn't help.)

Please turn up logging to:

debug_mds = 5

to get an idea what the MDS is doing when respawn occurs.
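
If it saves a step, that can usually be applied at runtime rather than
via ceph.conf; roughly, with the centralized config:

    # raise MDS debug logging cluster-wide (remember to revert later)
    ceph config set mds debug_mds 5

or via injectargs:

    ceph tell mds.* injectargs '--debug_mds=5'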

> - Reducing caps and mds_cache_memory_limit has some impact on recovery performance. However, as the inode/dentry caches are reduced, reply latencies increase sharply.
> - If all clients are using ceph-fuse, the recovery time may be several times longer than with the kernel driver.
> - Even though only one active MDS is restarted, the other MDS is sometimes also abruptly restarted.

This is also curious; the added debug logging may help diagnose it.

> - Dropping the cache (e.g., ceph daemon mds.$(hostname) cache drop) helps reduce the recovery time, but it takes a few hours under active client workloads, and latency spikes are inevitable.

Dropping the MDS cache is limited by the ability to recall state from
clients. It can take a significant amount of time (and may not be able
to reach a satisfactory "zero" size).
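
If you do keep using it, the tell variant takes an optional timeout (in
seconds) so the client recall phase gives up rather than waiting
indefinitely; roughly (exact semantics may vary by release):

    # request a cache drop, bounding the recall wait to ~5 minutes
    ceph tell mds.<name> cache drop 300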

> A few of my questions.
> - Why does the MDS recovery take so long despite a graceful restart? (What does the recovery time mainly depend on?)

It may be related to the open file table; there were several
improvements there recently. Some of those fixes have not been (and
will not be) backported to Nautilus/Octopus.
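
If you want a rough sense of how large the open file table has grown,
the per-rank objects live in the metadata pool; a sketch (the pool name
below is just an example, and the entries are stored in omap):

    # open file table objects, one set per MDS rank
    rados -p cephfs_metadata ls | grep openfiles
    # count the entries recorded for rank 0
    rados -p cephfs_metadata listomapkeys mds0_openfiles.0 | wc -l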

> - Are there solutions to this problem with many active sessions and a large MDS cache? (A recovery time of 1~2 minutes would satisfy us; that's our target value.)

This has not been well examined recently with MDSs using large caches
(although 16GB is not at all atypical nowadays), so any solutions will be
speculative. It would help to know what the MDS is doing that's taking
so long. The debugging will help. Upgrading may help.

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D