The MDS has its own journal, so I guess journal replay is the cause of the
slow recovery. And maybe the re-election of the authoritative MDS for a
particular inode/subtree through the Paxos service also contributes to the
slowdown.
Have you tried setting up a standby-replay MDS for each active one?
I mean not a plain standby, but a standby-replay MDS, which actively
replays the journal from the active MDS.
That might help mitigate the no-service time.
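In case it helps, standby-replay can be enabled per filesystem on Nautilus;
a minimal sketch, assuming the filesystem is named cephfs (adjust to yours):

# Let spare MDS daemons follow the active ranks as standby-replay
ceph fs set cephfs allow_standby_replay true

# Check that a standby-replay daemon now shadows each active rank
ceph fs status cephfs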
On Wed, Mar 31, 2021 at 4:13 PM Yongseok Oh <yongseok.oh(a)linecorp.com>
wrote:
Hi Ceph users and developers,
[Apologies if this is a duplicate. I posted the same content to ceph-users
two days ago, but it has not appeared there so far.]
The goal of this post is to find a solution to the slow MDS recovery/rejoin
process. Our group at LINE has provided a shared file system service based
on CephFS to K8S and OpenStack users since last year. The performance and
functionality of CephFS are good for us, whereas MDS availability is not
acceptable. In practice, restarting the active MDS(s) is needed for reasons
such as version upgrades, daemon crashes, and parameter changes. In our
experience, this takes from a few minutes to tens of minutes with hundreds
of sessions and two active MDSs in a cluster, where mds_cache_memory_limit
is set to more than 16GB. So we hesitate to restart the MDS, and our
customers are also not satisfied with this situation.
To analyze and reproduce the slow MDS recovery process, some experiments
have been conducted using our test environment as described below.
- CentOS 7.9 kernel 3.10.0-1160.11.1.el7.x86_64
- Nautilus 14.2.16
- mds_cache_memory_limit 16GB
- MDS (two active MDSs, two standby-replay MDSs, one single MDS)
- OSD (20 OSDs, each OSD maintains a 100GB virtual disk)
- 500 sessions using the kernel driver (among them, only 50 sessions
generating workloads are considered active clients, while the other
sessions are just mounted and rarely issue disk stat requests)
- The VDbench tool is employed to generate metadata-intensive workloads
In this experiment, each session has its own subvolume allocated for
testing. VDbench for each session is configured with depth=1, width=10,
files=24576, filesize=4K, and elapsed=10800s. Each VDbench instance on a
session first creates directories and files. Once the predefined numbers of
directories and files have been created, it randomly issues getattr,
create, and unlink operations for some hours.
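For reference, the per-session workload corresponds roughly to a VDbench
file-system parameter file like the sketch below; the anchor path, thread
counts, and exact operation keywords (VDbench spells unlink as delete) are
assumptions and should be checked against the VDbench documentation.

# Hypothetical per-session parameter file; paths and thread counts are illustrative
cat > cephfs_meta.vdb <<'EOF'
fsd=fsd1,anchor=/mnt/cephfs/<subvolume>,depth=1,width=10,files=24576,size=4k
fwd=fwd1,fsd=fsd1,operation=create,threads=8
fwd=fwd2,fsd=fsd1,operation=getattr,threads=8
fwd=fwd3,fsd=fsd1,operation=delete,threads=8
rd=rd1,fwd=fwd*,fwdrate=max,format=yes,elapsed=10800,interval=10
EOF
./vdbench -f cephfs_meta.vdb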
While VDbench is running, our restart test procedure, written in Python,
restarts an active MDS when the average numbers of inodes and caps are
greater than 4,915,200 and 1,000,000, respectively. Recovery times from
stopping an active MDS until it becomes active again were measured as
listed below (a rough shell sketch of this restart-and-measure loop follows
the table). The results show that the numbers vary from a few minutes to
tens of minutes across runs.
recovery_count, recovery_time(s)
1, 1557
2, 1386
3, 846
4, 1012
5, 1119
6, 1272
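For context, the measurement loop amounts to something like the following
minimal shell sketch; the systemd unit name, the polling via ceph mds stat,
and the sleep intervals are assumptions, not our exact Python code.

#!/bin/sh
# Sketch: restart the MDS on this host and time how long until no rank
# is left in a recovery state. Assumes the mon notices the failover
# within the initial sleep.
start=$(date +%s)
systemctl restart ceph-mds@$(hostname -s)
sleep 5   # give the monitors a moment to register the failover
while ceph mds stat | grep -Eq 'up:(reconnect|replay|resolve|rejoin|clientreplay)'; do
    sleep 5
done
echo "recovery_time(s): $(( $(date +%s) - start ))"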
A few things I have analyzed:
- The rejoin phase consumes a considerable amount of time. That's a known
issue. (Sometimes the MDS respawned; increasing mds_heartbeat_grace doesn't
help.)
- Reducing caps and mds_cache_memory_limit has some impact on recovery
performance, but as the inode/dentry caches shrink, reply latencies
increase sharply.
- If all clients use ceph-fuse, the recovery time can be several times
longer than with the kernel driver.
- Even though only one active MDS is restarted, the other MDS sometimes
restarts abruptly as well.
- Dropping the cache (e.g., ceph daemon mds.$(hostname) cache drop) helps
reduce the recovery time, but it takes a few hours under active client
workloads, and latency spikes are inevitable (a command sketch follows this
list).
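For completeness, these are the knobs referred to above; the 8GB value and
the 600-second recall timeout are illustrative assumptions, not our
production settings.

# Lower the MDS cache target at runtime (value in bytes; 8GB is only an example)
ceph config set mds mds_cache_memory_limit 8589934592

# Ask the local MDS to trim its cache; the optional argument is a
# recall timeout in seconds for clients holding caps
ceph daemon mds.$(hostname) cache drop 600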
A few of my questions:
- Why does MDS recovery take so long despite a graceful restart? (What does
the time mainly depend on?)
- Are there any solutions to this problem with many active sessions and a
large MDS cache size? (A recovery time of 1~2 minutes would satisfy us;
that's our target value.)
- How can we make it deterministic in less than a few minutes?
Thanks
Yongseok
_______________________________________________
Dev mailing list -- dev(a)ceph.io
To unsubscribe send an email to dev-leave(a)ceph.io