I am trying to copy the contents of our storage server into a CephFS,
but am experiencing stability issues with my MDSs. The CephFS sits on
top of an erasure-coded data pool; the cluster has 5 MONs, 5 MDSs and a
max_mds setting of two. The cluster runs Nautilus, the client runs
Mimic and uses the kernel module to mount the FS.
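For context, the setup was created roughly along these lines (pool and
FS names, PG counts and the EC profile below are placeholders, not the
real ones):

  # EC data pool with overwrites enabled, plus a replicated metadata pool
  ceph osd pool create cephfs_data 1024 1024 erasure my-ec-profile
  ceph osd pool set cephfs_data allow_ec_overwrites true
  ceph osd pool create cephfs_metadata 256 256 replicated
  ceph fs new cephfs cephfs_metadata cephfs_data --force
  ceph fs set cephfs max_mds 2

  # kernel client mount on the Mimic machine
  mount -t ceph mon1,mon2,mon3:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret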
The index of filenames to copy is about 23GB and I am using 16 parallel
rsync processes over a 10G link to copy the files over to Ceph. This
works perfectly for a while, but then the MDSs start reporting oversized
caches (between 20 and 50GB, sometimes more) and an inode count between
1 and 4 million. The inode count in particular seems quite high to me:
each rsync job works on 25k files, so even if all 16 processes had all
their files open at the same time, I should not exceed 400k. Even if I
double that number to account for the client's page cache, I should get
nowhere near that many inodes (a sync flush takes about 1 second).
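In case it matters, those cache and inode numbers come from the cluster
health warnings and the MDS admin socket (mds.XXX stands for the actual
daemon name):

  # cluster-wide view; this is where the oversized-cache warnings show up
  ceph health detail

  # per-daemon view, run on the MDS host
  ceph daemon mds.XXX cache status
  ceph daemon mds.XXX perf dump mds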
Then after a few hours, my MDSs start failing with messages like this:
-21> 2019-07-22 14:00:05.877 7f67eacec700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
-20> 2019-07-22 14:00:05.877 7f67eacec700 0 mds.beacon.XXX Skipping beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal heartbeat is not healthy!
The standby nodes try to take over, but take a very long time to become
active and eventually fail as well.
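I am watching the failover with something like the following (again,
mds.XXX is a placeholder); the standbys sit in the intermediate states
for a long time before they either go active or die:

  watch -n 5 'ceph fs status; ceph mds stat'

  # state of an individual daemon, run on its host
  ceph daemon mds.XXX status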
During my research, I found this related topic:
but I tried everything suggested in there, from increasing and lowering
my cache size to changing the number of log segments, etc. I also played
around with the number of active MDSs; two appears to work best, whereas
one cannot keep up with the load and three seems to be the worst choice
of all.
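Concretely, these are the kinds of knobs I have been turning (the values
are just examples of what I tried, and "cephfs" stands in for the real
FS name):

  # MDS cache size, tried both above and below the default
  ceph config set mds mds_cache_memory_limit 17179869184   # 16 GiB

  # journal segment count
  ceph config set mds mds_log_max_segments 256

  # number of active ranks
  ceph fs set cephfs max_mds 2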
Do you have any ideas how I can improve the stability of my MDS daemons
so they handle this load properly? A single 10G link is a toy and we
could hit the cluster with far more requests per second, but it is
already giving way under just 16 rsync processes.