Alright, I did some further research and found this topic which seems to
be about the same problem:
We have many small files (as I said, the file list alone is 23GB), and
since we are only copying them and not accessing them afterwards, the
clients keep piling up capabilities. That at least explains the growing
cache sizes and the MDSs' failure to keep up. (There should definitely
be a solution to this; batch-copying many small files to a CephFS is a
pretty standard use case.)
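In case it helps anyone following along, this is roughly how I am
checking the per-client cap counts and how I would try to limit them.
The option and command names are taken from the Nautilus docs and may
not exist on older releases; the limit value is just a guess:

    # list client sessions with their cap counts on the first active rank
    ceph tell mds.0 client ls

    # keep a single client from holding more than ~500k caps
    ceph config set mds mds_max_caps_per_client 500000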
After I increased the beacon grace period, I experienced fewer MDS
crashes (although I still see them flapping occasionally), but now I
have another problem: after too many MDS failures (?), the client
starts locking up and the mount becomes unresponsive. Sometimes it
becomes so unresponsive that I cannot even unmount it with umount -lf
and have to force-reboot the server. While the client is locked up, the
MDSs recover and the FS is accessible again without issues from other
clients. This looks like a bug to me. I tried upgrading the client from
Mimic to Nautilus, but the problem persists.
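The only workaround I can think of short of rebooting would be to cut
the stuck session off from the MDS side. Per the Nautilus docs,
something along these lines should do it (the id is a placeholder,
taken from the client ls output above):

    # evict the stuck session so the MDS drops its caps; note that by
    # default this also blacklists the client on the OSDs, so the mount
    # needs to be redone afterwards
    ceph tell mds.0 client evict id=1234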
I increased the MDS max cache size massively and started the copy job
again; let's see how far it goes this time.
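For reference, the settings I am talking about are roughly these (the
values are what I am experimenting with, not recommendations):

    # much larger MDS cache, value in bytes (here ~64 GiB)
    ceph config set mds mds_cache_memory_limit 68719476736

    # give the MDSs more time before the MONs mark them as laggy/failed
    ceph config set global mds_beacon_grace 240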
On 22.07.19 15:02, Janek Bevendorff wrote:
I am trying to copy the contents of our storage server into a CephFS,
but am experiencing stability issues with my MDSs. The CephFS sits on
top of an erasure-coded data pool; the cluster has 5 MONs, 5 MDSs and a
max_mds setting of two. My Ceph cluster version is Nautilus, the client
runs Mimic and uses the kernel module to mount the FS.
The index of filenames to copy is about 23GB and I am using 16
parallel rsync processes over a 10G link to copy the files over to
Ceph. This works perfectly for a while, but then the MDSs start
reporting oversized caches (between 20 and 50GB, sometimes more) and
an inode count between 1 and 4 million. The inode count in particular
seems quite high to me: each rsync job has 25k files to work with, so
even if all 16 processes opened all their files at the same time, I
should not exceed 400k. Even if I double that number to account for the
client's page cache, I should get nowhere near that many inodes (a sync
flush takes about 1 second).
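To give an idea of the workload, the copy job is roughly equivalent to
something like this (paths and the driver are placeholders; our actual
job script differs):

    # split the file index into 25k-file chunks and run 16 rsyncs at a time
    split -l 25000 filelist.txt chunk_
    ls chunk_* | xargs -P 16 -I{} rsync -a --files-from={} /source/ /mnt/cephfs/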
Then after a few hours, my MDSs start failing with messages like this:
  -21> 2019-07-22 14:00:05.877 7f67eacec700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
  -20> 2019-07-22 14:00:05.877 7f67eacec700 0 mds.beacon.XXX Skipping beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal heartbeat is not healthy!
The standby nodes try to take over, but take forever to become active
and eventually fail as well.
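While this is going on, the rank states can be watched with something
like:

    watch -n 2 ceph fs status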
During my research, I found this related topic:
but I tried everything suggested in there, from increasing and lowering
my cache size to changing the number of log segments, etc. I also
played around with the number of active MDSs: two appears to work best,
whereas one cannot keep up with the load and three seems to be the
worst of all choices.
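For completeness, these are the kinds of knobs I mean (the values are
examples only, option names per the Nautilus docs):

    # how many journal segments an MDS may keep before trimming
    ceph config set mds mds_log_max_segments 256

    # number of active MDS ranks for the file system
    ceph fs set <fsname> max_mds 2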
Do you have any ideas how I can improve the stability of my MDS daemons
so they handle the load properly? A single 10G link is a toy and we
could query the cluster with a lot more requests per second, but it is
already buckling under 16 rsync processes.