I just wanted to give you some feedback about how 14.2.5 is working for
me. I've had the chance to test it for a day now and overall, the
experience is much better, although not perfect (perhaps far from it).
I have two active MDS (I figured that would spread the metadata load a
little, and it seems to work pretty well for me). After the upgrade to the
new release, I removed all special recall settings, so my MDS config is
basically at the defaults. The only things I set are
mds_max_caps_per_client = 200k, mds_cache_reservation = 0.1 and
mds_cache_memory_limit = 40G.
Right now, everything seems to be running smoothly, although I notice
that the max cap setting isn't fully honoured. The overall cache size
seems fairly constant at 15M (for mds.0, mds.1 a little less), but the
client cap count can easily exceed 10M if I run something like `find` on
a large directory.
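A quick way to see which clients are over the cap limit is to compare the per-session cap counts against mds_max_caps_per_client. The following is only a minimal sketch: it assumes the session list is available as JSON (for example from `ceph tell mds.<id> session ls`) and that each entry has `id` and `num_caps` fields. The sample data and the helper name `clients_over_limit` are made up for illustration.

```python
import json

MAX_CAPS_PER_CLIENT = 200_000  # the mds_max_caps_per_client value mentioned above


def clients_over_limit(sessions, limit=MAX_CAPS_PER_CLIENT):
    """Return (client id, num_caps) pairs above the limit, largest first."""
    over = [
        (s["id"], s["num_caps"])
        for s in sessions
        if s.get("num_caps", 0) > limit
    ]
    return sorted(over, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample standing in for real session list JSON output.
SAMPLE = json.loads("""
[
  {"id": 4101, "num_caps": 10250000},
  {"id": 4102, "num_caps": 5200},
  {"id": 4103, "num_caps": 310000}
]
""")

if __name__ == "__main__":
    for client_id, caps in clients_over_limit(SAMPLE):
        ratio = caps / MAX_CAPS_PER_CLIENT
        print(f"client {client_id}: {caps} caps ({ratio:.1f}x the limit)")
```

Running something like this periodically during a large `find` would show whether the cap count of a single client keeps growing past the configured limit or is eventually recalled.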
We have one particularly problematic folder containing about 400
subfolders holding a total of about 35M files among them. My first attempts
at running `find -type d` on those had the weird effect that after
pretty much exactly 2M caps, mds.1 got killed and replaced by a standby.
Fortunately, the standby managed to take over in a matter of seconds
(sometimes up to a few minutes) resetting the cap count to about 5k. The
same thing then happened again once the new MDS reached the magical 2M
caps. I would suppose that this is still the same problem as before, but
with the huge improvement that the take-over standby MDS can actually
recover. Previously, it would just die the same way after a minute or
two of futile recovery attempts and the FS would be down indefinitely
until I deleted the openfiles object.
Right now, I cannot reproduce the crash any more. The caps do surge to
10-15M, but there is no crash. However, I keep seeing the dreaded "client
failing to respond to cache pressure" message occasionally. So far, the
MDS have been able to keep up and reduce the number of caps once they
reach about 15M, so the message disappears after a while and the cap
count growth isn't entirely unbounded. I ran a `find -type d` on the
most problematic folder and attached two perf dumps for you (current cap
count on the client: 14660568):
P.S. Just as I was finishing this email, the rank 0 MDS actually
crashed. Unfortunately, I didn't have increased debug levels enabled, so
its death note is rather uninformative:
2019-12-17 09:42:12.325 7f7633dde700 1 mds.deltaweb011 Updating MDS map to version 103112 from mon.3
2019-12-17 09:43:27.774 7f7633dde700 1 mds.deltaweb011 Updating MDS map to version 103113 from mon.3
2019-12-17 09:43:40.086 7f7633dde700 1 mds.deltaweb011 Updating MDS map to version 103114 from mon.3
2019-12-17 09:44:46.203 7f7633dde700 -1 *** Caught signal (Aborted) **
in thread 7f7633dde700 thread_name:ms_dispatch
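For anyone who wants to pull such abort notices out of a longer MDS log, here is a rough Python sketch. It assumes the log layout matches the excerpt above (timestamp, thread id, level -1, then the "Caught signal" text, with the "in thread" part possibly on the following line). The function name `find_aborts` is hypothetical.

```python
import re

# Matches the abort notice; the "in thread" part may sit on the next line.
ABORT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \S+\s+-1 "
    r"\*\*\* Caught signal \((?P<signal>[^)]+)\) \*\*\s+"
    r"in thread (?P<thread>[0-9a-f]+) thread_name:(?P<name>\S+)",
    re.MULTILINE,
)


def find_aborts(log_text):
    """Return one dict (ts, signal, thread, name) per abort notice found."""
    return [m.groupdict() for m in ABORT_RE.finditer(log_text)]


# Two lines taken from the excerpt above, used here as sample input.
LOG = """\
2019-12-17 09:43:40.086 7f7633dde700 1 mds.deltaweb011 Updating MDS map to version 103114 from mon.3
2019-12-17 09:44:46.203 7f7633dde700 -1 *** Caught signal (Aborted) **
 in thread 7f7633dde700 thread_name:ms_dispatch
"""

if __name__ == "__main__":
    for abort in find_aborts(LOG):
        print(f'{abort["ts"]}: {abort["signal"]} in thread {abort["name"]}')
```

This only locates the notice itself; without higher debug levels there is no backtrace context to extract.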
Also, this time around the recovery appears to be a lot more
problematic, so I'm afraid I have to repeat the previous procedure of
deleting the openfiles object to get it back up. I don't think my
`find` alone would have crashed the MDS, but if another client is doing
similar things at the same time, it overloads the MDS.