I had two MDS nodes. One was still active, but the other was stuck rejoining, which already caused the FS to hang (i.e. it was down, yes). Since I thought at first that this was the old cache size bug, I deleted the open files objects, and when that didn't seem to have an effect, I tried restarting the MDS nodes, after which both were seemingly stuck rejoining.

The main difference from the old bug was that the MDSs were still able to send beacons fast enough, so they weren't killed and could eventually recover, but it took a long time (both for that to happen and for me to realise it). In between, I also tried failing all MDSs and spawning an entirely new one from a clean slate, but I had the same problem there, so eventually I just waited and it worked.

I figured it was trimming the cache, but I have no idea why or where that cache came from. While the FS was down, I unmounted all 300+ clients, but until full recovery, "ceph fs status" would still claim that all of them were connected, which was obviously not true.

On 7 Jan 2020 2:43 pm, Stefan Kooman <stefan@bit.nl> wrote:

Quoting Janek Bevendorff (janek.bevendorff@uni-weimar.de):
> Update: turns out I just had to wait for an hour. The MDSs were sending
> Beacons regularly, so the MONs didn't try to kill them and instead let
> them finish doing whatever they were doing.
>
> Unlike the other bug where the number of open files outgrows what the
> MDS can handle, this incident allowed "self-healing", but I still
> consider this a severe bug.

Just to get this straight: was your fs offline during this time? Do you
have any idea why it was busy trimming its cache (because that was what
it was doing, right)?

Gr. Stefan

--
| BIT BV https://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / info@bit.nl