Allowing cephfs clients to reconnect - ceph-users

13 Nov 2019

Hi, I'm trying to make our system a bit more fault tolerant, and I struggle a bit with
letting clients reconnect if they have lost contact for a while.
When there is a temporary network problem, I would like clients to block I/O, wait for a
connection, and resume.
Do I have any options other than just increasing mds_session_autoclose?
Is there a downside to using a very large value here (say, a full day)? I expect all
clients to be connected at all times anyway when things are running normally.

What I see right now (if the disconnect is sufficiently long) is that the ceph client
releases the I/O block, and you get permission denied on all I/O operations on the
existing mount point.
Re-mounting works, but it also requires killing off all active sessions that block
unmounting. Basically, it is just bad overall if this happens, and I would prefer almost
any other option.

I can see that the client tries a reconnect when this happens:
Nov 12 11:53:24 hebbe01-3 kernel: libceph: mds0 10.43.20.3:6800 connection reset
Nov 12 11:53:24 hebbe01-3 kernel: libceph: reset on mds0
Nov 12 11:53:24 hebbe01-3 kernel: ceph: mds0 closed our session
Nov 12 11:53:24 hebbe01-3 kernel: ceph: mds0 reconnect start
Nov 12 11:53:24 hebbe01-3 kernel: ceph: mds0 reconnect denied
Nov 12 11:56:55 hebbe01-3 kernel: libceph: mds0 10.43.20.3:6800 socket closed (con state NEGOTIATING)
Nov 12 11:56:55 hebbe01-3 kernel: ceph: mds0 rejected session
but the logs on the MDS server show that it refuses the reconnect because it is not in
a "reconnect state".
So, if I understand this correctly, reconnecting is only possible when the MDS server
has been rebooted?
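
In case it helps anyone reproduce or monitor this, here is a minimal sketch of how the refused reconnects could be picked out of the kernel log. It only matches the two message strings seen in the excerpt above. The names find_refusals, REFUSAL_RE and SAMPLE_LOG are hypothetical, and reading from an inline string rather than the live journal is an assumption made to keep the example self-contained.

```python
import re

# Lines copied from the kernel log excerpt in this post. A real script would
# read the kernel log from the system journal instead (assumption).
SAMPLE_LOG = """\
Nov 12 11:53:24 hebbe01-3 kernel: libceph: reset on mds0
Nov 12 11:53:24 hebbe01-3 kernel: ceph: mds0 closed our session
Nov 12 11:53:24 hebbe01-3 kernel: ceph: mds0 reconnect start
Nov 12 11:53:24 hebbe01-3 kernel: ceph: mds0 reconnect denied
Nov 12 11:56:55 hebbe01-3 kernel: ceph: mds0 rejected session
"""

# Hypothetical pattern: matches only the two messages from the excerpt that
# show the MDS refusing the client.
REFUSAL_RE = re.compile(r"kernel: ceph: (mds\d+) (reconnect denied|rejected session)")


def find_refusals(log_text):
    """Return (mds_name, message) tuples for each refusal line in log_text."""
    refusals = []
    for line in log_text.splitlines():
        match = REFUSAL_RE.search(line)
        if match:
            refusals.append(match.groups())
    return refusals


if __name__ == "__main__":
    for mds_name, message in find_refusals(SAMPLE_LOG):
        print(f"{mds_name}: {message}")
```

Something like this could feed an alert, so that a stale mount is noticed before users start seeing permission denied errors.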

Best regards, Mikael