On Wed, Nov 13, 2019 at 11:23 PM Mikael Öhman <micketeer(a)gmail.com> wrote:
Hi, I'm trying to make our system a bit more fault tolerant, and I'm struggling a bit
with letting clients reconnect after they have lost contact for a while.
When there is a temporary network problem, I would like clients to block I/O, wait for a
connection, and resume.
Do I have any options other than just increasing mds_session_autoclose?
Is there a downside to using a very large value here (say, a full day)? I expect all
clients to be connected at all times anyway when things are running normally.
What I see right now (if the disconnect is sufficiently long) is that the ceph client
releases the I/O block, and you get permission denied on all I/O operations on the
existing mount point.
Re-mounting works, but it also requires killing off every active session that blocks
unmounting. Basically, it is just bad overall if this happens, and I would prefer almost
any other option.
I can see that the client tries a reconnect when this happens:
Nov 12 11:53:24 hebbe01-3 kernel: libceph: mds0 10.43.20.3:6800 connection reset
Nov 12 11:53:24 hebbe01-3 kernel: libceph: reset on mds0
Nov 12 11:53:24 hebbe01-3 kernel: ceph: mds0 closed our session
Nov 12 11:53:24 hebbe01-3 kernel: ceph: mds0 reconnect start
Nov 12 11:53:24 hebbe01-3 kernel: ceph: mds0 reconnect denied
Nov 12 11:56:55 hebbe01-3 kernel: libceph: mds0 10.43.20.3:6800 socket closed (con state NEGOTIATING)
Nov 12 11:56:55 hebbe01-3 kernel: ceph: mds0 rejected session
but the logs on the MDS server show that it disallows this because it is not in a
"reconnect state".
So, if I understand this correctly, reconnecting is only available in the case that the
MDS server was rebooted?
You can disable mds_session_blacklist_on_evict if consistency is not
your primary concern.
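
As a minimal sketch, assuming the option is set through ceph.conf on the MDS
hosts (the file location and the choice of ceph.conf over the central
configuration are assumptions, so adjust to your deployment), the change
could look like this:

```ini
# ceph.conf on the MDS hosts (assumed location)
[mds]
# Do not blacklist clients when the MDS evicts their session.
mds_session_blacklist_on_evict = false
```

As I understand it, with blacklisting disabled an eviction only takes effect
at the MDS, so an evicted client may still be able to reach the data pools;
that is the consistency trade-off mentioned above.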
> Best regards, Mikael