On Mon, 2020-05-18 at 10:21 -0700, Patrick Donnelly wrote:
On Mon, May 18, 2020 at 9:56 AM Ken Dreyer
<kdreyer(a)redhat.com> wrote:
Hi folks,
I was reading
https://ceph.io/community/automatic-cephfs-recovery-after-blacklisting/
about the new recover_session=clean feature.
The end of that blog post says that this setting involves a trade-off:
"availability is more important than correctness"
Are there cases where the old behavior is really safer than simply
returning errors?
Basically: a frozen (hung mount) or dead (restarted box) application
can't have unintended side-effects. If the application is poorly
written and does not handle I/O errors or does not fsync, then any
undesirable behavior resulting from that may occur after the mount
reconnects.
Right. It's not so much a problem with correctness, but rather that a
misbehaving client could continue to hammer the MDS in this situation.
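To illustrate the point above about applications that do not handle I/O
errors or do not fsync, here is a minimal Python sketch of a write path
that does both. The helper names durable_write and save_or_report are
hypothetical and not part of Ceph or any library. Nothing here is
CephFS-specific; it only assumes that a failed write or fsync on the
mount surfaces to the application as an OSError.

```python
import os


def durable_write(path, data):
    """Write data to path and flush it to stable storage.

    Hypothetical helper for illustration. Raises OSError if the open,
    a write, or the fsync fails, so the caller can stop instead of
    carrying on with data that may never have reached the file system.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            # os.write may write fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Without fsync, a write-back failure may go unnoticed by this process.
        os.fsync(fd)
    finally:
        os.close(fd)


def save_or_report(path, data):
    """Return None on success, or the errno of the failure."""
    try:
        durable_write(path, data)
    except OSError as exc:
        return exc.errno
    return None
```

A caller that gets an errno back from save_or_report should treat the
data as not written, rather than continuing as if the write succeeded.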
It seems like this feature would not make things worse for
applications. Can we make recover_session=clean the default?
There was a proposal for recover_session=strict which would (IIRC)
basically kill any application that had any file descriptor open with
the backend file system. That would probably be the safest default but
also the most intrusive and (perhaps) surprising. Unfortunately, I
think there were implementation issues that blocked it and we tabled
the idea.
Killing tasks from the kernel is somewhat perilous. There's also no
guarantee that it will help anything, particularly if the application is
restarted (à la systemd or something).
Whether or not recover_session=clean should be the default is
undecided. I think we should wait to hear back from the community
testing it before deciding.
Yes. I think we have to proceed with caution when it comes to making
user-visible behavior changes like this.
--
Jeff Layton <jlayton(a)redhat.com>