I posted about some of this a year ago [1] and got some really helpful answers.
Fortunately, I know a lot more now and feel a lot more comfortable with the scenario.
Because I didn’t understand the architecture very well at the time, I held off on
distributing monitors and MDS daemons over a WAN. I want to try that now.
With a hard limit of two machines on the production side of the WAN and a single
monitor/MDS, it’s impossible to upgrade that machine without taking the network down. The
cluster only has a few hundred PGs, 8 OSDs, and a mostly static CRUSH map. WAN latency is
4 ms and there’s a 10GbE link between the production machines, so quorum will be
maintained on the production side in all cases except during an upgrade.
Most importantly, all the OSDs will remain on the production side of the WAN link.
It seems like the worst thing that could happen in the normal state is that the mon/MDS
on the non-prod side of the WAN may be a few clocks behind the quorum on production. In
the upgrade state, one of the two production machines is taken down and quorum spans the
WAN. Performance on the cluster might be slower as a result, but everything should remain
stable as long as the link is stable. Of course, after the upgrade, quorum returns to the
production side and the normal state resumes.
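
To make the quorum reasoning concrete, here is a rough sketch of the majority arithmetic.
It assumes a layout of three monitors in total (two on production, one across the WAN),
which is my reading of the plan rather than something stated outright, and it relies on
the fact that Ceph monitors need a strict majority of the monmap to form quorum. The
function name and scenario labels are purely illustrative.

```python
def has_quorum(total_mons: int, reachable_mons: int) -> bool:
    """Strict majority rule: more than half of all monitors must be reachable."""
    return reachable_mons > total_mons // 2


# Assumed layout (hypothetical): 2 monitors on production, 1 across the WAN.
TOTAL_MONS = 3

# Number of monitors that can still talk to each other in each scenario.
scenarios = {
    "normal, all monitors up": 3,
    "WAN link down, both production mons up": 2,
    "upgrade, one production mon down": 2,
    "upgrade and WAN link down at the same time": 1,
}

results = {name: has_quorum(TOTAL_MONS, up) for name, up in scenarios.items()}

for name, ok in results.items():
    print(f"{name}: {'quorum' if ok else 'NO quorum'}")
```

Under these assumptions, the only case that loses quorum is a WAN outage that coincides
with the upgrade window, which is the window where the stability of the link matters.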
This seems like a reasonable working model to me. Do others see holes in my logic?
Thanks!
Brian
[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-January/032271.html