I did not mean running with a back network that is configured but taken down. Of course
that won't work. What I meant is that you:
1. remove the cluster network definition from the cluster config (ceph.conf and/or ceph
config ...)
2. restart OSDs to apply the change
3. remove the physical network
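A minimal sketch of steps 1 and 2, assuming the cluster network was set centrally via the
ceph config database (the exact section your definition lives in may differ; check first):

```
# Check where the cluster network is currently defined
ceph config get osd cluster_network

# Remove the centralized definition (also delete any cluster_network
# lines from ceph.conf on each host if set there)
ceph config rm osd cluster_network
ceph config rm global cluster_network

# Restart OSDs so they rebind to the public network only,
# one host / failure domain at a time:
systemctl restart ceph-osd.target
```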
Step 2 will most likely require downtime, as you write, because during the transition
some OSDs will think all OSDs listen on 2 networks while other OSDs think everyone is
listening on 1 network. If you can afford to take all clients down and do a full cluster
restart, this is doable.
If you set noout, nodown and pause, and maybe some other flags
(norebalance, nobackfill, norecover), and wait for all client *and* recovery I/O to
complete, it is probably possible to do this transition without disconnecting clients by
just restarting all OSDs failure domain by failure domain.
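Sketched as commands, the flag dance above would look roughly like this (set before the
transition, clear afterwards):

```
# Suppress down-marking, data movement and client I/O
ceph osd set noout
ceph osd set nodown
ceph osd set pause
ceph osd set norebalance
ceph osd set nobackfill
ceph osd set norecover

# ... wait until 'ceph -s' shows no client or recovery I/O,
# then restart OSDs one failure domain at a time ...

# Clear the flags again after the transition
ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset pause
ceph osd unset nodown
ceph osd unset noout
```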
Perhaps temporarily setting mon_osd_min_down_reporters to a large number would help avoid
flapping. I fear at least some [RBD] clients would still experience timeouts / kernel
panics, though.
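For example (the value 100 is arbitrary; pick something larger than the number of OSDs
that could plausibly report a peer down):

```
# Require so many down-reporters that heartbeat failures during
# the transition cannot get OSDs marked down
ceph config set mon mon_osd_min_down_reporters 100

# Restore the default after the transition
ceph config rm mon mon_osd_min_down_reporters
```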
After the transition things should work fine with just 1 network.
In any case, my recommendation would be to keep both networks if they are on different
VLAN IDs. Then nothing special is required to do the transition; this is what I did to
simplify the physical networking (two logical networks on identical physical networking).
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Stefan Kooman <stefan(a)bit.nl>
Sent: 13 May 2020 07:40
To: ceph-users(a)ceph.io
Subject: [ceph-users] Re: Cluster network and public network
On 2020-05-12 18:59, Anthony D'Atri wrote:
I think, however, that a disappearing back
network has no real consequences as the heartbeats always go over both.
FWIW this has not been my experience, at least through Luminous.
What I’ve seen is that when the cluster/replication net is configured but unavailable,
OSD heartbeats fail and peers report them to the mons as down. The mons send out a map
accordingly, and the affected OSDs report “I’m not dead yet!”. Flap flap flap.
+1. This has also been my experience. And it's quite hard to debug as
well (confusing / seemingly contradictory messages).
It uses the back network to replicate data ... and as long as it can't,
(client) IO won't go through.
Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io