The reason
this was particularly annoying is that the switch hardware
also runs a lot of our critical services, some of which also store data
on our internal Ceph cluster, which was itself affected by the
network integrity issue (loooooong heartbeats, as in tens of seconds;
presumed-dead mons; flapping osds; etc.). Narrowing down to a
specific cause was not fun.

Were the osds actually flapping, or were they just being reported down
by osds that had lost connection with them? Curious, as I wouldn't expect
them to actually go down, just be reported down (but then maybe that's
what you meant by 'flapping').

I might have extemporized a bit too far. The mons were going in and out
of quorum; I did not prove they were restarting. As for the osds, I don't
know; all I know for certain is that they were reporting *huge*
osd-to-osd heartbeat times. So the processes may not have actually exited
in either case.