Hi all. We've had problems in the sepia lab again, but we're hoping
that most services are back up and stable. As always, please let us
know of any problems that seem to be infrastructure-related via email
here or to our Slack channel
https://ceph-storage.slack.com/archives/C1HFJ4VTN.
Details:
We had scheduled maintenance on all lab switches on 31 Jan that was,
as always, supposed to be a short interruption with no expected
problems. That turned out not to be the case, but after a few days of
bad connectivity (not missing, just bad: lost packets, retransmissions,
really bursty and inconsistent performance) the network operations team
thinks it's resolved. The cause seems to have been a lack of proper
support for the switches' optical transceivers in the updated firmware
(mitigated by installing a different firmware branch), as well as at
least one virtual chassis with failing internal hardware, which is
being RMAed as of today.
So things should have gotten a *lot* better as of today, but there are
still some systems that will perform worse than normal and will need
another interruption when the replacement hardware arrives. Even those
hosts remain connected; they're just suffering from poor packet
integrity. They include a few of the Ceph cluster hosts, a few test
hosts, and a couple of others. Even so, the cluster has been pretty
happy since we reloaded the firmware branch we settled on.
The reason this was particularly annoying is that the affected switch
hardware also serves a lot of our critical services, some of which
store data on our internal Ceph cluster, which was itself hit by the
network integrity issue (looong heartbeats, like tens of seconds;
presumed-dead mons and OSDs flapping; etc.). Narrowing it down to a
specific cause was not fun.
Anyway, we're mostly back up now, and we identified a failing switch
before it failed completely, which is a bit of a silver lining.