Hi all. We've had problems in the sepia lab again, but we're hoping
that most services are back up and stable. As always, please let us
know of any problems that seem to be infrastructure-related via email
here or to our Slack channel
https://ceph-storage.slack.com/archives/C1HFJ4VTN.
Details:
We had scheduled maintenance on all lab switches on 31 Jan that was,
as always, supposed to be a short interruption with no expected
problems. That turned out not to be the case, but after a few days of
bad connectivity (not missing, just bad: lost packets, retransmissions,
really bursty and inconsistent performance) the network operations team
thinks it's resolved. The cause seems to have been a lack of proper
support for the switches' optical transceivers in the updated firmware
(mitigated by installing a different firmware branch), as well as at
least one virtual chassis with failing internal hardware, which is
being RMAed as of today.
So things should have gotten a *lot* better as of today, but there are
still some systems that will perform worse than normal and will need
another interruption when the replacement hardware arrives. Even those
hosts remain connected; they're just suffering from poor packet
integrity. They include a few of the Ceph cluster hosts, a few test
hosts, and a couple of others. Even so, the cluster has been pretty
happy since we reloaded the firmware branch we settled on.
The reason this was particularly annoying is that the affected switch
hardware also serves a lot of our critical services, some of which
store data on our internal Ceph cluster, which was itself hit by the
network integrity issue (looong heartbeats, like tens of seconds;
presumed-dead mons and OSDs flapping; etc.). Narrowing it down to a
specific cause was not fun.
Anyway, we're mostly back up now, and we identified a failing switch
before it failed completely, which is a bit of a silver lining.