The reason
this was particularly annoying is that the switch hardware
also runs a lot of our critical services, some of which also store data
on our internal Ceph cluster, which was itself affected by the
network integrity issue (loooooong heartbeats, as in tens of seconds;
presumed-dead mons; flapping osds; etc.). Narrowing down to a
specific cause was not fun.

Were the osds actually flapping, or were they just being reported down
by osds that had lost connection with them? Curious, as I wouldn't expect
them to actually go down, just be reported down (but then maybe that's
what you meant by 'flapping').

I might have extemporized a bit too far. The mons were going in and out
of quorum; I did not prove they were restarting. As for the osds, I don't
know; all I know for certain is that they were reporting *huge*
osd-to-osd heartbeat times. So the processes may not have actually exited
in either case.