On Fri, Feb 3, 2023 at 12:32 PM Dan Mick <dmick(a)redhat.com> wrote:
Hey Dan,
First of all, thanks for your work on sorting this out.
> Hi all. We've had problems in the sepia lab again, but we're hoping
> that most services are back up and stable. As always, please let us
> know of any problems that seem to be infrastructure-related via email
> here or to our Slack channel
> https://ceph-storage.slack.com/archives/C1HFJ4VTN.
> Details:
<snip>
> The reason this was particularly annoying is that the switch hardware
> also runs a lot of our critical services, some of which also store data
> on our internal Ceph cluster which was itself also affected by the
> network integrity issue (loooooong heartbeats (like, tens of seconds),
> presumed-dead mons and osds flapping, etc.). Narrowing down on a
> specific cause was not fun.
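Side note, not a criticism: when the network itself is the known suspect,
one way to quiet that kind of flap storm while the switch is being dealt
with is to set the nodown/noout flags so the mons stop churning the osdmap
on every peer failure report. A rough sketch of what I mean, assuming the
ceph CLI is on PATH with an admin keyring on one node; the wrapper
functions are purely illustrative, not an existing tool:

#!/usr/bin/env python3
# Rough sketch (illustrative names): pause OSD down/out handling while a
# known network problem is being fixed, then re-enable it afterwards.
# Assumes the `ceph` CLI is available with an admin keyring.
import json
import subprocess

def ceph(*args):
    # Run a ceph CLI command and return its stdout as text.
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout

def freeze_flapping():
    ceph("osd", "set", "nodown")   # mons stop marking OSDs down on reports
    ceph("osd", "set", "noout")    # and don't mark anything out/rebalance

def unfreeze_flapping():
    ceph("osd", "unset", "nodown")
    ceph("osd", "unset", "noout")

def down_per_osdmap():
    # OSD ids the current map considers down (up == 0).
    dump = json.loads(ceph("osd", "dump", "--format", "json"))
    return [o["osd"] for o in dump["osds"] if not o["up"]]

if __name__ == "__main__":
    print("down per osdmap:", down_per_osdmap())
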
Were the OSDs actually flapping, or were they just being reported down
by peer OSDs that had lost connection to them? I'm curious, as I
wouldn't expect them to actually go down, just to be reported down (but
maybe that's what you meant by 'flapping').
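
One way to tell the difference from the logs, for what it's worth: an OSD
that was alive the whole time will log "wrongly marked me down" when it
sees itself marked down in the map and asks to come back up, whereas one
that really died won't. A rough sketch, assuming the default log location
on each OSD host; the helper name is illustrative:

#!/usr/bin/env python3
# Rough sketch: count "wrongly marked me down" events per OSD log, to
# separate OSDs that stayed alive (but were reported down by peers) from
# ones that actually went down. Assumes default /var/log/ceph paths.
import glob
import re

PATTERN = re.compile(r"wrongly marked me down")

def wrongly_marked_down(logdir="/var/log/ceph"):
    # Map each OSD log file to its count of "wrongly marked me down" lines.
    hits = {}
    for path in glob.glob(f"{logdir}/ceph-osd.*.log"):
        with open(path, errors="replace") as f:
            count = sum(1 for line in f if PATTERN.search(line))
        if count:
            hits[path] = count
    return hits

if __name__ == "__main__":
    for path, count in sorted(wrongly_marked_down().items()):
        print(f"{path}: {count}")
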
> Anyway, we're mostly back up now, and we've identified a failing switch
> before it completely failed, which is a bit of a silver lining.
--
Cheers,
Brad