On Fri, Feb 3, 2023 at 12:32 PM Dan Mick <dmick(a)redhat.com> wrote:
Hey Dan,
First of all, thanks for your work on sorting this out.
> Hi all. We've had problems in the sepia lab again, but we're hoping
> that most services are back up and stable. As always, please let us
> know of any problems that seem to be infrastructure-related via email
> here or to our Slack channel
> https://ceph-storage.slack.com/archives/C1HFJ4VTN.
> Details:
<snip>
> The reason this was particularly annoying is that the switch hardware
> also runs a lot of our critical services, some of which also store data
> on our internal Ceph cluster which was itself also affected by the
> network integrity issue (loooooong heartbeats (like, tens of seconds),
> presumed-dead mons and osds flapping, etc.). Narrowing down on a
> specific cause was not fun.
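Side note, not a criticism: when the network itself is the known suspect,
one way to quiet that kind of flap storm while the switch is being dealt
with is to set the nodown/noout flags so the mons stop churning the osdmap
on every peer failure report. A rough sketch of what I mean, assuming the
ceph CLI is on PATH with an admin keyring on one node; the wrapper
functions are purely illustrative, not an existing tool:

#!/usr/bin/env python3
# Rough sketch (illustrative names): pause OSD down/out handling while a
# known network problem is being fixed, then re-enable it afterwards.
# Assumes the `ceph` CLI is available with an admin keyring.
import json
import subprocess

def ceph(*args):
    # Run a ceph CLI command and return its stdout as text.
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout

def freeze_flapping():
    ceph("osd", "set", "nodown")   # mons stop marking OSDs down on reports
    ceph("osd", "set", "noout")    # and don't mark anything out/rebalance

def unfreeze_flapping():
    ceph("osd", "unset", "nodown")
    ceph("osd", "unset", "noout")

def down_per_osdmap():
    # OSD ids the current map considers down (up == 0).
    dump = json.loads(ceph("osd", "dump", "--format", "json"))
    return [o["osd"] for o in dump["osds"] if not o["up"]]

if __name__ == "__main__":
    print("down per osdmap:", down_per_osdmap())
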
Were the OSDs actually flapping, or were they just being reported down
by peer OSDs that had lost connection to them? I'm curious, as I
wouldn't expect them to actually go down, just to be reported down (but
maybe that's what you meant by 'flapping').
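
One way to tell the difference from the logs, for what it's worth: an OSD
that was alive the whole time will log "wrongly marked me down" when it
sees itself marked down in the map and asks to come back up, whereas one
that really died won't. A rough sketch, assuming the default log location
on each OSD host; the helper name is illustrative:

#!/usr/bin/env python3
# Rough sketch: count "wrongly marked me down" events per OSD log, to
# separate OSDs that stayed alive (but were reported down by peers) from
# ones that actually went down. Assumes default /var/log/ceph paths.
import glob
import re

PATTERN = re.compile(r"wrongly marked me down")

def wrongly_marked_down(logdir="/var/log/ceph"):
    # Map each OSD log file to its count of "wrongly marked me down" lines.
    hits = {}
    for path in glob.glob(f"{logdir}/ceph-osd.*.log"):
        with open(path, errors="replace") as f:
            count = sum(1 for line in f if PATTERN.search(line))
        if count:
            hits[path] = count
    return hits

if __name__ == "__main__":
    for path, count in sorted(wrongly_marked_down().items()):
        print(f"{path}: {count}")
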
> Anyway, we're mostly back up now, and we've identified a failing switch
> before it completely failed, which is a bit of a silver lining.
--
Cheers,
Brad