Hi Anthony and Phil,
since my meltdown case was mentioned and I might have a network capacity issue, here is a
question about why having separate VLANs for the private and public network might have its
merits:
In the part of our ceph cluster that was overloaded (the cluster has two sites that are
logically separate and physically distinct), I see a lot of dropped packets on the spine
switch, and it looks like it's on the downlinks to the leaf switches where the storage
servers are connected. I'm still not finished investigating, so a network overload is still
a hypothetical part of our meltdown. The question below should, however, be interesting in
any case, as it might help prevent a meltdown in similar set-ups.
Our network connectivity is as follows: we have 1 storage server and up to 18 clients per
leaf. The storage servers have 6x10G connectivity in an LACP bond; the front and back
networks share all ports but are separated by VLAN. The clients have 1x10G on the public
network. Unfortunately, the up-links from leaf to spine switches are currently limited to
2x10G. We are in the process of upgrading to 2x40G, so let's ignore fixing this temporary
bottleneck here (Corona got in the way) and focus on workarounds until we can access the
site again.
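For illustration, a 6x10G LACP bond with front/back VLANs on top could be built with
iproute2 roughly like the sketch below; the interface names, VLAN IDs and the 802.1p
priority are placeholders for this example, not our actual values:

# ip link add bond0 type bond mode 802.3ad miimon 100                        # LACP bond over the six 10G ports
# ip link set ens1f0 down && ip link set ens1f0 master bond0                 # repeat for the other five ports
# ip link add link bond0 name bond0.101 type vlan id 101                     # front/public VLAN
# ip link add link bond0 name bond0.102 type vlan id 102 egress-qos-map 0:5  # back/cluster VLAN, 802.1p prio 5 on egress
# ip link set bond0 up && ip link set bond0.101 up && ip link set bond0.102 up

The egress-qos-map is just one way to make the back VLAN's frames visible to the switches
with a higher 802.1p priority; a purely switch-side per-VLAN policy would work as well.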
Currently, every write hits every storage server (10 servers with 8+2 EC). Since we
believed the low uplink bandwidth would only be a short-term condition during the network
upgrade, we were willing to accept it, assuming that the competition between client and
storage traffic would throttle the clients enough to keep the system working, perhaps with
reduced performance, but without becoming unstable.
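To put numbers on this (back-of-envelope only, ignoring protocol overhead): with 8+2 EC a
client write of size S becomes 10 shards of size S/8, i.e. 1.25 x S written in total. The
primary OSD keeps one shard and sends the remaining 9 x S/8, roughly 1.1 x S, to the other
storage servers over the cluster network, on top of the S the client already sent over the
public network. With only one storage server per leaf, almost all of that fan-out has to
cross the 2x10G up-links, which the up to 18x10G of client connectivity per leaf also
funnels through, so saturating the up-links under load seems entirely plausible.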
The questions relevant to this thread:
I kept the separation into public and cluster network because it enables QoS definitions,
which are typically made per VLAN. In my situation, what if the up-links were saturated by
the competing client and storage-server traffic? Both run on the same VLAN, obviously. The
only way to make room for the OSD/heartbeat traffic would be to give the cluster-network
VLAN higher priority than the public network via QoS settings. This should at least allow
the OSDs to keep exchanging heartbeats etc. over a busy line.
Is this correct?
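For concreteness, the split I mean is the standard one in ceph.conf; the subnets below are
placeholders, not our real ones:

[global]
    public_network  = 192.168.100.0/24    # front VLAN: client and MON traffic
    cluster_network = 192.168.200.0/24    # back VLAN: OSD replication/recovery traffic

The switch-side QoS policy would then simply give the back VLAN priority over the front
VLAN on the shared up-links.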
This also raises a question I had a long time ago, which was also raised by Anthony: why
are the MONs not on the cluster network? If I can make a priority line for the OSDs, why
can't I make OSD-MON communication a priority too?
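(As far as I can tell, the MONs only ever bind to public-network addresses; the monmap
shows where they listen, e.g.:

# ceph mon dump
...
0: [v2:<public-addr>:3300/0,v1:<public-addr>:6789/0] mon.ceph-01
...

addresses shortened here.)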
While digging through heartbeat options as a consequence of our meltdown, I found this
one:
# ceph daemon osd.0 config show | grep heart
...
"osd_heartbeat_addr": "-",
...
# ceph daemon mon.ceph-01 config show | grep heart
...
"osd_heartbeat_addr": "-",
...
Is it actually possible to reserve a dedicated (third) VLAN with high QoS for heartbeat
traffic by providing a per-host IP address via this parameter? What does this parameter
do?
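Independent of that, the addresses each OSD has actually bound can be seen in the OSD map;
if I read the output correctly, each osd line lists the public, cluster, heartbeat-back and
heartbeat-front addresses in that order (output shortened and from memory):

# ceph osd dump | grep '^osd.0 '
osd.0 up in weight 1 ... [v2:<public-addr>,v1:...] [v2:<cluster-addr>,v1:...] [v2:<hb-back-addr>,v1:...] [v2:<hb-front-addr>,v1:...] exists,up ...

That would at least show whether heartbeats follow a third address if one could be
configured.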
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Anthony D'Atri <anthony.datri(a)gmail.com>
Sent: 09 May 2020 23:59:49
To: Phil Regnauld
Cc: ceph-users(a)ceph.io
Subject: [ceph-users] Re: Cluster network and public network
If your public network is saturated, that actually is a problem; the last thing you want is
to add recovery traffic or to slow down heartbeats. For most people, it isn't saturated.
See Frank Schilder's post about a meltdown which he believes could have
been caused by beacon/heartbeat messages being drowned out by other recovery/IO
traffic, not at the network level, but at the processing level on the OSDs.
If indeed there are cases where the OSDs are too busy to send (or process)
heartbeat/beacon messaging, it wouldn't help to have a separate network?
Agreed. Many times I’ve had to argue that CPUs that aren’t nearly saturated *aren’t*
necessarily overkill, especially with fast media where latency hurts. It would be
interesting to consider an architecture where a core/HT is dedicated to the control
plane.
That said, I’ve seen a situation where excess CPU headroom appeared to affect latency by
allowing the CPUs to drop into deeper C-states; this especially affected network traffic
(2x dual 10GE).
Curiously, some systems in the same cluster experienced this but some didn’t. There was a
mix of Sandy Bridge and Ivy Bridge IIRC, as well as different Broadcom chips. Despite an
apparent alignment with older vs. newer Broadcom chips, I never fully characterized the
situation: replacing one of the Broadcom NICs in an affected system with the model in use
on unaffected systems didn’t resolve the issue. It’s possible that replacing the other
would have made a difference.
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io