If your public
network is saturated, that actually is a problem; the last thing you want is to add
recovery traffic or to slow down heartbeats. For most people, it isn’t saturated.
See Frank Schilder's post about a meltdown which he believes could have
been caused by beacon/heartbeat messages being drowned out by other
recovery/IO traffic, not at the network level, but at the processing level
on the OSDs. If there are indeed cases where the OSDs are too busy to send
(or process) heartbeat/beacon messages, a separate network wouldn't help,
would it?
Agreed. Many times I’ve had to argue that CPUs that aren’t nearly saturated *aren’t*
necessarily overkill, especially with fast media where latency hurts. It would be
interesting to consider an architecture where a core/HT is dedicated to the control
plane.
That said, I’ve seen a situation where excess CPU capacity appeared to hurt latency by
allowing the CPUs to drop into deeper C-states; this especially affected network traffic
(2x dual 10GE).
Curiously, some systems in the same cluster experienced this but some didn’t. There was a
mix of Sandy Bridge and Ivy Bridge IIRC, as well as different Broadcom chips. Despite an
apparent alignment with older vs newer Broadcom chips, I never fully characterized the
situation; replacing one of the Broadcom NICs in an affected system with the model in use
on unaffected systems didn’t resolve the issue. It’s possible that replacing the other
would have made a difference.
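
For anyone wanting to check whether deep C-states are in play on a given host, here is a
minimal sketch, assuming a Linux system that exposes the standard cpuidle sysfs layout
(/sys/devices/system/cpu/cpuN/cpuidle/stateM/ with name, latency and usage files). The
function name read_cstates is made up for illustration; this is not what I ran at the
time, just one way to compare affected and unaffected nodes.

```python
from pathlib import Path


def read_cstates(cpu_dir):
    """Return [(name, exit_latency_us, usage_count), ...] for one CPU.

    cpu_dir is a per-CPU sysfs directory such as
    /sys/devices/system/cpu/cpu0 (assumed layout: cpuidle/stateN/*).
    Returns an empty list if the cpuidle directory is absent.
    """
    cpuidle = Path(cpu_dir) / "cpuidle"
    # Sort numerically so state10 comes after state2, not after state1.
    state_dirs = sorted(cpuidle.glob("state[0-9]*"),
                        key=lambda p: int(p.name[len("state"):]))
    states = []
    for state in state_dirs:
        name = (state / "name").read_text().strip()
        latency_us = int((state / "latency").read_text())
        usage = int((state / "usage").read_text())
        states.append((name, latency_us, usage))
    return states


if __name__ == "__main__":
    for name, latency_us, usage in read_cstates("/sys/devices/system/cpu/cpu0"):
        print(f"{name:10s} exit latency {latency_us:5d} us, entered {usage} times")
```

High usage counts on states with large exit latencies would at least be consistent with
the behaviour described above, though it doesn't prove causation.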