Hi Stefan,
thanks for the additional info. Dell will put me in touch with their deployment team
soonish, and then I can ask about matching capabilities.
It turns out that the problem I observed might have a much more mundane cause. I saw
really long periods with slow ping times yesterday and finally managed to pin it down to a
flapping link. My best bet is that an SFP transceiver has gone bad.
What really surprises me is that the switch does not seem to have any flap detection. It
happily takes the port up and down several times per second. Unfortunately, I can't find
anything about server-side flap detection for mode=4 bonds, nor for members of a LAG on
the switch. Do you know of anything that does this? I might be searching for the
wrong term.
We have quite high redundancy. I can lose up to 3 ports on a server before the aggregated
bandwidth might get too small. Therefore, I would happily accept the occasional false
positive as long as we don't miss the real flaps. Something like "permanently
shut down an interface if it goes down and up 3 times per second" would be perfect.
Ideally without having to watch the logs.
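To make clear what I have in mind, here is a rough sketch. Everything in it is an assumption on my part: the 3-per-second threshold is just an example, and the input format would in practice have to be produced by timestamping the output of "ip monitor link".

```shell
#!/bin/sh
# Sketch only: count link-down events per (second, interface) and flag an
# interface once it flaps more often than a threshold within one second.
# Input lines are assumed to look like "EPOCHSECOND IFACE down", e.g. as
# produced by parsing timestamped "ip monitor link" output.
flap_filter() {
  awk -v max="${1:-3}" '
    $3 == "down" {
      key = $1 "/" $2                 # one counter per (second, interface)
      if (++cnt[key] > max && !flagged[$2]) {
        flagged[$2] = 1
        print $2                      # candidate for "ip link set ... down"
      }
    }'
}

# A wrapper would then permanently shut each flagged interface IF with
#   ip link set "$IF" down
# so nobody has to watch the logs.
```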
For the future, I plan to go 25G active-passive without a preferred port. That
configuration should handle flapping gracefully.
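As a sketch, that future config would look roughly like this with plain iproute2 (the interface names are placeholders):

```shell
# Sketch: active-backup (active-passive) bond with no preferred port.
# ens1f0/ens1f1 are placeholder names for the two 25G interfaces.
ip link add bond0 type bond mode active-backup miimon 100
ip link set ens1f0 down
ip link set ens1f0 master bond0
ip link set ens1f1 down
ip link set ens1f1 master bond0
ip link set bond0 up
# Deliberately no "primary" member: after a failover the bond stays on
# whichever member is healthy instead of failing back to a flapping link.
```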
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Stefan Kooman <stefan(a)bit.nl>
Sent: 15 February 2021 21:24:09
To: Frank Schilder
Cc: ceph-users(a)ceph.io
Subject: Re: [ceph-users] Re: Network design issues
On 2/15/21 5:38 PM, Frank Schilder wrote:
Hi Stefan,
I think you gave me the right pointers.
Last summer I was looking up exactly this: how do Dell switches hash connections onto
members of a LAG? What I found was that the only option was by MAC. I ran a test with
iperf using several connections between the same two servers, and from one to many. The
test confirmed what I found in the documentation: all connections between two servers
shared a single 10G member, while one-to-many connections were distributed over multiple
members. Back then, I thought that was it and didn't look into it further.
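For reference, the test was essentially of this form (hostname and stream count are just examples):

```shell
# Sketch of the hashing test: several parallel TCP streams between the
# same two hosts. "server01" is a placeholder hostname.
iperf3 -s                       # on the receiver
iperf3 -c server01 -P 8 -t 30   # sender: 8 parallel streams for 30 s
# With MAC-based LAG hashing, all 8 streams land on one member link;
# with multi-field (e.g. IP/port) hashing they can spread across members.
```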
Now, after your hints, I went back to the manual and found that the switches actually do
support more advanced hash functions - at least after enabling ECMP, which is disabled by
default. I'm not sure if I was reading a manual for the wrong switch family; I have no
idea where I found the "MAC only" statement. I got in touch with Dell support to help me
here, as the manual on load balancing is not exactly great.
I can use MACs, IPs, ports, VLAN IDs and a few other packet fields for hashing -
hopefully not only in layer-3 routing. In particular, including the VLAN ID should help
spread client and replication traffic out a bit better. Dell also supports defining salts
to avoid polarisation, which I believe is hurting us as well at the moment.
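On the server side, the equivalent knob for the Linux bonding driver is xmit_hash_policy; a sketch (the bond name is a placeholder):

```shell
# Sketch: multi-field hashing on the Linux side of a mode=4 (LACP) bond.
# "layer3+4" hashes on IP addresses and ports, so parallel connections
# between the same two hosts can use different member links.
cat /sys/class/net/bond0/bonding/xmit_hash_policy    # show current policy
echo layer3+4 > /sys/class/net/bond0/bonding/xmit_hash_policy
```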
I have one last question. The Dell manual states that one can enable monitoring of load
balancing, and it will check every 15 seconds for imbalance across the members of a LAG.
You wrote "... and with OVS you can balance the load between the LACP links (by default
it evaluates every 10 seconds if it should move flows around)." How is this done? The
hash function doesn't change, so how can port mappings be re-arranged in a predictable
way? The Dell switches will only create log events, nothing more. The Dell manual uses the
term "dynamic load balancing", but generating log messages is not really the
same. Am I missing something?
When the workload is perfectly static, nothing changes. But that will
hardly ever be the case. Here is the info for OVS on this:
"Every 10 seconds, vswitchd rebalances the bond members (see
bond_rebalance()). To rebalance, vswitchd examines the statistics for
the number of bytes transmitted by each member over approximately the
past minute, with data sent more recently weighted more heavily than
data sent less recently. It considers each of the members in order from
most-loaded to least-loaded. If highly loaded member H is significantly
more heavily loaded than the least-loaded member L, and member H carries
at least two hashes, then vswitchd shifts one of H’s hashes to L.
However, vswitchd will only shift a hash from H to L if it will decrease
the ratio of the load between H and L by at least 0.1.
Currently, “significantly more loaded” means that H must carry at least
1 Mbps more traffic, and that traffic must be at least 3% greater than L’s."
So if it makes sense to move one or more flows to other links, it will
do so.
I guess the Dell switches will do something similar.
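For reference, a sketch of such an OVS bond (bridge and interface names are placeholders):

```shell
# Sketch: OVS LACP bond in balance-tcp mode, where vswitchd's periodic
# rebalance (described above) can move flow hashes between members.
# br0/bond0/ens1f0/ens1f1 are placeholder names.
ovs-vsctl add-bond br0 bond0 ens1f0 ens1f1 \
    lacp=active bond_mode=balance-tcp \
    other_config:bond-rebalance-interval=10000   # interval in ms (10 s)
ovs-appctl bond/show bond0   # inspect per-hash load and member assignment
```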
For us, I think slightly cleverer hashing and, maybe, a higher priority for the
replication VLAN will do. As far as I can see, our cluster is essentially running on 10G
internally, and anything better than that should suffice and be easy to achieve.
Thanks for putting me on the right track.
Good to hear, I hope you manage to solve it.
Gr. Stefan