Hi Stefan,
I think you gave me the right pointers.
Last summer I was looking up exactly this: how do Dell switches hash connections onto
the members of a LAG? What I found then was that the only option was by MAC address. I
ran a test with iperf using several connections between the same two servers, or from one
server to many. The test confirmed what I had found in the documentation: all connections
between two servers shared a single 10G member, while one-to-many connections were
distributed over multiple members. Back then I thought that was it and didn't look into
it further.
Now, after your hints, I went back to the manual and found that the switches actually do
support more advanced hash functions, at least after enabling ECMP (it is disabled by
default). I'm not sure if I was reading the manual for the wrong switch family; I have no
idea where I found the "MAC only" statement. I got in touch with Dell support to help me
here; the manual section on load balancing is not exactly great.
I can use MACs, IPs, ports, the VLAN ID and a few other packet fields for hashing;
hopefully not only in layer-3 routing. In particular, including the VLAN ID should help
spread client and replication traffic out a bit better. Dell also supports defining salts
to avoid polarisation, which I believe is hurting us as well at the moment.
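To convince myself of the difference, I played with a toy model of the member selection
(illustrative only: I used sha256 here, not whatever hash the switch ASIC actually
implements, and all addresses, ports and VLAN IDs below are made up):

```python
import hashlib

def lag_member(fields, salt, n_members):
    """Pick a LAG member index from the selected header fields plus a salt."""
    key = repr((salt,) + tuple(fields)).encode()
    return int(hashlib.sha256(key).hexdigest(), 16) % n_members

macs = ("aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02")

# MAC-only hashing: every connection between the same two servers maps to
# the same member, so they all share one 10G link.
mac_only = {lag_member(macs, 0, 6) for _ in range(100)}

# MAC+IP+port+VLAN hashing: different source ports and VLANs spread the
# connections over several members.
flows = [macs + ("10.0.0.1", "10.0.0.2", sport, 6789, vlan)
         for sport in range(40000, 40020) for vlan in (81, 82)]
rich = {lag_member(f, 0, 6) for f in flows}

# A different salt remaps the flows, so cascaded switches using the same
# hash don't all make identical choices (the polarisation problem).
resalted = {lag_member(f, 1, 6) for f in flows}

print(len(mac_only), len(rich), len(resalted))
```

The MAC-only set always has exactly one element, which matches what I saw in the iperf
test; the richer field selection uses several members.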
I have one last question. The Dell manual states that one can enable monitoring of load
balancing; the switch will then check every 15 seconds for imbalance across the members
of a LAG. You wrote "... and with OVS you can balance the load between the LACP links (by
default it evaluates every 10 seconds if it should move flows around)." How is this done?
The hash function doesn't change, so how can port mappings be rearranged in a predictable
way? The Dell switches will only create log events, nothing more. The Dell manual uses
the term "dynamic load balancing", but generating log messages is not really the same. Am
I missing something?
For us, I think slightly cleverer hashing and, maybe, a higher priority for the
replication VLAN will do. As far as I can see, our cluster essentially runs on 10G
internally, and anything better than that should be enough and easy to achieve.
Thanks for putting me on the right track.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: 12 February 2021 18:59:23
To: Stefan Kooman
Cc: ceph-users(a)ceph.io
Subject: [ceph-users] Re: Network design issues
By the way, thanks for reminding me of bmon! Of course. I have a decent collection of live
monitoring tools installed and bmon was one of the first. How could I forget?
Another tool I became good friends with is atop. It gives a really good overview of the
entire system, including network, disks, swap paging, you name it. I forgot about that
too.
Have a good weekend.
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder
Sent: 12 February 2021 18:52:05
To: Stefan Kooman
Cc: ceph-users(a)ceph.io
Subject: Re: [ceph-users] Network design issues
Hi Stefan,
OK, I added the ceph-users again :)
Thanks for your reply; those are a lot of useful pointers. Yes, it's Dell EMC switches
running OS9, and I believe they support per-VLAN bandwidth reservations. That would be
the easiest thing to configure and test. At the moment, I always see the slow ping times
on both the front and the back interface at the same time, on exactly the same OSD pairs.
If I reserve bandwidth for the replication VLAN and the slow ping times on the back
interface disappear, that would be a really strong clue.
I will go through everything after the weekend.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Stefan Kooman <stefan(a)bit.nl>
Sent: 12 February 2021 18:18
To: Frank Schilder
Subject: Re: [ceph-users] Network design issues
On 2/12/21 5:27 PM, Frank Schilder wrote:
Hi Stefan,
do you want to keep this out of the ceph-users list or was it a click-and-miss?
^^ This. I recently switched to Thunderbird because of a mail migration
(from Mutt) and I'm not used to it yet. I *tried* to reply to all
(incl. the list) but might have screwed up.
I would consider this of general interest.
Thanks for your detailed reply. I take it that I need to provide more info, and I will
try to make a few sketches of the architecture. I think that will help explain the problem.
Some quick replies:
I'm curious what you changed. Want to share
it?
# ceph config set mds mds_max_caps_per_client 65536
Thread "cephfs: massive drop in MDS requests per second with increasing number of
caps"
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/B7K6B5VXM3I…
Ah yes, I've read that thread. Interesting. I haven't tested it out
yet, but will do so.
There are a number of config values with defaults that are significantly too large; this
is one of them. Another one is mon_sync_max_payload_size.
Quite a few people have run into issues with that setting. We haven't had
any issues with it yet, but perhaps I should scale it down as well.
Do you know what causes the slow ops?
I don't care about slow ops under high load; those are to be expected. I worry about the
"slow ping times". Those are not expected and are almost certainly caused by congestion
of a link.
Yeah sure, I would suspect that as well. Or "discards" from a switch
because of errors, but those are less likely.
I don't quite get the 10G bottleneck. Sure, a
client can saturate a 10
Gb/s link, but how does this affect storage <-> storage (replication)
traffic and / or other clients?
Because it all happens on the same physical link. We don't have a dedicated replication
network; it's all mixed on the same hardware. If a 10G link is saturated, nothing else
moves through that particular link, and the clients are so superior in capacity that they
can easily starve parts of the internal Ceph traffic this way. Basically, we started out
with a dedicated replication VLAN and decided to merge it with the access VLAN for
simplicity of the set-up. Our networking is currently equivalent to having a single
network only. Here are the interfaces:
ceph0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP
group default qlen 1000
ceph0.81@ceph0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP
group default qlen 1000
ceph0.82@ceph0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP
group default qlen 1000
$ ethtool ceph0
Settings for ceph0:
Supported ports: [ ]
Supported link modes: Not reported
Supported pause frame use: No
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes: Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: 60000Mb/s
Duplex: Full
Port: Other
PHYAD: 0
Transceiver: internal
Auto-negotiation: off
Cannot get wake-on-lan settings: Operation not permitted
Link detected: yes
The bond is 6x10G active-active; VLAN 81 is the access VLAN and VLAN 82 is the
replication network. It all goes over the same lines. This config is very convenient for
maintenance, but it seems to suffer from not physically reserving bandwidth for VLAN 82.
Maybe such a bandwidth-reservation QoS definition could already help?
Are these Dell / EMC switches? You might be able to give priority on a
VLAN level, or "shape" bandwidth based on VLANs. I know that the Aristas
(that we use) have support for that in newish firmware. You might also
want to support "pause" frames (Ethernet flow control), as that might
help during congestion (a back-off protocol), see:
https://en.wikipedia.org/wiki/Ethernet_flow_control
Just to note: we don't have a separate replication network / interfaces.
We only have one network. Wido (den Hollander) and I don't see any added
benefit of a separate network either. You only waste bandwidth if you
split them up, and it makes debugging more complex in certain failure
scenarios. Do you know what hashing is in use for the LACP port-channel?
You want to use MAC, IP and port (the 5-tuple). We use Open vSwitch (OVS) a
lot, and with OVS you can balance the load between the LACP links (by
default it evaluates every 10 seconds if it should move flows around).
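As far as I understand it, OVS hashes flows into a fixed number of buckets and only
re-assigns buckets to LAG members; the hash itself never changes. A toy sketch of that
idea (not the actual OVS code; the bucket count, interface names and loads are made up):

```python
import random

random.seed(42)
N_BUCKETS = 256
MEMBERS = ["eth0", "eth1", "eth2"]

# Flows hash into a fixed set of buckets; each bucket is assigned to one
# LAG member. Rebalancing re-assigns buckets, never the hash function.
assign = {b: MEMBERS[b % len(MEMBERS)] for b in range(N_BUCKETS)}
load = {b: random.randint(0, 10_000) for b in range(N_BUCKETS)}  # bytes/interval

def member_load():
    """Total observed load per LAG member."""
    totals = {m: 0 for m in MEMBERS}
    for b, m in assign.items():
        totals[m] += load[b]
    return totals

def rebalance():
    """Move one bucket from the busiest to the idlest member, if it helps."""
    totals = member_load()
    hot = max(totals, key=totals.get)
    cold = min(totals, key=totals.get)
    delta = totals[hot] - totals[cold]
    # only consider buckets whose move actually narrows the gap
    candidates = [b for b, m in assign.items() if m == hot and 0 < load[b] < delta]
    if candidates:
        best = min(candidates, key=lambda b: abs(delta - 2 * load[b]))
        assign[best] = cold

before = member_load()
for _ in range(20):          # e.g. one pass per 10-second interval
    rebalance()
after = member_load()
print(before, after)
```

So the mapping is perfectly predictable between rebalance runs; it is only the
periodic bucket re-assignment that moves flows to another link.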
I doubt there is a silver bullet, but hey, you never know. Do change one
thing at a time, otherwise it will be hard to know what the effect is of
each of the changes (they might even cancel each other out).
I will provide a sketch of the set-up; I think this will make things clearer. I don't
think we have an aggregate bandwidth problem. I believe what we have is a load
distribution/priority problem over the physical links in the aggregation group
"ceph0" on the storage servers.
Yes, your issue makes more sense to me now. Do you have any metrics on
the load of the individual links? Even bmon might be a useful tool. You
might want to capture metrics frequently (every second or so) to detect
"bursts" of traffic that might cause issues, just to make sure you are
on the right track. We use telegraf as the metric-collecting agent, sending
the data to InfluxDB, but there are many more options.
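For a quick look without a full metrics stack, something like this could do (a sketch;
it parses the standard Linux /proc/net/dev counters, and the interface names are whatever
your bond members are called):

```python
import time

def read_counters(path="/proc/net/dev"):
    """Return {iface: (rx_bytes, tx_bytes)} parsed from /proc/net/dev."""
    stats = {}
    with open(path) as f:
        for line in f.readlines()[2:]:            # skip the two header lines
            iface, data = line.split(":", 1)
            fields = data.split()
            # field 0 is rx bytes, field 8 is tx bytes
            stats[iface.strip()] = (int(fields[0]), int(fields[8]))
    return stats

def watch(ifaces, seconds=10):
    """Print per-second rx/tx rates for the given bond members."""
    prev = read_counters()
    for _ in range(seconds):
        time.sleep(1)
        cur = read_counters()
        for i in ifaces:
            rx = (cur[i][0] - prev[i][0]) / 1e6
            tx = (cur[i][1] - prev[i][1]) / 1e6
            print(f"{i}: rx {rx:7.1f} MB/s  tx {tx:7.1f} MB/s")
        prev = cur
```

Something like `watch(["eth0", "eth1"], 60)` would already show whether one member
saturates while the others idle.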
And then there are also other things to tune: TCP checksum offload et
al. You might also hit IRQ-balancing issues, and there are ways to
overcome those as well. Are those single-CPU systems? And/or AMD? NUMA might
be a factor too; ideally you have the Ceph OSD daemons pinned to the CPU
that the network / storage adapters are connected to.
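A quick way to check that locality is to compare the NIC's NUMA node with the CPUs an
OSD is allowed to run on (a sketch; the sysfs paths are the standard Linux ones, and any
PIDs or interface names are placeholders):

```python
def nic_numa_node(iface, base="/sys/class/net"):
    """NUMA node of a NIC's PCI device; -1 for virtual devices like a bond."""
    try:
        with open(f"{base}/{iface}/device/numa_node") as f:
            return int(f.read())
    except OSError:
        return -1

def parse_cpulist(text):
    """Expand a sysfs cpulist like '0-3,8-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus

# Example check (placeholder PID/interface): compare an OSD's affinity,
# os.sched_getaffinity(osd_pid), against the CPUs of the NIC's node,
# parse_cpulist(open(f"/sys/devices/system/node/node{n}/cpulist").read()).
```

If the OSD's affinity set and the NIC's node CPUs don't overlap, every packet crosses
the interconnect.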
Finally, this might be of use:
http://www.brendangregg.com/usemethod.html ;-).
Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io