New subject: Network design issues

12 Feb 2021

Hi Stefan,

OK, I added the ceph-users again :)

Thanks for your reply, this is a lot of useful pointers. Yes, its Dell EMC switches
running OS9 and I believe they support per-VLAN bandwidth reservations. It would be the
easiest to configure and test. At the moment, I always see the slow ping times on both,
the front- and back interface at the same time on exactly the same OSD pairs. If I reserve
bandwidth to the replication VLAN and the slow ping times on the back interface disappear,
this would be a really strong clue.

I will go through everything after the weekend.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Stefan Kooman &lt;stefan(a)bit.nl&gt;
Sent: 12 February 2021 18:18
To: Frank Schilder
Subject: Re: [ceph-users] Network design issues

On 2/12/21 5:27 PM, Frank Schilder wrote:
...
  Hi Stefan,

 do you want to keep this out of the ceph-users list or was it a click-and-miss? 
^^ This, I recently switched to Thunderbird because of mail migration
(from Mutt) ... and I'm not used to it yet. I *tried* to reply to all
(incl. list) but might have screwed up.

I would consider this as of general interest.
...

 Thanks for your detailed reply. I take it that I need to provide more info and will try
to make a few sketches of the architecture. I think it will help explaining the problem.
Some quick replies:

  I'm curious what you changed. Want to share
it? 
 # ceph config set mds mds_max_caps_per_client 65536

 Thread "cephfs: massive drop in MDS requests per second with increasing number of
caps"

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/B7K6B5VXM3I…

Ah yes, I've read that thread. Interesting. I haven't tested it out
yet, but will do so.

...

 There are a number of config values with significantly too large defaults, this is one of
them. Another one is mon_sync_max_payload_size. 
quite a few people have run into issues with that settig. We haven't had
any issues with it yet. But perhaps I should downscale it as well.

...

  Do you know what causes the slow ops? 
 I don't care about slow ops under high load, these are to be expected. I worry about
"slow ping times". These are not expected and are almost certainly caused by
congestion of a link. 
Yeah sure, I would suspect that as well. Or "discards" from a switch
because of errors, but those are less likely.

...

  I don't quite get the 10G bottleneck. Sure, a
client can saturate a 10
 Gb/s link, but how does this affect storage <-> storage (replication)
 traffic and / or other clients? 
 Because it all happens on the same physical link. We don't have a dedicated
replication network. Its all mixed on the same hardware. If a 10G link is saturated,
nothing moves any more through this particular link and the clients are so superior in
capacity that they can easily starve parts of the internal ceph traffic in this way.

 Basically, we started out with a dedicated replication VLAN and decided to merge this
with the access VLAN for simplicity of the set-up. Our networking is currently equivalent
to having a single network only. Here the interfaces:

 ceph0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP
group default qlen 1000

 ceph0.81@ceph0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP
group default qlen 1000

 ceph0.82@ceph0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP
group default qlen 1000

 $ ethtool ceph0
 Settings for ceph0:
          Supported ports: [ ]
          Supported link modes:   Not reported
          Supported pause frame use: No
          Supports auto-negotiation: No
          Supported FEC modes: Not reported
          Advertised link modes:  Not reported
          Advertised pause frame use: No
          Advertised auto-negotiation: No
          Advertised FEC modes: Not reported
          Speed: 60000Mb/s
          Duplex: Full
          Port: Other
          PHYAD: 0
          Transceiver: internal
          Auto-negotiation: off
 Cannot get wake-on-lan settings: Operation not permitted
          Link detected: yes

 the bond is 6x10G active-active, VLAN 81 is the access VLAN and 82 is the replication
network. Goes all over the same lines. This config is very convenient for maintenance, but
seems to suffer from not physically reserving bandwidth to VLAN 82. Maybe such a bandwidth
reservation QOS definition could already help? 
Are these Dell / EMC switches? You might be able to give priority on a
VLAN level, or "shape" bandwith based on VLANs. I know that Aristas
(that we use) have support for that in newish firmware. You might also
want to support "pause" frames (ethernet flow control) as that might
help during congestion (a back off protocol), see:
https://en.wikipedia.org/wiki/Ethernet_flow_control

Just to note: we don't have a separate replication network / interfaces.
We only have one network. Wido (den Hollander) and myself as well, don't
see any added benefit of a separate network. You only waste bandwith if
you split them up. And make debugging more complex in certain failure
scenarios. Do you know what hashing is in use for the LACP port-channel?
You want to use mac, ip and port (5-tuple). We use OpenvSwitch (OVS) a
lot, and with OVS you can balance the load between the LACP links (by
default it evaluates every 10 seconds if it should move flows around).

I doubt there is a silver bullet, but hey, you never know. Do change one
thing at a time, otherwise it will be hard to know what the effect is of
each of the changes (they might even cancel each other out).

...

 I will provide a sketch of the set-up, I think this will make things more clear. I
don't think we have an aggregated bandwidth problem, I believe what we have is a load
distribution/priority problem over physical link members in the aggregation group
"ceph0" on the storage servers. 
Yes, your issue makes more sense to me now. Do you have any metrics from
the load on the individual links? Even bmon might be a useful tool. You
might want to capture metrics (like every second or so) to detect
"bursts" of traffic that might cause issues. Just to make sure you are
on the right track. We use telegraf as metric collecting agent sending
them to influxdb, but there are many more options.

And then there are also other things to tune: tcp checksum offload et
al. You might also hit IRQ balance issue, and there are also ways to
overcome those. Are those single CPU systems? And / or AMD? NUMA might
be a thing as well, and ideally you have the Ceph OSD daemons pinned to
the CPU with the network / storage adapters connected.

Finally, this might be of use:
http://www.brendangregg.com/usemethod.html ;-).

Gr. Stefan

Re: Network design issues