Dear cephers,
I believe we are facing a bottleneck due to an inappropriate overall network design and
would like to hear about your experience and recommendations. I'll start with a description
of the urgent problem/question and follow up with more details/questions.
These observations are on our HPC home file system, which is served by ceph. It has 12
storage servers facing 550+ client servers.
Under high load, I start seeing "slow ping time" warnings with quite incredible
latencies, and I suspect we have a network bottleneck. The storage servers have 6x10G
LACP trunks; clients are on single 10G NICs. We have separate VLANs for the front and back
networks, but both go through all NICs in the same way, so, technically, it's just one
cluster network shared with clients. The aggregated bandwidth is sufficient for a single
storage server's load (it roughly matches the disk controllers' IO capacity).
However, point-to-point connections are limited to 10G, and I believe we are starting to
see clients saturate a 10G link and starve all other ceph cluster traffic that needs to go
through that link as well. This, in turn, leads to backlog effects with slow ops on
unrelated OSDs, affecting the overall user experience. The number of OSDs reporting slow
ping times is about the percentage one would expect if one or two 10G links are congested.
It's usually just one storage server that chokes.
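As a rough sanity check on that percentage: if OSDs (and hence heartbeat peers) are spread uniformly over the servers, the fraction of OSD-to-OSD heartbeat pairs that touch a congested server can be estimated. This is a sketch based on our 12-server setup, not a measurement:

```python
def affected_pair_fraction(n_servers: int, n_congested: int) -> float:
    """Fraction of OSD-to-OSD heartbeat pairs with at least one endpoint
    on a congested server, assuming OSDs are spread uniformly."""
    ok = (n_servers - n_congested) / n_servers
    return 1 - ok * ok

# With 12 storage servers: one congested server touches ~16% of
# heartbeat pairs, two congested servers touch ~31%.
print(affected_pair_fraction(12, 1))  # ~0.16
print(affected_pair_fraction(12, 2))  # ~0.31
```

That ballpark matches the share of OSDs I see in the slow-ping warnings.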
I guess the users with aggressive workloads getting the full bandwidth are happy, but
everyone else is complaining. What I observe is that one or two clients can DOS everyone
else. I typically see very high read bandwidth from only a few OSDs, and my suspicion is
that this is a large job of 50-100 nodes starting the same application at the same time;
for example, 50-100 clients reading the same executable simultaneously. I see 5-6 GB/s and
up to 10K read IOP/s, which is really good in principle. Except that it is not fairly
shared with other users.
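A quick unit conversion (ignoring protocol overhead) shows why this load sits right at the edge of a server's trunk capacity:

```python
def gbytes_to_gbits(gb_per_s: float) -> float:
    # 1 GB/s = 8 Gbit/s; protocol overhead ignored
    return gb_per_s * 8.0

# 5-6 GB/s aggregate read is 40-48 Gbit/s, close to the 60 Gbit/s a
# 6x10G LACP trunk offers in total, while any individual client flow
# remains capped at 10 Gbit/s by its NIC and the LACP hashing.
print(gbytes_to_gbits(5.0))  # 40.0
print(gbytes_to_gbits(6.0))  # 48.0
```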
Question: I am starting to consider enabling QoS on the switches for traffic between
storage servers and would like to know if anyone is doing this and what the experience is.
Unfortunately, our network design is probably flawed and makes this difficult now; see
below.
More Info.
Our FS data pool is EC 8+2 and I have fast_read enabled. Hence, the network traffic
amplification for both read and write is quite substantial.
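Back-of-the-envelope for that amplification, as a sketch assuming fast_read makes the primary request all k+m chunks and a write pushes one chunk of size 1/k to each non-primary shard (recovery and journal traffic not counted):

```python
def ec_read_amplification(k: int, m: int, fast_read: bool) -> float:
    """Backend bytes read per byte returned to the client. With
    fast_read the primary requests all k+m chunks and reconstructs
    from the first k that arrive."""
    chunks = k + m if fast_read else k
    return chunks / k

def ec_write_amplification(k: int, m: int) -> float:
    """Backend bytes sent by the primary per byte written by the
    client: k+m chunks of size 1/k each, one of which stays local."""
    return (k + m - 1) / k

# EC 8+2: reads pull 1.25x over the cluster network with fast_read,
# writes push 1.125x from the primary to its peers.
print(ec_read_amplification(8, 2, True))   # 1.25
print(ec_write_amplification(8, 2))        # 1.125
```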
Our network is a spine-leaf architecture where ceph servers and ceph clients are
distributed more or less equally over the leaf switches. I'm afraid this is the
first flaw in the design, because storage servers and clients compete for the same
switches and the clients greatly outnumber the storage servers. It also makes implementing
QoS a real pain, whereas it could be just traffic shaping on an uplink trunk to the
clients if the storage servers were isolated.
This is the first design question: an isolated storage cluster providing service via
uplinks/gateways versus an "integrated/hyper-converged" design where storage servers and
clients are distributed equally over a spine-leaf architecture. Pros and cons?
We have a 100G spine VLT pair with ports configured as 40G. Uplinks from the leafs are
2x40G; in fact, we have the leafs configured as VLT pairs for HA as well. A pair has
2x2x40G uplinks and 2x40G VLT interlinks. There are 2 ceph servers per VLT leaf-pair and
about 85 client servers on the same pair. There are also clients on leaf switches without
ceph servers. I don't think the 40G uplinks are congested, but you never know.
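For what it's worth, a rough oversubscription estimate for such a leaf-pair, assuming the worst case of all clients sending at line rate simultaneously (not our typical load):

```python
def oversubscription(n_clients: int, client_gbit: float,
                     uplink_gbit: float) -> float:
    """Ratio of worst-case leaf ingress (all clients at line rate)
    to uplink capacity toward the spine."""
    return (n_clients * client_gbit) / uplink_gbit

# ~85 clients x 10G behind 2x2x40G = 160G of uplinks:
# roughly 5.3:1 oversubscription per VLT leaf-pair.
print(oversubscription(85, 10.0, 160.0))  # 5.3125
```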
We started with the ceph servers each having 15 HDDs for fs data and 1 SSD for fs
meta-data. With this configuration, disk speed was the bottleneck and I observed slow ops
under high load, but everything was more or less stable. I recently changed an MDS setting
that greatly improved both client performance and the clients' ability to overload OSDs.
In addition, one week ago I added 20 HDDs in a JBOD per host, which more than doubled the
HDD throughput. Together, these two performance increases now have the counterintuitive
effect that aggregated performance has tripled compared with 2 months ago, but the user
experience is very erratic. My suspicion is, as explained above, that each server can now
handle a volume of traffic that easily saturates a 10G link, leading to observations that
seem to indicate insufficient network capacity whenever too many client/cluster requests
go through the same 10G link.
In essence, we increased aggregated performance greatly but users complain more than
ever.
I suspect that this imbalance between server throughput and the 10G point-to-point
limitation is the problem. However, I cannot change the networking and would like some
advice on how similar set-ups are configured and whether QoS can help. My idea is to
enable dot1p layer-2 QoS and give traffic arriving on ports with storage servers attached
higher priority than traffic coming from everywhere else. I know it would be a lot simpler
if the storage cluster were isolated, but I have to deal with the situation as is for now.
Any advice and experience is highly appreciated.
If I do it, should I apply QoS to both the front and back networks, or is QoS on the
back-network VLAN enough? Note that the MONs are only on the front network.
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14