On May 29, 2020, at 10:53 AM, Dave Hall <kdhall@binghamton.edu> wrote:
I agree with Paul 100%. Going further: there are many more 'knobs to turn'
than just Jumbo Frames, which makes the problem even harder. Changing any
one setting may just move the bottleneck, or possibly introduce
instabilities. In the worst case, one might tune a Linux system so well
that it overruns the switch it's connected to. Then we have to add more
knobs in the switch and see what we can do there, or de-tune Linux to make
it play nice with the switch.
Just to be sure, I will add a disclaimer at the top of my document to emphasize
before/after benchmarking.
-Dave
Dave Hall
Binghamton University
kdhall@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)
On 5/29/2020 6:29 AM, Paul Emmerich wrote:
Please do not apply any optimization without benchmarking *before* and
*after* in a somewhat realistic scenario.
No, iperf is likely not a realistic setup, because it will usually be
limited by available network bandwidth, which is (or should be) rarely
maxed out on your actual Ceph setup.
Paul
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at
https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
On Fri, May 29, 2020 at 2:15 AM Dave Hall <kdhall@binghamton.edu> wrote:
Hello.
A few days ago I offered to share the notes I've compiled on network
tuning. Right now it's a Google Doc:
https://docs.google.com/document/d/1nB5fzIeSgQF0ti_WN-tXhXAlDh8_f8XF9GhU7J1…
I've set it up to allow comments, and I'd be glad for questions and
feedback. If Google Docs is not an acceptable format, I'll try to put it
up somewhere as HTML or a wiki. Disclosure: some sections were copied
verbatim from other sources.
Regarding the current discussion about iperf, the likely bottleneck is
buffering. There is a per-NIC output queue set with 'ip link' and a
per-CPU-core input queue set with 'sysctl'. Both should be set to some
multiple of the frame size, based on calculations involving link speed
and latency (i.e., the bandwidth-delay product). Jumping from 1500 to
9000 could negatively impact performance because one buffer or the other
might be 1500 bytes short of a low multiple of 9000.
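
To illustrate, a rough sketch of the two knobs I mean (the interface name
and sizes are made-up examples, not recommendations; size against your
own link speed and measured latency):

    # Bandwidth-delay product for a hypothetical 10 Gbit/s link with
    # 0.5 ms RTT: 10^10 bit/s * 0.0005 s / 8 = ~625 KB in flight.

    # Per-NIC output queue (in packets):
    ip link set dev eth0 txqueuelen 10000

    # Per-CPU-core input queue (in packets):
    sysctl -w net.core.netdev_max_backlog=10000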
It would be interesting to see the iperf tests repeated with
corresponding buffer sizing. I will perform this experiment as soon as I
complete some day-job tasks.
-Dave
Dave Hall
Binghamton University
kdhall@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)
On 5/27/2020 6:51 AM, EDH - Manuel Rios wrote:
Can anyone share their table with other MTU values?
I'm also interested in the switch CPU load.
KR,
Manuel
-----Original Message-----
From: Marc Roos <M.Roos@f1-outsourcing.eu>
Sent: Wednesday, 27 May 2020 12:01
To: chris.palmer <chris.palmer@pobox.com>; paul.emmerich
<paul.emmerich@croit.io>
Cc: amudhan83 <amudhan83@gmail.com>; anthony.datri
<anthony.datri@gmail.com>; ceph-users <ceph-users@ceph.io>; doustar
<doustar@rayanexon.ir>; kdhall <kdhall@binghamton.edu>; sstkadu
<sstkadu@gmail.com>
Subject: [ceph-users] Re: [External Email] Re: Ceph Nautius not
working after setting MTU 9000
Interesting table. I get this on a production 10 Gbit cluster at a
datacenter (obviously not doing that much).
[@]# iperf3 -c 10.0.0.13 -P 1 -M 9000
Connecting to host 10.0.0.13, port 5201
[ 4] local 10.0.0.14 port 52788 connected to 10.0.0.13 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 1.14 GBytes 9.77 Gbits/sec 0 690 KBytes
[ 4] 1.00-2.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.08 MBytes
[ 4] 2.00-3.00 sec 1.15 GBytes 9.88 Gbits/sec 0 1.08 MBytes
[ 4] 3.00-4.00 sec 1.15 GBytes 9.88 Gbits/sec 0 1.08 MBytes
[ 4] 4.00-5.00 sec 1.15 GBytes 9.88 Gbits/sec 0 1.08 MBytes
[ 4] 5.00-6.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.21 MBytes
[ 4] 6.00-7.00 sec 1.15 GBytes 9.89 Gbits/sec 0 1.21 MBytes
[ 4] 7.00-8.00 sec 1.15 GBytes 9.88 Gbits/sec 0 1.21 MBytes
[ 4] 8.00-9.00 sec 1.15 GBytes 9.89 Gbits/sec 0 1.21 MBytes
[ 4] 9.00-10.00 sec 1.15 GBytes 9.89 Gbits/sec 0 1.21 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  11.5 GBytes  9.87 Gbits/sec    0    sender
[  4]   0.00-10.00  sec  11.5 GBytes  9.87 Gbits/sec         receiver
-----Original Message-----
Subject: Re: [ceph-users] Re: [External Email] Re: Ceph Nautius not
working after setting MTU 9000
To elaborate on some aspects that have been mentioned already, and to add
some others:
* Test using iperf3.
* Don't try to use jumbos on networks where you don't have complete
control over every host. This usually includes the main Ceph network.
It's just too much grief. You can consider using it for limited-access
networks (e.g. the Ceph cluster network, hypervisor migration network,
etc.) where you know every switch & host is tuned correctly. (This works
even when those nets share a VLAN trunk with non-jumbo VLANs: just set
the max value on the trunk itself, and individual values on each VLAN.)
* If you are pinging, make sure it doesn't fragment, otherwise you will
get misleading results. Note that the size argument is the ICMP payload,
not the MTU, so for MTU 9000 use: ping -M do -s 8972 x.x.x.x
(8972 = 9000 minus the 20-byte IP header and 8-byte ICMP header).
* Do not assume that 9000 is the best value. It depends on your NICs,
your switch, kernel/device parameters, etc. Try different values (using
iperf3). As an example, the results below were obtained with a small,
cheap MikroTik 10G switch and HPE 10G NICs. They highlight how in this
configuration 9000 is worse than 1500, and that 5139 is optimal yet 5140
is the worst. The same pattern (obviously with different values) was
apparent when multiple tests were run concurrently. Always test your own
network in a controlled manner; a sketch of such a sweep follows the
table below. And of course if you introduce anything different later on,
test again. With enterprise-grade kit this might not be so common, but
always test if you fiddle.
MTU   Gbps (actual data transfer values using iperf3; one particular
configuration only)
9600 8.91 (max value)
9000 8.91
8000 8.91
7000 8.91
6000 8.91
5500 8.17
5200 7.71
5150 7.64
5140 7.62
5139 9.81 (optimal)
5138 9.81
5137 9.81
5135 9.81
5130 9.81
5120 9.81
5100 9.81
5000 9.81
4000 9.76
3000 9.68
2000 9.28
1500 9.37 (default)
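
For anyone wanting to reproduce this kind of sweep, a rough sketch (the
interface, peer address, and MTU list are placeholders; start 'iperf3 -s'
on the peer first, and make sure the switch and peer accept each MTU
being tested):

    #!/bin/sh
    IFACE=eth0
    PEER=10.0.0.13
    for mtu in 1500 3000 5000 5139 5140 9000; do
        ip link set dev "$IFACE" mtu "$mtu"
        sleep 2                    # let the link settle
        printf 'MTU %s: ' "$mtu"
        # keep just the receiver-side average throughput
        iperf3 -c "$PEER" -t 10 -P 1 | awk '/receiver/ {print $7, $8}'
    done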
Whether any of this will make a tangible difference for Ceph is moot. I
just spend a little time getting the network stack correct as above, then
leave it. That way I know I am probably getting some benefit, and not
doing any harm. If you blindly change things you may well do harm that
can manifest itself in all sorts of ways outside of Ceph. Getting some
test results for this using Ceph will be easy; getting MEANINGFUL results
that way will be hard.
Chris
On 27/05/2020 09:25, Marc Roos wrote:
I would not call a Ceph page a random tuning tip. At least I hope they
are not. NVMe-only with 100 Gbit is not really a standard setup. I assume
with such a setup you have the luxury of not noticing many optimizations.
What I mostly read is that changing to MTU 9000 will allow you to better
saturate the 10 Gbit adapter, and I expect this to show on a low-end busy
cluster. Don't you have any test results of such a setup?
-----Original Message-----
Subject: Re: [ceph-users] Re: [External Email] Re: Ceph Nautius not
working after setting MTU 9000
Don't optimize stuff without benchmarking *before and after*, and don't
apply random tuning tips from the Internet without benchmarking them.
My experience with Jumbo frames: 3% performance gain, on an NVMe-only
setup with a 100 Gbit/s network.
Paul
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at
https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
On Tue, May 26, 2020 at 7:02 PM Marc Roos <M.Roos@f1-outsourcing.eu>
wrote:
Look what I have found!!! :)
https://ceph.com/geen-categorie/ceph-loves-jumbo-frames/
-----Original Message-----
From: Anthony D'Atri [mailto:anthony.datri@gmail.com]
Sent: Monday, 25 May 2020 22:12
To: Marc Roos
Cc: kdhall; martin.verges; sstkadu; amudhan83; ceph-users; doustar
Subject: Re: [ceph-users] Re: [External Email] Re: Ceph Nautius not
working after setting MTU 9000
Quick and easy depends on your network infrastructure. Sometimes it is
difficult or impossible to retrofit a live cluster without disruption.
On May 25, 2020, at 1:03 AM, Marc Roos <M.Roos@f1-outsourcing.eu> wrote:
>
> I am interested. I am always setting MTU to 9000. To be honest I cannot
> imagine there is no optimization, since you have fewer interrupt
> requests and you can move x times as much data per packet. Every time
> something is written about optimizing, the first thing mentioned is
> changing to MTU 9000, because it is a quick and easy win.
>
> -----Original Message-----
> From: Dave Hall [mailto:kdhall@binghamton.edu]
> Sent: Monday, 25 May 2020 5:11
> To: Martin Verges; Suresh Rama
> Cc: Amudhan P; Khodayar Doustar; ceph-users
> Subject: [ceph-users] Re: [External Email] Re: Ceph Nautius not
> working after setting MTU 9000
>
> All,
>
> Regarding Martin's observations about Jumbo Frames....
>
> I have recently been gathering some notes from various internet
> sources regarding Linux network performance, and Linux performance in
> general, to be applied to a Ceph cluster I manage but also to the rest
> of the Linux server farm I'm responsible for.
>
> In short, enabling Jumbo Frames without also tuning a number of other
> kernel and NIC attributes will not provide the performance increases
> we'd like to see. I have not yet had a chance to go through the rest
> of the testing I'd like to do, but I can confirm (via iperf3) that
> enabling Jumbo Frames alone didn't make a significant difference.
>
> Some of the other attributes I'm referring to are incoming and
> outgoing buffer sizes at the NIC, IP, and TCP levels, interrupt
> coalescing, NIC offload functions that should or shouldn't be turned
> on, packet queuing disciplines (tc), the best choice of TCP slow-start
> algorithms, and other TCP features and attributes.
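>
> For concreteness, the kind of commands I mean; a sketch with
> illustrative values only, not recommendations - check your own NIC and
> kernel documentation before applying any of these:
>
>     # Socket buffer ceilings (bytes):
>     sysctl -w net.core.rmem_max=67108864
>     sysctl -w net.core.wmem_max=67108864
>     # TCP receive autotuning min/default/max (bytes):
>     sysctl -w net.ipv4.tcp_rmem="4096 87380 67108864"
>     # Interrupt coalescing and offloads:
>     ethtool -C eth0 rx-usecs 50
>     ethtool -K eth0 gro on gso on tso on
>     # Queueing discipline and congestion control:
>     sysctl -w net.core.default_qdisc=fq
>     sysctl -w net.ipv4.tcp_congestion_control=bbr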
>
> The most off-beat item I saw was something about adding IPTABLES
> rules to bypass CONNTRACK table lookups.
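>
> The conntrack bypass would look something like this (raw table,
> NOTRACK target; the port range shown is Ceph's default OSD range -
> verify for your own cluster before use):
>
>     iptables -t raw -A PREROUTING -p tcp --dport 6800:7300 -j NOTRACK
>     iptables -t raw -A OUTPUT -p tcp --sport 6800:7300 -j NOTRACK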
>
> In order to do anything meaningful to assess the effect of all of
> these settings I'd like to figure out how to set them all via Ansible
> - so more to learn before I can give opinions.
>
> --> If anybody has added this type of configuration to Ceph Ansible,
> I'd be glad for some pointers.
>
> I have started to compile a document containing my notes. It's rough,
> but I'd be glad to share if anybody is interested.
>
> -Dave
>
> Dave Hall
> Binghamton University
>
>> On 5/24/2020 12:29 PM, Martin Verges wrote:
>>
>> Just save yourself the trouble. You won't have any real benefit from
>> MTU 9000. It has some smallish gains, but it is not worth the effort,
>> problems, and loss of reliability for most environments.
>> Try it yourself and do some benchmarks, especially with your regular
>> workload on the cluster (not the maximum peak performance), then drop
>> the MTU to default ;).
>>
>> Please, if anyone has other real-world benchmarks showing huge
>> differences in regular Ceph clusters, feel free to post them here.
>>
>> --
>> Martin Verges
>> Managing director
>>
>> Mobile: +49 174 9335695
>> E-Mail: martin.verges@croit.io
>> Chat: https://t.me/MartinVerges
>>
>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>> CEO: Martin Verges - VAT-ID: DE310638492
>> Com. register: Amtsgericht Munich HRB 231263
>>
>> Web: https://croit.io
>> YouTube: https://goo.gl/PGE1Bx
>>
>>
>>> On Sun, May 24, 2020 at 15:54, Suresh Rama <sstkadu@gmail.com>
>>> wrote:
>>
>>> Ping with 9000 MTU won't get a response, as I said; it should be
>>> 8972. Glad it is working, but you should know what happened to
>>> avoid this issue later.
>>
>>> On Sun, May 24, 2020, 3:04 AM Amudhan P <amudhan83@gmail.com>
>>> wrote:
>>>
>>>> No, ping with MTU size 9000 didn't work.
>>>>
>>>> On Sun, May 24, 2020 at 12:26 PM Khodayar Doustar
>>>> <doustar@rayanexon.ir> wrote:
>>>
>>>> Does your ping work or not?
>>>>
>>>>
>>>> On Sun, May 24, 2020 at 6:53 AM Amudhan P <amudhan83@gmail.com>
>>>> wrote:
>>>>>
>>>>>> Yes, I have set the setting on the switch side also.
>>>>>>
>>>>>> On Sat, 23 May 2020, 6:47 PM Khodayar Doustar
>>>>>> <doustar@rayanexon.ir> wrote:
>>>>>>
>>>>>>> The problem should be with the network. When you change MTU it
>>>>>>> should be changed all over the network; every single hop on your
>>>>>>> network should speak and accept 9000 MTU packets. You can check
>>>>>>> it on your hosts with the "ifconfig" command, and there are also
>>>>>>> equivalent commands for other network/security devices.
>>>>>>>
>>>>>>> If you have just one node which is not correctly configured for
>>>>>>> MTU 9000, it won't work.
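>>>>>>>
>>>>>>> For example, a quick check on each host (the interface name and
>>>>>>> next-hop address are placeholders):
>>>>>>>
>>>>>>>     ip link show dev eth0 | grep mtu
>>>>>>>     ping -M do -s 8972 <next-hop>   # 8972 = 9000 - 28 header bytes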
>>>>>>> On Sat, May 23, 2020 at 2:30 PM sinan@turka.nl
>>>>>>> <sinan@turka.nl> wrote:
>>>>>>>> Can the servers/nodes ping each other using large packet
>>>>>>>> sizes? I guess not.
>>>>>>>
>>>>>>> Sinan Polat
>>>>>>>
>>>>>>>> On 23 May 2020 at 14:21, Amudhan P <amudhan83@gmail.com>
>>>>>>>> wrote:
>>>>>>>>> In the OSD logs: "heartbeat_check: no reply from OSD"
>>>>>>>>>
>>>>>>>>>> On Sat, May 23, 2020 at 5:44 PM Amudhan P
>>>>>>>>>> <amudhan83@gmail.com> wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I have set the network switch with MTU size 9000 and also in
>>>>>>>>>> my netplan configuration.
>>>>>>>>>
>>>>>>>>> What else needs to be checked?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Sat, May 23, 2020 at 3:39 PM Wido den Hollander
>>>>>>>>>> <wido@42on.com> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On 5/23/20 12:02 PM, Amudhan P wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I am using Ceph Nautilus on Ubuntu 18.04, working fine with
>>>>>>>>>>> MTU size 1500 (default); recently I tried to update the MTU
>>>>>>>>>>> size to 9000.
>>>>>>>>>>> After setting Jumbo frames, running "ceph -s" is timing out.
>>>>>>>>>> Ceph can run just fine with an MTU of 9000. But there is
>>>>>>>>>> probably something else wrong on the network which is causing
>>>>>>>>>> this.
>>>>>>>>>>
>>>>>>>>>> Check the Jumbo Frames settings on all the switches as well
>>>>>>>>>> to make sure they forward all the packets.
>>>>>>>>>>>
>>>>>>>>>>> This is definitely not a Ceph issue.
>>>>>>>>>>>
>>>>>>>>>>> Wido
>>>>>>>>>>>
>>>>>>>>>>>> regards
>>>>>>>>>>>> Amudhan P
>>>>>>>>>>>>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io