[ceph-users] Re: Network performance checks

30 Jan 2020

Quoting Massimo Sgaravatto (massimo.sgaravatto(a)gmail.com):
...
 After having upgraded my ceph cluster from Luminous to Nautilus 14.2.6,
 from time to time "ceph health detail" complains about some "Long heartbeat
 ping times on front/back interface seen".

 As far as I can understand (after having read
 https://docs.ceph.com/docs/nautilus/rados/operations/monitoring/), this
 means that the ping from one OSD to another one exceeded 1 s.

 I have some questions on these network performance checks:

 1) What exactly is meant by front and back interface?

Do you have a "public" and a "cluster" network? I would expect that the
"back" interface is a "cluster" network interface.

...
 2) I can see the involved OSDs only in the output of "ceph health detail"
 (when there is the problem), but I can't find this information in the log
 files. In the mon log file I can only see messages such as:

 2020-01-28 11:14:07.641 7f618e644700  0 log_channel(cluster) log [WRN] :
 Health check failed: Long heartbeat ping times on back interface seen,
 longest is 1416.618 msec (OSD_SLOW_PING_TIME_BACK)

 but the involved OSDs are not reported in this log.
 Do I just need to increase the verbosity of the mon log ?

 3) Is 1 s a reasonable value for this threshold? How could this value be
 changed? What is the relevant configuration variable?

Not sure how much priority Ceph gives to this ping check. But if you're
on a 10 Gb/s network I would start complaining when things take longer
than 1 ms. A ping should not take much longer than 0.05 ms, so if it
takes an order of magnitude longer than expected, latency is not
optimal.

For Gigabit networks I would bump the above values by an order of magnitude.
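
As a rough illustration of the rule of thumb above, a minimal Python sketch
(not anything Ceph ships) could pull the reported latency out of a mon log
line like the one quoted earlier and compare it with a per-link-speed limit.
The regex is derived only from that single sample line, and the names
parse_slow_ping, exceeds and WARN_MSEC are made up for this example; the
1 ms and 10 ms limits are just the values suggested above for 10 Gb/s and
Gigabit networks.

```python
import re

# Matches the health-check message seen in the mon log sample above.
# Assumption: the format is inferred from that one sample line only.
SLOW_PING_RE = re.compile(
    r"Long heartbeat ping times on (?P<iface>front|back) interface seen, "
    r"longest is (?P<msec>\d+(?:\.\d+)?) msec"
)

# Hypothetical limits in milliseconds, following the rule of thumb:
# 1 ms on 10 Gb/s, an order of magnitude more on Gigabit.
WARN_MSEC = {"10G": 1.0, "1G": 10.0}


def parse_slow_ping(line):
    """Return (interface, latency in msec), or None if the line does not match."""
    match = SLOW_PING_RE.search(line)
    if match is None:
        return None
    return match.group("iface"), float(match.group("msec"))


def exceeds(msec, link="10G"):
    """True if the latency is above the rule-of-thumb limit for this link speed."""
    return msec > WARN_MSEC[link]


# The sample message from the mon log, joined onto one line.
sample = (
    "2020-01-28 11:14:07.641 7f618e644700  0 log_channel(cluster) log [WRN] : "
    "Health check failed: Long heartbeat ping times on back interface seen, "
    "longest is 1416.618 msec (OSD_SLOW_PING_TIME_BACK)"
)

parsed = parse_slow_ping(sample)
if parsed is not None:
    iface, msec = parsed
    print(iface, msec, exceeds(msec, "10G"))
```

Note that this only recovers the interface and the longest latency; the log
line itself does not name the OSDs involved.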

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info(a)bit.nl
