I am seeing very few such error messages in the mon logs (about a couple
per day).
If I issue the command "ceph daemon osd.$id dump_osd_network" on every OSD
with the default 1000 ms threshold, I don't see any entries.
I guess this is because that command considers only the last (15 ?) minutes.
Am I supposed to be able to see in some log file which OSDs are the problematic ones?
Thanks, Massimo
On Thu, Jan 30, 2020 at 11:13 AM Stefan Kooman <stefan(a)bit.nl> wrote:
Hi,
Quoting Massimo Sgaravatto (massimo.sgaravatto(a)gmail.com):
Thanks for your answer
MON-MGR hosts have a mgmt network and a public network.
OSD nodes instead have a mgmt network, a public network, and a cluster
network.
This is what I have in ceph.conf:
public network = 192.168.61.0/24
cluster network = 192.168.222.0/24
public and cluster networks are 10 Gbps networks (actually there is a
single 10 Gbps NIC on each node used for both the public and the cluster
networks).
In that case there is no advantage to using a separate cluster network,
as it would only be beneficial if replication data between OSDs travelled
over a separate interface. Is the cluster heavily loaded? Do you have
metrics on bandwidth usage / switch port statistics? If you have many
"discards" (and/or errors), this might impact the ping times as well.
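For checking those counters on a node, something like the following sketch works (the interface name eth0 is an assumption here; substitute your 10 Gbps NIC):

```shell
# Kernel-level per-interface counters (RX/TX errors and dropped packets):
ip -s link show dev eth0

# NIC/driver-level counters, which often include discard counts
# (the exact counter names vary per driver):
ethtool -S eth0 | grep -Ei 'err|drop|disc'
```

The switch-side port statistics are worth checking too, since discards can happen on either end of the link.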
The mgmt network is a 1 Gbps network, but this
one shouldn't be used for
such pings among the OSDs ...
I doubt Ceph will use the mgmt network, but I'm not sure whether it's
doing a lookup on the hostname (which might resolve to the mgmt network in
your case) or using the IPs configured for Ceph.
You can dump the OSD network info per OSD on the storage nodes themselves
with this command:
ceph daemon osd.$id dump_osd_network
You would have to do that for every OSD and see which ones report
"entries".
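A minimal sketch for looping over all OSDs on one node, assuming the default admin-socket layout under /var/run/ceph (adjust the path if your cluster uses a non-default $cluster name or run directory):

```shell
# Dump slow-ping info for every OSD whose admin socket is on this node.
for sock in /var/run/ceph/ceph-osd.*.asok; do
  [ -e "$sock" ] || continue        # skip if no OSDs run on this host
  id=$(basename "$sock" .asok)      # e.g. "ceph-osd.12"
  id=${id#ceph-osd.}                # strip the prefix -> "12"
  echo "=== osd.$id ==="
  ceph daemon "osd.$id" dump_osd_network
done
```

Run on each storage node in turn; any OSD that prints non-empty "entries" is one reporting slow ping times.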
Gr. Stefan
--
| BIT BV
https://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / info(a)bit.nl