After upgrading my Ceph cluster from Luminous to Nautilus 14.2.6,
"ceph health detail" occasionally complains about "Long heartbeat
ping times on front/back interface seen".
As far as I understand (after reading
https://docs.ceph.com/docs/nautilus/rados/operations/monitoring/), this
means that a heartbeat ping from one OSD to another exceeded 1 s.
I have some questions about these network performance checks:
1) What exactly is meant by the front and back interfaces?
2) I can see the involved OSDs only in the output of "ceph health detail"
(while the problem is present), but I can't find this information in the
log files. In the mon log I can only see messages such as:
2020-01-28 11:14:07.641 7f618e644700 0 log_channel(cluster) log [WRN] :
Health check failed: Long heartbeat ping times on back interface seen,
longest is 1416.618 msec (OSD_SLOW_PING_TIME_BACK)
but the involved OSDs are not reported there.
Do I just need to increase the verbosity of the mon log?
3) Is 1 s a reasonable value for this threshold? How can this value be
changed, and what is the relevant configuration variable?
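For what it's worth, my current guess from the Nautilus config reference
(option names unverified on my clusters) is that the 1 s default comes from
mon_warn_on_slow_ping_ratio (default 0.05) times osd_heartbeat_grace
(default 20 s), with mon_warn_on_slow_ping_time (in milliseconds) acting as
an absolute override when set:

```shell
# Assumed Nautilus defaults (taken from the config reference, not verified):
#   osd_heartbeat_grace         = 20 s
#   mon_warn_on_slow_ping_ratio = 0.05
# Their product would give exactly the 1 s (1000 ms) warning threshold:
awk 'BEGIN { print 20 * 0.05 * 1000 }'   # prints 1000
```

If that reading is right, the threshold could presumably be raised either by
setting mon_warn_on_slow_ping_time to a value in milliseconds or by changing
the ratio, but I'd appreciate confirmation.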
4) https://docs.ceph.com/docs/nautilus/rados/operations/monitoring/
suggests using the dump_osd_network command. I think there is an error on
that page: it says the command should be issued against ceph-mgr.x.asok,
while I think ceph-osd.x.asok should be used instead.
I have another Ceph cluster (also running Nautilus 14.2.6) where there are
no OSD_SLOW_PING_* messages in the mon logs, but:
ceph daemon /var/run/ceph/ceph-osd..asok dump_osd_network 1
reports a lot of entries (i.e., pings that exceeded 1 s). How can this be
explained?
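Rereading the docs while writing this, I now suspect the trailing argument
is a threshold in milliseconds rather than seconds, which would explain the
many entries when passing 1. If so, matching the 1 s health-check threshold
should presumably look like this (OSD id 0 is just an example):

```shell
# Hypothetical OSD id 0; the trailing argument is (I believe) the threshold
# in milliseconds, so 1000 corresponds to the 1 s health-check threshold,
# while "dump_osd_network 1" would list every ping slower than 1 ms.
ceph daemon /var/run/ceph/ceph-osd.0.asok dump_osd_network 1000
```

Can someone confirm that the unit of that argument is milliseconds?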
Thanks, Massimo