I am seeing very few such error messages in the mon logs (about a couple
per day).
If I issue the command "ceph daemon osd.$id dump_osd_network" on every OSD
with the default 1000 ms threshold, I don't see any entries.
I guess this is because that command considers only the last (15 ?) minutes.
Am I supposed to be able to see in some log file which OSDs are the problematic ones?
Thanks, Massimo
On Thu, Jan 30, 2020 at 11:13 AM Stefan Kooman <stefan(a)bit.nl> wrote:
Hi,
Quoting Massimo Sgaravatto (massimo.sgaravatto(a)gmail.com):
Thanks for your answer
MON-MGR hosts have a mgmt network and a public network.
OSD nodes instead have a mgmt network, a public network, and a cluster
network.
This is what I have in ceph.conf:
public network = 192.168.61.0/24
cluster network = 192.168.222.0/24
public and cluster networks are 10 Gbps networks (actually there is a
single 10 Gbps NIC on each node used for both the public and the cluster
networks).
In that case there is no advantage to using a separate cluster network,
as it would only be beneficial if replication data between OSDs travelled
over a separate interface. Is the cluster heavily loaded? Do you have
metrics on bandwidth usage / switch port statistics? If you have many
"discards" (and/or errors), this might impact the ping times as well.
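For checking those counters on a node, something like the following sketch works (the interface name eth0 is an assumption here; substitute your 10 Gbps NIC):

```shell
# Kernel-level per-interface counters (RX/TX errors and dropped packets):
ip -s link show dev eth0

# NIC/driver-level counters, which often include discard counts
# (the exact counter names vary per driver):
ethtool -S eth0 | grep -Ei 'err|drop|disc'
```

The switch-side port statistics are worth checking too, since discards can happen on either end of the link.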
The mgmt network is a 1 Gbps network, but this
one shouldn't be used for
such pings among the OSDs ...
I doubt Ceph will use the mgmt network, but I'm not sure whether it's
doing a lookup on the hostname (which might resolve to the mgmt network in
your case) or using the IPs configured for Ceph.
You can dump the OSD network info per OSD on the storage nodes themselves
with this command:
ceph daemon osd.$id dump_osd_network
You would have to do that for every OSD and see which ones report
"entries".
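A minimal sketch for looping over all OSDs on one node, assuming the default admin-socket layout under /var/run/ceph (adjust the path if your cluster uses a non-default $cluster name or run directory):

```shell
# Dump slow-ping info for every OSD whose admin socket is on this node.
for sock in /var/run/ceph/ceph-osd.*.asok; do
  [ -e "$sock" ] || continue        # skip if no OSDs run on this host
  id=$(basename "$sock" .asok)      # e.g. "ceph-osd.12"
  id=${id#ceph-osd.}                # strip the prefix -> "12"
  echo "=== osd.$id ==="
  ceph daemon "osd.$id" dump_osd_network
done
```

Run on each storage node in turn; any OSD that prints non-empty "entries" is one reporting slow ping times.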
Gr. Stefan
--
| BIT BV
https://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / info(a)bit.nl