PVE CEPH OSD heartbeat show - ceph-users

25 Apr 2023

Dear all,

We are experiencing with Ceph after deploying it by PVE with the network backed by a 10G
Cisco switch with VPC feature on. We are encountering a slow OSD heartbeat and have not
been able to identify any network traffic issues.

Upon checking, we found that the ping is around 0.1ms, and there is occasional 2% packet
loss when using flood ping, but not consistently. We also noticed a large number of UDP
port 5405 packets and the 'corosync' process utilizing a significant amount of
CPU.

When running the 'ceph -s' command, we observed a slow OSD heartbeat on the back
and front, with the longest latency being 2250.54ms. We suspect that this may be a network
issue, but we are unsure of how Ceph detects such long latency. Additionally, we are
wondering if a 2% packet loss can significantly affect Ceph's performance and even
cause the OSD process to fail sometimes.

We have heard about potential issues with rockdb 6 causing OSD process failures, and we
are curious about how to check the rockdb version. Furthermore, we are wondering how
severe traffic package loss and latency must be to cause OSD process crashes, and how the
monitoring system determines that an OSD is offline.

We would greatly appreciate any assistance or insights you could provide on these
matters.
Thanks,