Yeah, annoyingly `ms_bind_ipv4` is set to true by default, so if you just
set `ms_bind_ipv6` without turning off IPv4 you end up in dual-stack mode.
I've created a PR to fix this with at least a warning _and_ to properly
mention this in the documentation:
https://github.com/ceph/ceph/pull/36536
Hopefully it'll land at some point :)
Matt
On Fri, Sep 4, 2020 at 12:12 AM Wido den Hollander <wido(a)42on.com> wrote:
Hi,
Last night I spent a couple of hours debugging an issue where OSDs
would be marked as 'up', but PGs stayed in the 'peering' state.
Looking through the admin socket I saw these OSDs were in the 'booting'
state.
Looking at the OSDMap I saw this:
osd.3 up in weight 1 up_from 26 up_thru 700 down_at 0
last_clean_interval [0,0)
[v2:[2a05:xx0:700:2::7]:6816/7923,v1:[2a05:xx:700:2::7]:6817/7923,v2:
0.0.0.0:6818/7923,v1:0.0.0.0:6819/7923]
[v2:[2a05:xx:700:2::7]:6820/7923,v1:[2a05:1500:700:2::7]:6821/7923,v2:
0.0.0.0:6822/7923,v1:0.0.0.0:6823/7923]
exists,up 786d3e9d-047f-4b09-b368-db9e8dc0805d
In ceph.conf this was set:
ms_bind_ipv6 = true
public_addr = 2a05:xx:700:2::6
On true IPv6-only nodes this works fine. But on nodes where IPv4 is
also present this can (and will?) cause problems.
I did not use tcpdump/wireshark to investigate, but it seems the
OSDs tried to contact each other using the 0.0.0.0 IPv4 address.
After adding these settings the problems were resolved:
ms_bind_msgr1 = false
ms_bind_ipv4 = false
This also disables msgr v1, which we didn't need here: the cluster and
all clients run Octopus.
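For reference, a minimal sketch of the combined ceph.conf settings that gave a working IPv6-only setup, based on the values in this thread (the public_addr below is the masked address from the mail and stands in for your own IPv6 address):

```ini
[global]
# Bind only to IPv6. ms_bind_ipv4 defaults to true, so it must be
# disabled explicitly to avoid ending up in dual-stack mode.
ms_bind_ipv6 = true
ms_bind_ipv4 = false

# Optional: disable the legacy msgr v1 protocol. Only do this when the
# cluster and all clients speak msgr v2 (Nautilus or newer).
ms_bind_msgr1 = false

# The daemon's public IPv6 address (masked example from this thread).
public_addr = 2a05:xx:700:2::6
```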
The OSDMap now showed:
osd.3 up in weight 1 up_from 704 up_thru 712 down_at 702
last_clean_interval [26,701) v2:[2a05:xx:700:2::7]:6804/791503
v2:[2a05:xx:700:2::7]:6805/791503 exists,up
786d3e9d-047f-4b09-b368-db9e8dc0805d
OSDs came back right away, PGs peered and the problems were resolved.
Wido
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io