I know I've seen that warning before, but for some reason it wasn't alerting on these clusters, which were upgraded to 14.2.2 first and then to 14.2.4.
Bryan
On Nov 19, 2019, at 3:20 PM, Paul Emmerich
<paul.emmerich(a)croit.io> wrote:
There should be a warning that says something like "all OSDs are
running nautilus but require-osd-release nautilus is not set"
That warning did exist for older releases; I'm pretty sure Nautilus has it too?
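If it's present it shows up in the regular health output, so after an upgrade it's worth a quick:

# any such warning would be listed alongside the rest of the health checks
ceph health detail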
Paul
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at
https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
On Tue, Nov 19, 2019 at 8:42 PM Bryan Stillwell <bstillwell(a)godaddy.com> wrote:
>
> Closing the loop here. I figured out that I missed a step during the Nautilus upgrade which was causing this issue:
>
> ceph osd require-osd-release nautilus
>
> If you don't do this, your cluster will start having problems (in our case, OSDs registering v1-only addresses) once you enable msgr2:
>
> ceph mon enable-msgr2
>
> Based on how hard this was to track down, maybe a check should be added before enabling msgr2 to make sure require-osd-release is set to nautilus?
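>
> For anyone following along, the flag is easy to verify before flipping msgr2 on; this relies on the plain "ceph osd dump" output, which includes a require_osd_release line:
>
> # confirm the release gate before enabling msgr2
> ceph osd dump | grep require_osd_release
>
> # if it still shows an older release, set it first, then enable msgr2
> ceph osd require-osd-release nautilus
> ceph mon enable-msgr2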
>
> Bryan
>
>> On Nov 18, 2019, at 5:41 PM, Bryan Stillwell <bstillwell(a)godaddy.com> wrote:
>>
>> I cranked up debug_ms to 20 on two of these clusters today, and I still don't understand why some of the clusters use v2 and some just use v1.
>>
>> Here's the boot/peering process for the cluster which uses v2:
>>
>> 2019-11-18 16:46:03.027 7fabb6281dc0 0 osd.0 39101 done with init, starting boot process
>> 2019-11-18 16:46:03.028 7fabb6281dc0 1 osd.0 39101 start_boot
>> 2019-11-18 16:46:03.030 7fabaebac700 5 --2- [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> [v2:10.0.32.3:6800/1473285,v1:10.0.32.3:6801/1473285] conn(0x5596b30c3000 0x5596b4bf4000 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).handle_hello received hello: peer_type=16 peer_addr_for_me=v2:10.0.32.67:51508/0
>> 2019-11-18 16:46:03.034 7faba8116700 1 -- [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] --> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] -- osd_boot(osd.0 booted 0 features 4611087854031667199 v39101) v7 -- 0x5596b4bd6000 con 0x5596b3b06400
>> 2019-11-18 16:46:03.034 7faba8116700 5 --2- [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 tx=0).send_message enqueueing message m=0x5596b4bd6000 type=71 osd_boot(osd.0 booted 0 features 4611087854031667199 v39101) v7
>> 2019-11-18 16:46:03.034 7fabaf3ad700 20 --2- [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 tx=0).prepare_send_message m=osd_boot(osd.0 booted 0 features 4611087854031667199 v39101) v7
>> 2019-11-18 16:46:03.034 7fabaf3ad700 20 --2- [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 tx=0).prepare_send_message encoding features 4611087854031667199 0x5596b4bd6000 osd_boot(osd.0 booted 0 features 4611087854031667199 v39101) v7
>> 2019-11-18 16:46:03.034 7fabaf3ad700 5 --2- [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 tx=0).write_message sending message m=0x5596b4bd6000 seq=8 osd_boot(osd.0 booted 0 features 4611087854031667199 v39101) v7
>> 2019-11-18 16:46:03.352 7fab9d100700 1 osd.0 39104 state: booting -> active
>> 2019-11-18 16:46:03.354 7fabaebac700 5 --2- [v2:10.0.32.67:6802/258117,v1:10.0.32.67:6803/258117] >> [v2:10.0.32.9:6802/3892454,v1:10.0.32.9:6803/3892454] conn(0x5596b4d68800 0x5596b4bf5080 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello received hello: peer_type=4 peer_addr_for_me=v2:10.0.32.67:45488/0
>> 2019-11-18 16:46:03.354 7fabafbae700 5 --2- [v2:10.0.32.67:6802/258117,v1:10.0.32.67:6803/258117] >> [v2:10.0.32.142:6810/2881684,v1:10.0.32.142:6811/2881684] conn(0x5596b4d68000 0x5596b4bf4580 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello received hello: peer_type=4 peer_addr_for_me=v2:10.0.32.67:39044/0
>> 2019-11-18 16:46:03.355 7fabaf3ad700 5 --2- >> [v2:10.0.32.67:6814/100535,v1:10.0.32.67:6815/100535] conn(0x5596b4d68400 0x5596b4bf4b00 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).handle_hello received hello: peer_type=4 peer_addr_for_me=v2:10.0.32.67:51558/0
>> 2019-11-18 16:46:03.355 7fabaf3ad700 1 -- 10.0.32.67:0/258117 learned_addr learned my addr 10.0.32.67:0/258117 (peer_addr_for_me v2:10.0.32.67:0/0)
>> 2019-11-18 16:46:03.355 7fabafbae700 5 --2- >> [v2:10.0.32.67:6812/100535,v1:10.0.32.67:6813/100535] conn(0x5596b4d68c00 0x5596b4bf5600 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).handle_hello received hello: peer_type=4 peer_addr_for_me=v2:10.0.32.67:40378/0
>> 2019-11-18 16:46:03.355 7fabafbae700 1 -- 10.0.32.67:0/258117 learned_addr learned my addr 10.0.32.67:0/258117 (peer_addr_for_me v2:10.0.32.67:0/0)
>>
>>
>> You can see at the end it learns the address to be v2:10.0.32.67:0/0, but compare that to the cluster which uses v1:
>>
>> 2019-11-18 16:46:05.066 7f9182d8ce00 0 osd.0 46410 done with init, starting boot process
>> 2019-11-18 16:46:05.066 7f9182d8ce00 1 osd.0 46410 start_boot
>> 2019-11-18 16:46:05.069 7f917becf700 5 --2- [v2:10.0.13.2:6800/3084510,v1:10.0.13.2:6801/3084510] >> [v2:10.0.12.131:6800/3507,v1:10.0.12.131:6801/3507] conn(0x56011fc40000 0x56010f0b3b80 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).handle_hello received hello: peer_type=16 peer_addr_for_me=v2:10.0.13.2:44708/0
>> 2019-11-18 16:46:05.072 7f9174c3e700 1 -- [v2:10.0.13.2:6800/3084510,v1:10.0.13.2:6801/3084510] --> [v2:10.0.13.137:3300/0,v1:10.0.13.137:6789/0] -- osd_boot(osd.0 booted 0 features 4611087854031667199 v46410) v7 -- 0x56011fc26000 con 0x56010f291400
>> 2019-11-18 16:46:05.072 7f9174c3e700 5 --2- [v2:10.0.13.2:6800/3084510,v1:10.0.13.2:6801/3084510] >> [v2:10.0.13.137:3300/0,v1:10.0.13.137:6789/0] conn(0x56010f291400 0x56010f0b3080 crc :-1 s=READY pgs=9200660 cs=0 l=1 rx=0 tx=0).send_message enqueueing message m=0x56011fc26000 type=71 osd_boot(osd.0 booted 0 features 4611087854031667199 v46410) v7
>> 2019-11-18 16:46:05.072 7f917c6d0700 20 --2- [v2:10.0.13.2:6800/3084510,v1:10.0.13.2:6801/3084510] >> [v2:10.0.13.137:3300/0,v1:10.0.13.137:6789/0] conn(0x56010f291400 0x56010f0b3080 crc :-1 s=READY pgs=9200660 cs=0 l=1 rx=0 tx=0).prepare_send_message m=osd_boot(osd.0 booted 0 features 4611087854031667199 v46410) v7
>> 2019-11-18 16:46:05.072 7f917c6d0700 20 --2- [v2:10.0.13.2:6800/3084510,v1:10.0.13.2:6801/3084510] >> [v2:10.0.13.137:3300/0,v1:10.0.13.137:6789/0] conn(0x56010f291400 0x56010f0b3080 crc :-1 s=READY pgs=9200660 cs=0 l=1 rx=0 tx=0).prepare_send_message encoding features 4611087854031667199 0x56011fc26000 osd_boot(osd.0 booted 0 features 4611087854031667199 v46410) v7
>> 2019-11-18 16:46:05.072 7f917c6d0700 5 --2- [v2:10.0.13.2:6800/3084510,v1:10.0.13.2:6801/3084510] >> [v2:10.0.13.137:3300/0,v1:10.0.13.137:6789/0] conn(0x56010f291400 0x56010f0b3080 crc :-1 s=READY pgs=9200660 cs=0 l=1 rx=0 tx=0).write_message sending message m=0x56011fc26000 seq=8 osd_boot(osd.0 booted 0 features 4611087854031667199 v46410) v7
>> 2019-11-18 16:46:05.951 7f917c6d0700 10 -- [v2:10.0.13.2:6806/3084510,v1:10.0.13.2:6807/3084510] >> conn(0x56011fc40c00 legacy=0x56011fcc1800 unknown :-1 s=STATE_NONE l=0).accept sd=73 listen_addr v1:10.0.13.2:6807/3084510 peer_addr v1:10.0.12.146:50030/0
>> 2019-11-18 16:46:05.951 7f917c6d0700 10 -- [v2:10.0.13.2:6804/3084510,v1:10.0.13.2:6805/3084510] >> conn(0x56011fc41000 legacy=0x56011fcc3000 unknown :-1 s=STATE_NONE l=0).accept sd=74 listen_addr v1:10.0.13.2:6805/3084510 peer_addr v1:10.0.12.146:45484/0
>> 2019-11-18 16:46:05.956 7f917c6d0700 10 -- [v2:10.0.13.2:6806/3084510,v1:10.0.13.2:6807/3084510] >> conn(0x56011fc41400 legacy=0x56011fcc3800 unknown :-1 s=STATE_NONE l=0).accept sd=75 listen_addr v1:10.0.13.2:6807/3084510 peer_addr v1:10.0.12.137:50776/0
>> 2019-11-18 16:46:05.956 7f917c6d0700 10 -- [v2:10.0.13.2:6804/3084510,v1:10.0.13.2:6805/3084510] >> conn(0x56011fc41800 legacy=0x56011fda4000 unknown :-1 s=STATE_NONE l=0).accept sd=76 listen_addr v1:10.0.13.2:6805/3084510 peer_addr v1:10.0.12.137:60880/0
>> 2019-11-18 16:46:05.960 7f917c6d0700 10 -- [v2:10.0.13.2:6802/3084510,v1:10.0.13.2:6803/3084510] >> conn(0x56011fc41c00 legacy=0x56011fda4800 unknown :-1 s=STATE_NONE l=0).accept sd=77 listen_addr v1:10.0.13.2:6803/3084510 peer_addr v1:10.0.13.146:57630/0
>> 2019-11-18 16:46:05.974 7f917c6d0700 10 -- [v2:10.0.13.2:6802/3084510,v1:10.0.13.2:6803/3084510] >> conn(0x56011fdb0000 legacy=0x56011fda5000 unknown :-1 s=STATE_NONE l=0).accept sd=78 listen_addr v1:10.0.13.2:6803/3084510 peer_addr v1:10.0.13.134:58584/0
>> 2019-11-18 16:46:05.974 7f917c6d0700 10 -- [v2:10.0.13.2:6806/3084510,v1:10.0.13.2:6807/3084510] >> conn(0x56011fdb0400 legacy=0x56011fda5800 unknown :-1 s=STATE_NONE l=0).accept sd=79 listen_addr v1:10.0.13.2:6807/3084510 peer_addr v1:10.0.13.134:56662/0
>> 2019-11-18 16:46:05.974 7f917c6d0700 10 -- [v2:10.0.13.2:6804/3084510,v1:10.0.13.2:6805/3084510] >> conn(0x56011fdb0800 legacy=0x56011fda6000 unknown :-1 s=STATE_NONE l=0).accept sd=80 listen_addr v1:10.0.13.2:6805/3084510 peer_addr v1:10.0.13.134:39326/0
>> 2019-11-18 16:46:05.977 7f9169c28700 1 osd.0 46413 state: booting -> active
>> 2019-11-18 16:46:05.978 7f917b6ce700 1 -- 10.0.13.2:0/3084510 learned_addr learned my addr 10.0.13.2:0/3084510 (peer_addr_for_me v1:10.0.13.2:0/0)
>> 2019-11-18 16:46:05.979 7f917becf700 1 -- 10.0.13.2:0/3084510 learned_addr learned my addr 10.0.13.2:0/3084510 (peer_addr_for_me v1:10.0.13.2:0/0)
>>
>>
>> The learned address is v1:10.0.13.2:0/0. What else can I do to figure out why it's deciding to use the legacy protocol only?
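>>
>> A couple of things I can try next (going from memory, so treat the exact commands as approximate): confirm the monmap actually advertises v2 addresses, and double-check the osdmap release gate:
>>
>> # each mon's addrvec should include a v2:...:3300 entry, and
>> # min_mon_release should say nautilus
>> ceph mon dump
>>
>> # the osdmap release gate matters here too
>> ceph osd dump | grep require_osd_release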
>>
>> Thanks,
>> Bryan
>>
>>> On Nov 15, 2019, at 4:49 PM, Bryan Stillwell <bstillwell(a)godaddy.com> wrote:
>>>
>>> I've upgraded 7 of our clusters to Nautilus (14.2.4) and noticed that on some of the clusters (3 out of 7) the OSDs aren't using msgr2 at all. Here's the output for osd.0 on 2 clusters of each type:
>>>
>>> ### Cluster 1 (v1 only):
>>> # ceph osd find 0 | jq -r '.addrs'
>>> {
>>>   "addrvec": [
>>>     {
>>>       "type": "v1",
>>>       "addr": "10.26.0.33:6809",
>>>       "nonce": 4185021
>>>     }
>>>   ]
>>> }
>>>
>>> ### Cluster 2 (v1 only):
>>> # ceph osd find 0 | jq -r '.addrs'
>>> {
>>>   "addrvec": [
>>>     {
>>>       "type": "v1",
>>>       "addr": "10.197.0.243:6801",
>>>       "nonce": 3802140
>>>     }
>>>   ]
>>> }
>>>
>>> ### Cluster 3 (v1 & v2):
>>> # ceph osd find 0 | jq -r '.addrs'
>>> {
>>>   "addrvec": [
>>>     {
>>>       "type": "v2",
>>>       "addr": "10.32.0.36:6802",
>>>       "nonce": 3167
>>>     },
>>>     {
>>>       "type": "v1",
>>>       "addr": "10.32.0.36:6804",
>>>       "nonce": 3167
>>>     }
>>>   ]
>>> }
>>>
>>> ### Cluster 4 (v1 & v2):
>>> # ceph osd find 0 | jq -r '.addrs'
>>> {
>>>   "addrvec": [
>>>     {
>>>       "type": "v2",
>>>       "addr": "10.36.0.12:6820",
>>>       "nonce": 3150
>>>     },
>>>     {
>>>       "type": "v1",
>>>       "addr": "10.36.0.12:6827",
>>>       "nonce": 3150
>>>     }
>>>   ]
>>> }
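>>>
>>> To see how widespread this is without checking each OSD by hand, something like the jq below should work; I'm assuming the Nautilus "ceph osd dump" JSON layout with .osds[].public_addrs.addrvec here, so double-check it on your version:
>>>
>>> # count OSDs whose addrvec has no v2 entry
>>> ceph osd dump --format json | jq '[.osds[] | select([.public_addrs.addrvec[].type] | index("v2") | not)] | length'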
>>>
>>>
>>> All of the mon nodes have the same msgr settings:
>>>
>>> # ceph daemon mon.$(hostname -s) config show | grep msgr
>>> "mon_warn_on_msgr2_not_enabled": "true",
>>> "ms_bind_msgr1": "true",
>>> "ms_bind_msgr2": "true",
>>> "ms_msgr2_encrypt_messages": "false",
>>> "ms_msgr2_sign_messages": "false",
>>>
>>>
>>> What could be causing this? All of the clusters are listening on port 3300 for v2 and 6789 for v1. I can even connect to port 3300 on the mon nodes from the OSD nodes.
>>>
>>> Thanks,
>>> Bryan
>>