Hi all,
I'm experiencing some stability issues with my ceph-mgr daemons after upgrading to Nautilus
(13.2.6 -> 14.2.2).
The cluster runs on Ubuntu 18.04; the mgrs are colocated with the mons, and 2 of the 3
mon/mgr nodes also host an MDS, running a single rank as active:standby-replay.
The main issue I'm seeing is that mgrs silently go unresponsive: ceph -s will report
"no active mgr" even though none of the mgr daemons have actually crashed or
exited. They typically end up coming back up/in after some amount of time, but in the
meantime metrics stall and mgr commands stall/hang.
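For reference, the workaround I've been using when waiting isn't an option is to fail over
to a standby (or restart the daemon outright); roughly the following, where the jq parsing
and the systemd unit name are assumptions about a default-style deployment:

```shell
# See which mgr the mons think is active and which are standbys
ceph mgr dump -f json-pretty | grep -E 'active_name|"name"'

# Ask the mons to fail the active mgr; a standby should take over
ceph mgr fail "$(ceph mgr dump -f json | jq -r '.active_name')"

# Or restart the daemon on its node (default systemd unit naming assumed)
sudo systemctl restart "ceph-mgr@$(hostname -s)"
```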
$ ceph mgr module ls
{
"enabled_modules": [
"diskprediction_local",
"influx",
"iostat",
"prometheus",
"rbd_support",
"restful",
"telemetry"
],
I ran the mgr with debug=20 and was unable to find anything terribly useful there, but I
can try to reproduce things if that seems helpful.
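For completeness, this is roughly how I raised the debug level (at runtime via the admin
socket, or via the mon config database; the daemon id matching the short hostname is an
assumption of my setup):

```shell
# Bump mgr debug logging at runtime via the admin socket on the mgr node
ceph daemon "mgr.$(hostname -s)" config set debug_mgr 20/20

# Or persistently, via the centralized config database (Nautilus+)
ceph config set mgr debug_mgr 20/20

# 20/20 is very chatty, so revert to the default afterwards
ceph config rm mgr debug_mgr
```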
The only real non-standard bit of my mgr instances is that I cherry-picked this change
<https://github.com/ceph/ceph/pull/28427/files#diff-5add36422933f238d3aaf5291e106489>
to module.py to cure an issue with the influx plugin.
Typically these stalls happen while the mgr is servicing a command, and lately that command
has been the balancer running.
For example:
$ time ceph balancer eval $POOL --debug-ms 1
2019-08-14 11:36:21.370 7f4990601700 1 Processor -- start
2019-08-14 11:36:21.370 7f4990601700 1 -- start start
2019-08-14 11:36:21.370 7f4990601700 1 --2- >> v2:10.0.1.250:3300/0
conn(0x7f4988168a80 0x7f4988169ae0 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 tx=0).connect
2019-08-14 11:36:21.370 7f4990601700 1 --2- >> v2:10.0.1.249:3300/0
conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 tx=0).connect
2019-08-14 11:36:21.370 7f4990601700 1 -- --> v1:10.0.1.250:6789/0 -- auth(proto 0
30 bytes epoch 0) v1 -- 0x7f498808bbb0 con 0x7f49881684c0
2019-08-14 11:36:21.370 7f4990601700 1 -- --> v2:10.0.1.249:3300/0 -- mon_getmap
magic: 0 v1 -- 0x7f498808ce00 con 0x7f498816f6a0
2019-08-14 11:36:21.370 7f4990601700 1 -- --> v2:10.0.1.250:3300/0 -- mon_getmap
magic: 0 v1 -- 0x7f498808cfb0 con 0x7f4988168a80
2019-08-14 11:36:21.370 7f498dbc6700 1 --2- >> v2:10.0.1.250:3300/0
conn(0x7f4988168a80 0x7f4988169ae0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rx=0
tx=0)._handle_peer_banner_payload supported=0 required=0
2019-08-14 11:36:21.370 7f498dbc6700 1 --2- >> v2:10.0.1.250:3300/0
conn(0x7f4988168a80 0x7f4988169ae0 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=0 rx=0
tx=0).handle_hello peer v2:10.0.1.250:3300/0 says I am v2:10.0.1.250:44952/0 (socket says
10.0.1.250:44952)
2019-08-14 11:36:21.370 7f498dbc6700 1 -- 10.0.1.250:0/2021657657 learned_addr learned
my addr 10.0.1.250:0/2021657657 (peer_addr_for_me v2:10.0.1.250:0/0)
2019-08-14 11:36:21.370 7f498ebc8700 1 --2- 10.0.1.250:0/2021657657 >>
v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 s=BANNER_CONNECTING
pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
2019-08-14 11:36:21.370 7f498dbc6700 1 -- 10.0.1.250:0/2021657657 >>
v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 msgr2=0x7f4988171b00 unknown :-1
s=STATE_CONNECTION_ESTABLISHED l=0).mark_down
2019-08-14 11:36:21.370 7f498dbc6700 1 --2- 10.0.1.250:0/2021657657 >>
v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 s=AUTH_CONNECTING
pgs=0 cs=0 l=0 rx=0 tx=0).stop
2019-08-14 11:36:21.370 7f498dbc6700 1 -- 10.0.1.250:0/2021657657 >>
v1:10.0.1.250:6789/0 conn(0x7f49881684c0 legacy=0x7f49881692e0 unknown :-1
s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-08-14 11:36:21.370 7f498dbc6700 1 -- 10.0.1.250:0/2021657657 -->
v2:10.0.1.250:3300/0 -- mon_subscribe({config=0+,monmap=0+}) v3 -- 0x7f498808d1b0 con
0x7f4988168a80
2019-08-14 11:36:21.370 7f498dbc6700 1 --2- 10.0.1.250:0/2021657657 >>
[v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 0x7f4988169ae0 crc :-1
s=READY pgs=82826 cs=0 l=1 rx=0 tx=0).ready entity=mon.1 client_cookie=10837f5a102e309b
server_cookie=0 in_seq=0 out_seq=0
2019-08-14 11:36:21.370 7f498d3c5700 1 -- 10.0.1.250:0/2021657657 <== mon.1
v2:10.0.1.250:3300/0 1 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 0x7f497c0020d0
con 0x7f4988168a80
2019-08-14 11:36:21.370 7f498d3c5700 1 -- 10.0.1.250:0/2021657657 <== mon.1
v2:10.0.1.250:3300/0 2 ==== config(6 keys) v1 ==== 207+0+0 (crc 0 0 0) 0x7f497c002410 con
0x7f4988168a80
2019-08-14 11:36:21.370 7f498d3c5700 1 -- 10.0.1.250:0/2021657657 <== mon.1
v2:10.0.1.250:3300/0 3 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 0x7f497c002b40
con 0x7f4988168a80
2019-08-14 11:36:21.370 7f4990601700 1 -- 10.0.1.250:0/2021657657 >>
[v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 msgr2=0x7f4988169ae0 crc
:-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-08-14 11:36:21.370 7f4990601700 1 --2- 10.0.1.250:0/2021657657 >>
[v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 0x7f4988169ae0 crc :-1
s=READY pgs=82826 cs=0 l=1 rx=0 tx=0).stop
2019-08-14 11:36:21.370 7f4990601700 1 -- 10.0.1.250:0/2021657657 shutdown_connections
2019-08-14 11:36:21.370 7f4990601700 1 --2- 10.0.1.250:0/2021657657 >>
v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 s=CLOSED pgs=0 cs=0
l=0 rx=0 tx=0).stop
2019-08-14 11:36:21.370 7f4990601700 1 --2- 10.0.1.250:0/2021657657 >>
[v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 0x7f4988169ae0 unknown :-1
s=CLOSED pgs=82826 cs=0 l=1 rx=0 tx=0).stop
2019-08-14 11:36:21.370 7f4990601700 1 -- 10.0.1.250:0/2021657657 shutdown_connections
2019-08-14 11:36:21.370 7f4990601700 1 -- 10.0.1.250:0/2021657657 wait complete.
2019-08-14 11:36:21.370 7f4990601700 1 -- 10.0.1.250:0/2021657657 >>
10.0.1.250:0/2021657657 conn(0x7f4988163a40 msgr2=0x7f4988165ed0 unknown :-1 s=STATE_NONE
l=0).mark_down
2019-08-14 11:36:21.374 7f4990601700 1 Processor -- start
2019-08-14 11:36:21.374 7f4990601700 1 -- start start
2019-08-14 11:36:21.374 7f4990601700 1 -- --> v1:10.0.1.249:6789/0 -- auth(proto 0
30 bytes epoch 0) v1 -- 0x7f49880b3f30 con 0x7f49880a8360
2019-08-14 11:36:21.374 7f4990601700 1 -- --> v1:10.0.1.250:6789/0 -- auth(proto 0
30 bytes epoch 0) v1 -- 0x7f49880b4170 con 0x7f49880b0030
2019-08-14 11:36:21.374 7f4990601700 1 -- --> v1:10.0.1.251:6789/0 -- auth(proto 0
30 bytes epoch 0) v1 -- 0x7f49880b4420 con 0x7f49880a7da0
2019-08-14 11:36:21.374 7f498ebc8700 1 --1- >> v1:10.0.1.250:6789/0
conn(0x7f49880b0030 0x7f49880b2560 :-1 s=CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0
l=1).handle_server_banner_and_identify peer v1:10.0.1.250:6789/0 says I am
v1:10.0.1.250:57466/0 (socket says 10.0.1.250:57466)
2019-08-14 11:36:21.374 7f498ebc8700 1 -- 10.0.1.250:0/40211647 learned_addr learned my
addr 10.0.1.250:0/40211647 (peer_addr_for_me v1:10.0.1.250:0/0)
2019-08-14 11:36:21.374 7f497b7fe700 1 -- 10.0.1.250:0/40211647 <== mon.0
v1:10.0.1.249:6789/0 1 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (unknown
3197104487 0 0) 0x7f497c007c60 con 0x7f49880a8360
2019-08-14 11:36:21.374 7f497b7fe700 1 -- 10.0.1.250:0/40211647 -->
v1:10.0.1.249:6789/0 -- auth(proto 2 36 bytes epoch 0) v1 -- 0x7f496c002d30 con
0x7f49880a8360
2019-08-14 11:36:21.374 7f497b7fe700 1 -- 10.0.1.250:0/40211647 <== mon.0
v1:10.0.1.249:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 935+0+0 (unknown
1101082921 0 0) 0x7f497c007f40 con 0x7f49880a8360
2019-08-14 11:36:21.374 7f497b7fe700 1 -- 10.0.1.250:0/40211647 -->
v1:10.0.1.249:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- 0x7f496c003a30 con
0x7f49880a8360
2019-08-14 11:36:21.374 7f497b7fe700 1 -- 10.0.1.250:0/40211647 <== mon.0
v1:10.0.1.249:6789/0 3 ==== mon_map magic: 0 v1 ==== 395+0+0 (unknown 1022418838 0 0)
0x7f497c0024e0 con 0x7f49880a8360
2019-08-14 11:36:21.374 7f497b7fe700 1 -- 10.0.1.250:0/40211647 >>
v1:10.0.1.251:6789/0 conn(0x7f49880a7da0 legacy=0x7f49880a8c50 unknown :-1
s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-08-14 11:36:21.374 7f497b7fe700 1 -- 10.0.1.250:0/40211647 >>
v1:10.0.1.250:6789/0 conn(0x7f49880b0030 legacy=0x7f49880b2560 unknown :-1
s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-08-14 11:36:21.374 7f497b7fe700 1 -- 10.0.1.250:0/40211647 >>
v1:10.0.1.249:6789/0 conn(0x7f49880a8360 legacy=0x7f49880a9450 unknown :-1
s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-08-14 11:36:21.374 7f497b7fe700 1 --2- 10.0.1.250:0/40211647 >>
[v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 0x7f496c0049c0 unknown :-1
s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
2019-08-14 11:36:21.374 7f497b7fe700 1 --2- 10.0.1.250:0/40211647 >>
[v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 0x7f496c005360 unknown :-1
s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
2019-08-14 11:36:21.374 7f497b7fe700 1 --2- 10.0.1.250:0/40211647 >>
[v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 0x7f496c00bfe0 unknown :-1
s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
2019-08-14 11:36:21.374 7f497b7fe700 1 -- 10.0.1.250:0/40211647 -->
[v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] -- mon_getmap magic: 0 v1 -- 0x7f496c00c530
con 0x7f496c009b80
2019-08-14 11:36:21.374 7f497b7fe700 1 -- 10.0.1.250:0/40211647 -->
[v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_getmap magic: 0 v1 -- 0x7f496c00c6e0
con 0x7f496c004f10
2019-08-14 11:36:21.374 7f497b7fe700 1 -- 10.0.1.250:0/40211647 -->
[v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] -- mon_getmap magic: 0 v1 -- 0x7f496c00c890
con 0x7f496c003ed0
2019-08-14 11:36:21.374 7f498dbc6700 1 --2- 10.0.1.250:0/40211647 >>
[v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 0x7f496c005360 unknown :-1
s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner_payload supported=0
required=0
2019-08-14 11:36:21.374 7f498e3c7700 1 --2- 10.0.1.250:0/40211647 >>
[v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 0x7f496c0049c0 unknown :-1
s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner_payload supported=0
required=0
2019-08-14 11:36:21.374 7f498ebc8700 1 --2- 10.0.1.250:0/40211647 >>
[v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 0x7f496c00bfe0 unknown :-1
s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner_payload supported=0
required=0
2019-08-14 11:36:21.378 7f498dbc6700 1 -- 10.0.1.250:0/40211647 >>
[v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 msgr2=0x7f496c0049c0
unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-08-14 11:36:21.378 7f498dbc6700 1 --2- 10.0.1.250:0/40211647 >>
[v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 0x7f496c0049c0 unknown :-1
s=AUTH_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).stop
2019-08-14 11:36:21.378 7f498dbc6700 1 -- 10.0.1.250:0/40211647 >>
[v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 msgr2=0x7f496c00bfe0
unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-08-14 11:36:21.378 7f498dbc6700 1 --2- 10.0.1.250:0/40211647 >>
[v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 0x7f496c00bfe0 unknown :-1
s=AUTH_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).stop
2019-08-14 11:36:21.378 7f498dbc6700 1 -- 10.0.1.250:0/40211647 -->
[v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_subscribe({config=0+,monmap=0+}) v3 --
0x7f497c008180 con 0x7f496c004f10
2019-08-14 11:36:21.378 7f4990601700 1 -- 10.0.1.250:0/40211647 -->
[v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_subscribe({mgrmap=0+}) v3 --
0x7f49880b47e0 con 0x7f496c004f10
2019-08-14 11:36:21.378 7f498ebc8700 1 --2- 10.0.1.250:0/40211647 >>
[v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 0x7f496c00bfe0 unknown :-1
s=CLOSED pgs=0 cs=0 l=1 rx=0 tx=0).handle_auth_reply_more state changed!
2019-08-14 11:36:21.378 7f498ebc8700 1 -- 10.0.1.250:0/40211647 reap_dead start
2019-08-14 11:36:21.378 7f4990601700 1 -- 10.0.1.250:0/40211647 -->
[v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_subscribe({osdmap=0}) v3 --
0x7f49880a8360 con 0x7f496c004f10
2019-08-14 11:36:21.378 7f498dbc6700 1 --2- 10.0.1.250:0/40211647 >>
[v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 0x7f496c005360 crc :-1
s=READY pgs=82828 cs=0 l=1 rx=0 tx=0).ready entity=mon.1 client_cookie=0 server_cookie=0
in_seq=0 out_seq=0
2019-08-14 11:36:21.378 7f497b7fe700 1 -- 10.0.1.250:0/40211647 <== mon.1
v2:10.0.1.250:3300/0 1 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 0x7f49880b47e0
con 0x7f496c004f10
2019-08-14 11:36:21.378 7f497b7fe700 1 -- 10.0.1.250:0/40211647 <== mon.1
v2:10.0.1.250:3300/0 2 ==== config(6 keys) v1 ==== 207+0+0 (crc 0 0 0) 0x7f497c0010a0 con
0x7f496c004f10
2019-08-14 11:36:21.378 7f497b7fe700 1 -- 10.0.1.250:0/40211647 <== mon.1
v2:10.0.1.250:3300/0 3 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 0x7f497c001490
con 0x7f496c004f10
2019-08-14 11:36:21.378 7f497b7fe700 1 -- 10.0.1.250:0/40211647 <== mon.1
v2:10.0.1.250:3300/0 4 ==== mgrmap(e 2536) v1 ==== 28602+0+0 (crc 0 0 0) 0x7f497c001690
con 0x7f496c004f10
2019-08-14 11:36:21.378 7f497b7fe700 1 --2- 10.0.1.250:0/40211647 >>
[v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 0x7f496c02ace0
unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
2019-08-14 11:36:21.378 7f497b7fe700 1 -- 10.0.1.250:0/40211647 <== mon.1
v2:10.0.1.250:3300/0 5 ==== osd_map(172849..172849 src has 171976..172849) v4 ====
106776+0+0 (crc 0 0 0) 0x7f497c0489d0 con 0x7f496c004f10
2019-08-14 11:36:21.378 7f498e3c7700 1 --2- 10.0.1.250:0/40211647 >>
[v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 0x7f496c02ace0
unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner_payload
supported=0 required=0
2019-08-14 11:36:21.378 7f498e3c7700 1 --2- 10.0.1.250:0/40211647 >>
[v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 0x7f496c02ace0 crc
:-1 s=READY pgs=16527 cs=0 l=1 rx=0 tx=0).ready entity=mgr.1731601147 client_cookie=0
server_cookie=0 in_seq=0 out_seq=0
2019-08-14 11:36:21.382 7f4990601700 1 -- 10.0.1.250:0/40211647 -->
[v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_command({"prefix":
"get_command_descriptions"} v 0) v1 -- 0x7f49680031e0 con 0x7f496c004f10
2019-08-14 11:36:21.390 7f497b7fe700 1 -- 10.0.1.250:0/40211647 <== mon.1
v2:10.0.1.250:3300/0 6 ==== mon_command_ack([{"prefix":
"get_command_descriptions"}]=0 v0) v1 ==== 72+0+101614 (crc 0 0 0)
0x7f497c000e90 con 0x7f496c004f10
2019-08-14 11:36:21.586 7f4990601700 1 -- 10.0.1.250:0/40211647 -->
[v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] -- command(tid 0: {"prefix":
"balancer eval", "option": "$POOL", "target":
["mgr", ""]}) v1 -- 0x7f4968002150 con 0x7f496c028790
2019-08-14 11:37:40.274 7f497b7fe700 1 -- 10.0.1.250:0/40211647 <== mgr.1731601147
v2:10.0.1.250:6800/34061 1 ==== command_reply(tid 0: 0 ) v1 ==== 8+0+47 (crc 0 0 0)
0x7f4968002150 con 0x7f496c028790
pool "$POOL" score 0.009701 (lower is better)
2019-08-14 11:37:40.306 7f4990601700 1 -- 10.0.1.250:0/40211647 >>
[v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790
msgr2=0x7f496c02ace0 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-08-14 11:37:40.306 7f4990601700 1 --2- 10.0.1.250:0/40211647 >>
[v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 0x7f496c02ace0 crc
:-1 s=READY pgs=16527 cs=0 l=1 rx=0 tx=0).stop
2019-08-14 11:37:40.306 7f4990601700 1 -- 10.0.1.250:0/40211647 >>
[v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 msgr2=0x7f496c005360 crc
:-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-08-14 11:37:40.306 7f4990601700 1 --2- 10.0.1.250:0/40211647 >>
[v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 0x7f496c005360 crc :-1
s=READY pgs=82828 cs=0 l=1 rx=0 tx=0).stop
2019-08-14 11:37:40.306 7f4990601700 1 -- 10.0.1.250:0/40211647 shutdown_connections
2019-08-14 11:37:40.306 7f4990601700 1 --2- 10.0.1.250:0/40211647 >>
[v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 0x7f496c02ace0
unknown :-1 s=CLOSED pgs=16527 cs=0 l=1 rx=0 tx=0).stop
2019-08-14 11:37:40.306 7f4990601700 1 --2- 10.0.1.250:0/40211647 >>
[v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 0x7f496c005360 unknown :-1
s=CLOSED pgs=82828 cs=0 l=1 rx=0 tx=0).stop
2019-08-14 11:37:40.306 7f4990601700 1 -- 10.0.1.250:0/40211647 shutdown_connections
2019-08-14 11:37:40.306 7f4990601700 1 -- 10.0.1.250:0/40211647 wait complete.
2019-08-14 11:37:40.306 7f4990601700 1 -- 10.0.1.250:0/40211647 >>
10.0.1.250:0/40211647 conn(0x7f498809d440 msgr2=0x7f498809d8b0 unknown :-1 s=STATE_NONE
l=0).mark_down
real 1m19.023s
user 0m0.288s
sys 0m0.050s
While the mgr didn't die, it took 80 seconds to eval a pool, which typically takes
roughly 2-5 seconds to run.
And while this command was running, the ceph-mgr process was sitting at around 100% CPU in
top.
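In case it helps, this is roughly how I've been trying to see where that CPU time goes;
the gdb/py-spy steps are only sketches (py-spy is a separate install, and I haven't
confirmed how well it copes with ceph-mgr's embedded Python interpreter):

```shell
# Per-thread CPU view, to spot which ceph-mgr thread is spinning
top -H -p "$(pidof ceph-mgr)"

# Native stacks of all threads (useful if debug symbols are installed)
sudo gdb -p "$(pidof ceph-mgr)" -batch -ex 'thread apply all bt'

# Python-level stacks, since the balancer module is Python
sudo py-spy dump --pid "$(pidof ceph-mgr)"
```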
2019-08-14 11:36:20.734 7fbd77570700 0
ms_deliver_dispatch: unhandled message 0x55d96ee809c0 mgrreport(mds.db +110-0 packed 1366)
v7 from mds.1717053370 v2:10.0.1.251:6800/140458953
2019-08-14 11:36:20.734 7fbd8a95a700 1 mgr finish mon failed to return metadata for
mds.db: (2) No such file or directory
2019-08-14 11:36:21.586 7fbd77570700 0 log_channel(audit) log [DBG] :
from='client.1731656635 -' entity='client.admin' cmd=[{"prefix":
"balancer eval", "option": "$POOL", "target":
["mgr", ""]}]: dispatch
2019-08-14 11:36:21.734 7fbd77570700 0 ms_deliver_dispatch: unhandled message
0x55d968250a40 mgrreport(mds.db +110-0 packed 1366) v7 from mds.1717053370
v2:10.0.1.251:6800/140458953
2019-08-14 11:36:21.734 7fbd8a95a700 1 mgr finish mon failed to return metadata for
mds.db: (2) No such file or directory
2019-08-14 11:36:22.734 7fbd77570700 0 ms_deliver_dispatch: unhandled message
0x55d969297a00 mgrreport(mds.db +110-0 packed 1366) v7 from mds.1717053370
v2:10.0.1.251:6800/140458953
[snip-repeating]
2019-08-14 11:37:38.826 7fbd77570700 0
ms_deliver_dispatch: unhandled message 0x55d96b768d80 mgrreport(mds.db +110-0 packed 1366)
v7 from mds.1717053370 v2:10.0.1.251:6800/140458953
2019-08-14 11:37:38.826 7fbd8a95a700 1 mgr finish mon failed to return metadata for
mds.db: (2) No such file or directory
2019-08-14 11:37:39.574 7fbd77d71700 1 mgr[balancer] Handling command:
'{'prefix': 'balancer eval', 'option': '$POOL',
'target': ['mgr', '']}'
2019-08-14 11:37:40.186 7fbd77570700 0 ms_deliver_dispatch: unhandled message
0x55d967aa23c0 mgrreport(mds.db +110-0 packed 1366) v7 from mds.1717053370
v2:10.0.1.251:6800/140458953
2019-08-14 11:37:40.186 7fbd8a95a700 1 mgr finish mon failed to return metadata for
mds.db: (2) No such file or directory
2019-08-14 11:37:41.054 7fbd77570700 0 ms_deliver_dispatch: unhandled message
0x55d96117c9c0 mgrreport(mds.db +110-0 packed 1366) v7 from mds.1717053370
v2:10.0.1.251:6800/140458953
2019-08-14 11:37:41.054 7fbd8a95a700 1 mgr finish mon failed to return metadata for
mds.db: (2) No such file or directory
I filtered out a bunch of messages from the mgr servicing `osd pool stats` requests, which
I have running in a watch -n1 loop.
I also snipped a bunch of repeating messages about mds.db, the standby-replay MDS; I'm not
sure whether those are worth caring about.
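If it matters, the metadata the mon claims not to have should be queryable directly;
something along these lines (daemon name taken from the log above):

```shell
# Ask the mons for the metadata of the MDS the mgr is complaining about
ceph mds metadata db

# And check how the standby-replay daemon shows up in the fs map
ceph fs dump | grep -i standby
```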
Hoping someone may have some ideas of where to look to try and improve stability of the
mgrs.
Thanks,
Reed