I would look into a potential network problem. Check the interface error counters on both the server side and the switch side.
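For reference, the server-side half of that check can be sketched as follows (the interface name eth0 is a placeholder; substitute your cluster-facing NIC, and note that ethtool may not be installed everywhere):

```shell
# Kernel-level RX/TX error and drop counters for the interface
# (eth0 is a placeholder for the cluster-facing NIC):
ip -s link show dev eth0

# Driver/NIC-level counters, if ethtool is available; CRC errors here
# often point at a bad cable, SFP, or switch port:
ethtool -S eth0 | grep -iE 'err|drop|crc'
```

The switch-side counters for the same ports are worth comparing; the exact command there is vendor-specific.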
Otherwise I'm not really sure what's going on. Someone else will have to jump
into the conversation.
Bryan
On Oct 29, 2019, at 10:38 AM, Thomas Schneider <74cmonty(a)gmail.com> wrote:
Thanks.
2 of 4 MGR nodes are sick.
I have stopped MGR services on both nodes.
When I start the service again on node A, I get this in its log:
root@ld5508:~# tail -f /var/log/ceph/ceph-mgr.ld5508.log
2019-10-29 17:32:02.024 7fe20e881700 0 --1- 10.97.206.96:0/201758478 >>
v1:10.97.206.96:7055/17961 conn(0x564582ad5180 0x5645991ca800 :-1
s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2
connect got BADAUTHORIZER
[... the same "connect got BADAUTHORIZER" message repeats every few milliseconds, alternating between two connection objects, until the service is terminated ...]
2019-10-29 17:32:02.064 7fe209fe8700 -1 received signal: Terminated
from /sbin/init (PID: 1) UID: 0
2019-10-29 17:32:02.064 7fe209fe8700 -1 mgr handle_signal *** Got signal
Terminated ***
2019-10-29 17:37:54.319 7f0e26fc1dc0 0 set uid:gid to 64045:64045
(ceph:ceph)
2019-10-29 17:37:54.319 7f0e26fc1dc0 0 ceph version 14.2.4
(65249672c6e6d843510e7e01f8a4b976dcac3db1) nautilus (stable), process
ceph-mgr, pid 250399
2019-10-29 17:37:54.319 7f0e26fc1dc0 0 pidfile_write: ignore empty
--pid-file
2019-10-29 17:37:54.331 7f0e26fc1dc0 1 mgr[py] Loading python module
'ansible'
2019-10-29 17:37:54.503 7f0e26fc1dc0 1 mgr[py] Loading python module
'balancer'
2019-10-29 17:37:54.531 7f0e26fc1dc0 1 mgr[py] Loading python module
'crash'
2019-10-29 17:37:54.551 7f0e26fc1dc0 1 mgr[py] Loading python module
'dashboard'
2019-10-29 17:37:54.915 7f0e26fc1dc0 1 mgr[py] Loading python module
'deepsea'
2019-10-29 17:37:55.071 7f0e26fc1dc0 1 mgr[py] Loading python module
'devicehealth'
2019-10-29 17:37:55.103 7f0e26fc1dc0 1 mgr[py] Loading python module
'influx'
2019-10-29 17:37:55.127 7f0e26fc1dc0 1 mgr[py] Loading python module
'insights'
2019-10-29 17:37:55.207 7f0e26fc1dc0 1 mgr[py] Loading python module
'iostat'
2019-10-29 17:37:55.227 7f0e26fc1dc0 1 mgr[py] Loading python module
'localpool'
2019-10-29 17:37:55.247 7f0e26fc1dc0 1 mgr[py] Loading python module
'orchestrator_cli'
2019-10-29 17:37:55.295 7f0e26fc1dc0 1 mgr[py] Loading python module
'pg_autoscaler'
2019-10-29 17:37:55.347 7f0e26fc1dc0 1 mgr[py] Loading python module
'progress'
2019-10-29 17:37:55.387 7f0e26fc1dc0 1 mgr[py] Loading python module
'prometheus'
2019-10-29 17:37:55.599 7f0e26fc1dc0 1 mgr[py] Loading python module
'rbd_support'
2019-10-29 17:37:55.647 7f0e26fc1dc0 1 mgr[py] Loading python module
'restful'
2019-10-29 17:37:55.959 7f0e26fc1dc0 1 mgr[py] Loading python module
'selftest'
2019-10-29 17:37:55.983 7f0e26fc1dc0 1 mgr[py] Loading python module
'status'
2019-10-29 17:37:56.015 7f0e26fc1dc0 1 mgr[py] Loading python module
'telegraf'
2019-10-29 17:37:56.051 7f0e26fc1dc0 1 mgr[py] Loading python module
'telemetry'
2019-10-29 17:37:56.331 7f0e26fc1dc0 1 mgr[py] Loading python module
'test_orchestrator'
2019-10-29 17:37:56.399 7f0e26fc1dc0 1 mgr[py] Loading python module
'volumes'
2019-10-29 17:37:56.459 7f0e26fc1dc0 1 mgr[py] Loading python module
'zabbix'
2019-10-29 17:37:56.503 7f0e21cdd700 1 mgr load Constructed class from
module: dashboard
2019-10-29 17:37:56.503 7f0e214dc700 0 ms_deliver_dispatch: unhandled
message 0x56346f978400 mon_map magic: 0 v1 from mon.0 v2:10.97.206.93:3300/0
2019-10-29 17:37:56.507 7f0e214dc700 0 client.0 ms_handle_reset on
v2:10.97.206.93:6912/22258
2019-10-29 17:37:56.743 7f0e16363700 0 mgr[dashboard]
[29/Oct/2019:17:37:56] ENGINE Error in HTTPServer.tick
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/__init__.py", line 2021, in start
    self.tick()
  File "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/__init__.py", line 2090, in tick
    s, ssl_env = self.ssl_adapter.wrap(s)
  File "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/ssl_builtin.py", line 67, in wrap
    server_side=True)
  File "/usr/lib/python2.7/ssl.py", line 369, in wrap_socket
    _context=self)
  File "/usr/lib/python2.7/ssl.py", line 599, in __init__
    self.do_handshake()
  File "/usr/lib/python2.7/ssl.py", line 828, in do_handshake
    self._sslobj.do_handshake()
error: [Errno 0] Error
^C
This looks like a severe issue.
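[Editor's note: two things stand out in the log above, and each has a quick check. BADAUTHORIZER during cephx authentication is commonly associated with clock skew between nodes or with a mgr keyring that no longer matches what the monitors hold, and the SSL "[Errno 0] Error" in the dashboard thread indicates a failed TLS handshake. A hedged sketch of checks; the mgr ID ld5508 is taken from the log, and port 8443 is only the dashboard default:]

```shell
# 1) Clock skew: cephx tickets are time-sensitive, so verify NTP sync
#    on every mon/mgr node:
timedatectl status

# 2) Key mismatch: compare the mgr key the monitors hold with the
#    keyring on disk (default keyring path for this mgr ID):
ceph auth get mgr.ld5508
cat /var/lib/ceph/mgr/ceph-ld5508/keyring

# 3) Dashboard TLS: exercise the handshake directly (8443 is the
#    default port; "ceph mgr services" shows the real URL):
echo | openssl s_client -connect ld5508:8443
```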
On 29.10.2019 at 17:22, Bryan Stillwell wrote:
On Oct 29, 2019, at 9:44 AM, Thomas Schneider
<74cmonty(a)gmail.com> wrote:
in my unhealthy cluster I cannot run several ceph osd commands because they hang, e.g.:
ceph osd df
ceph osd pg dump
Also, ceph balancer status hangs.
How can I fix this issue?
Check the status of your ceph-mgr processes (restart
them if needed and check the logs for more details). Those are responsible for handling
those commands in recent releases.
Bryan
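[Editor's note: Bryan's suggestion translates to roughly the following commands. The mgr ID ld5508 and the ceph-mgr@<id> systemd unit naming are assumptions based on the log and typical Nautilus deployments:]

```shell
# Which mgr is active, and are standbys available?
ceph mgr dump | grep -E 'active_name|available'

# Restart the mgr on the affected node and watch its logs:
systemctl restart ceph-mgr@ld5508
journalctl -u ceph-mgr@ld5508 --since "10 min ago"
tail -f /var/log/ceph/ceph-mgr.ld5508.log
```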