I would look into a potential network problem. Check the interface error counters on both the server side and the switch side.
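For reference, the server-side half of that check can be sketched as follows (the interface name eth0 is a placeholder; substitute your cluster-facing NIC, and note that ethtool may not be installed everywhere):

```shell
# Kernel-level RX/TX error and drop counters for the interface
# (eth0 is a placeholder for the cluster-facing NIC):
ip -s link show dev eth0

# Driver/NIC-level counters, if ethtool is available; CRC errors here
# often point at a bad cable, SFP, or switch port:
ethtool -S eth0 | grep -iE 'err|drop|crc'
```

The switch-side counters for the same ports are worth comparing; the exact command there is vendor-specific.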
Otherwise I'm not really sure what's going on. Someone else will have to jump
into the conversation.
Bryan
On Oct 29, 2019, at 10:38 AM, Thomas Schneider <74cmonty(a)gmail.com> wrote:
Thanks.
2 of 4 MGR nodes are sick.
I have stopped MGR services on both nodes.
When I start the service again on node A, I get this in its log:
root@ld5508:~# tail -f /var/log/ceph/ceph-mgr.ld5508.log
2019-10-29 17:32:02.024 7fe20e881700 0 --1- 10.97.206.96:0/201758478 >>
v1:10.97.206.96:7055/17961 conn(0x564582ad5180 0x5645991ca800 :-1
s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2
connect got BADAUTHORIZER
[... the same "connect got BADAUTHORIZER" message repeats every few milliseconds, alternating between two connection objects, until the service is terminated ...]
2019-10-29 17:32:02.064 7fe209fe8700 -1 received signal: Terminated
from /sbin/init (PID: 1) UID: 0
2019-10-29 17:32:02.064 7fe209fe8700 -1 mgr handle_signal *** Got signal
Terminated ***
2019-10-29 17:37:54.319 7f0e26fc1dc0 0 set uid:gid to 64045:64045
(ceph:ceph)
2019-10-29 17:37:54.319 7f0e26fc1dc0 0 ceph version 14.2.4
(65249672c6e6d843510e7e01f8a4b976dcac3db1) nautilus (stable), process
ceph-mgr, pid 250399
2019-10-29 17:37:54.319 7f0e26fc1dc0 0 pidfile_write: ignore empty
--pid-file
2019-10-29 17:37:54.331 7f0e26fc1dc0 1 mgr[py] Loading python module
'ansible'
2019-10-29 17:37:54.503 7f0e26fc1dc0 1 mgr[py] Loading python module
'balancer'
2019-10-29 17:37:54.531 7f0e26fc1dc0 1 mgr[py] Loading python module
'crash'
2019-10-29 17:37:54.551 7f0e26fc1dc0 1 mgr[py] Loading python module
'dashboard'
2019-10-29 17:37:54.915 7f0e26fc1dc0 1 mgr[py] Loading python module
'deepsea'
2019-10-29 17:37:55.071 7f0e26fc1dc0 1 mgr[py] Loading python module
'devicehealth'
2019-10-29 17:37:55.103 7f0e26fc1dc0 1 mgr[py] Loading python module
'influx'
2019-10-29 17:37:55.127 7f0e26fc1dc0 1 mgr[py] Loading python module
'insights'
2019-10-29 17:37:55.207 7f0e26fc1dc0 1 mgr[py] Loading python module
'iostat'
2019-10-29 17:37:55.227 7f0e26fc1dc0 1 mgr[py] Loading python module
'localpool'
2019-10-29 17:37:55.247 7f0e26fc1dc0 1 mgr[py] Loading python module
'orchestrator_cli'
2019-10-29 17:37:55.295 7f0e26fc1dc0 1 mgr[py] Loading python module
'pg_autoscaler'
2019-10-29 17:37:55.347 7f0e26fc1dc0 1 mgr[py] Loading python module
'progress'
2019-10-29 17:37:55.387 7f0e26fc1dc0 1 mgr[py] Loading python module
'prometheus'
2019-10-29 17:37:55.599 7f0e26fc1dc0 1 mgr[py] Loading python module
'rbd_support'
2019-10-29 17:37:55.647 7f0e26fc1dc0 1 mgr[py] Loading python module
'restful'
2019-10-29 17:37:55.959 7f0e26fc1dc0 1 mgr[py] Loading python module
'selftest'
2019-10-29 17:37:55.983 7f0e26fc1dc0 1 mgr[py] Loading python module
'status'
2019-10-29 17:37:56.015 7f0e26fc1dc0 1 mgr[py] Loading python module
'telegraf'
2019-10-29 17:37:56.051 7f0e26fc1dc0 1 mgr[py] Loading python module
'telemetry'
2019-10-29 17:37:56.331 7f0e26fc1dc0 1 mgr[py] Loading python module
'test_orchestrator'
2019-10-29 17:37:56.399 7f0e26fc1dc0 1 mgr[py] Loading python module
'volumes'
2019-10-29 17:37:56.459 7f0e26fc1dc0 1 mgr[py] Loading python module
'zabbix'
2019-10-29 17:37:56.503 7f0e21cdd700 1 mgr load Constructed class from
module: dashboard
2019-10-29 17:37:56.503 7f0e214dc700 0 ms_deliver_dispatch: unhandled
message 0x56346f978400 mon_map magic: 0 v1 from mon.0 v2:10.97.206.93:3300/0
2019-10-29 17:37:56.507 7f0e214dc700 0 client.0 ms_handle_reset on
v2:10.97.206.93:6912/22258
2019-10-29 17:37:56.743 7f0e16363700 0 mgr[dashboard]
[29/Oct/2019:17:37:56] ENGINE Error in HTTPServer.tick
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/__init__.py", line 2021, in start
    self.tick()
  File "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/__init__.py", line 2090, in tick
    s, ssl_env = self.ssl_adapter.wrap(s)
  File "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/ssl_builtin.py", line 67, in wrap
    server_side=True)
  File "/usr/lib/python2.7/ssl.py", line 369, in wrap_socket
    _context=self)
  File "/usr/lib/python2.7/ssl.py", line 599, in __init__
    self.do_handshake()
  File "/usr/lib/python2.7/ssl.py", line 828, in do_handshake
    self._sslobj.do_handshake()
error: [Errno 0] Error
^C
This looks like a severe issue.
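[Editor's note: two things stand out in the log above, and each has a quick check. BADAUTHORIZER during cephx authentication is commonly associated with clock skew between nodes or with a mgr keyring that no longer matches what the monitors hold, and the SSL "[Errno 0] Error" in the dashboard thread indicates a failed TLS handshake. A hedged sketch of checks; the mgr ID ld5508 is taken from the log, and port 8443 is only the dashboard default:]

```shell
# 1) Clock skew: cephx tickets are time-sensitive, so verify NTP sync
#    on every mon/mgr node:
timedatectl status

# 2) Key mismatch: compare the mgr key the monitors hold with the
#    keyring on disk (default keyring path for this mgr ID):
ceph auth get mgr.ld5508
cat /var/lib/ceph/mgr/ceph-ld5508/keyring

# 3) Dashboard TLS: exercise the handshake directly (8443 is the
#    default port; "ceph mgr services" shows the real URL):
echo | openssl s_client -connect ld5508:8443
```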
On 29.10.2019 at 17:22, Bryan Stillwell wrote:
On Oct 29, 2019, at 9:44 AM, Thomas Schneider
<74cmonty(a)gmail.com> wrote:
in my unhealthy cluster I cannot run several ceph osd commands because they hang, e.g.:
ceph osd df
ceph osd pg dump
Also, ceph balancer status hangs.
How can I fix this issue?
Check the status of your ceph-mgr processes (restart
them if needed and check the logs for more details). Those are responsible for handling
those commands in recent releases.
Bryan
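[Editor's note: Bryan's suggestion translates to roughly the following commands. The mgr ID ld5508 and the ceph-mgr@<id> systemd unit naming are assumptions based on the log and typical Nautilus deployments:]

```shell
# Which mgr is active, and are standbys available?
ceph mgr dump | grep -E 'active_name|available'

# Restart the mgr on the affected node and watch its logs:
systemctl restart ceph-mgr@ld5508
journalctl -u ceph-mgr@ld5508 --since "10 min ago"
tail -f /var/log/ceph/ceph-mgr.ld5508.log
```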