No, they are stored locally on ESXi datastores on top of hardware RAID5
built with SAS/SATA drives (the hardware differs between hosts).

I've also tried rolling back to the snapshot taken just after all monitors
and OSDs were added to the cluster. The host boots fine and works as it
should, but after the next reboot the problem reappears (no configuration
changes were made in between).

And another thing: even though the docker container for the mgr is running
and shows no errors in its logs, either inside the container or on the
parent host, the mgr doesn't bind to any of the ports it should: 6800,
6801, and 8443 for the dashboard. I'm not sure whether that is the cause
or a consequence of this problem.
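For what it's worth, the cephadm containers use host networking (the
daemons show up as plain host processes in the netstat output below), so
the mgr's listening sockets should be visible directly on the parent host.
A quick check (assuming ss is installed; netstat -npl shows the same
thing) comes back empty for the mgr:

# ss -tlnp | grep ceph-mgr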
Tue, 28 Jul 2020 at 11:37, Anthony D'Atri <anthony.datri(a)gmail.com>:
Are your mon DBs on SSDs?
On Jul 27, 2020, at 7:28 AM, Илья Борисович Волошин <i.voloshin(a)simplesolution.pro> wrote:
Here are all the active ports on mon1 (with the exception of sshd and
ntpd):
# netstat -npl
Proto Recv-Q Send-Q Local Address          Foreign Address  State   PID/Program name
tcp        0      0 <mon1_ip>:3300         0.0.0.0:*        LISTEN  1582/ceph-mon
tcp        0      0 <mon1_ip>:6789         0.0.0.0:*        LISTEN  1582/ceph-mon
tcp6       0      0 :::9093                :::*             LISTEN  908/alertmanager
tcp6       0      0 :::9094                :::*             LISTEN  908/alertmanager
tcp6       0      0 :::9095                :::*             LISTEN  896/prometheus
tcp6       0      0 :::9100                :::*             LISTEN  906/node_exporter
tcp6       0      0 :::3000                :::*             LISTEN  882/grafana-server
udp6       0      0 :::9094                :::*                     908/alertmanager
I've tried telnet from the mon1 host and can connect to 3300 and 6789:
# telnet <mon1_ip> 3300
Trying <mon1_ip>...
Connected to <mon1_ip>.
Escape character is '^]'.
ceph v2
# telnet <mon1_ip> 6789
Trying <mon1_ip>...
Connected to <mon1_ip>.
Escape character is '^]'.
ceph v027QQ
6800 and 6801 refuse connection:
# telnet <mon1_ip> 6800
Trying <mon1_ip>...
telnet: Unable to connect to remote host: Connection refused
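Worth noting: 6800 and 6801 are exactly the ports the mgr should be
binding, so the refusal lines up with the mgr not listening. The mgr can
also be queried locally through its admin socket; by analogy with the mon
socket path quoted further down, I'd expect something like this (the exact
.asok filename is my assumption):

# ceph --admin-daemon /run/ceph/e30397f0-cc32-11ea-8c8e-000c29469cd5/ceph-mgr.mon1.peevkl.asok version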
I don't see any errors in the logs related to failures to bind, and all
Ceph systemd services are running as far as I can tell:
# systemctl list-units -a | grep ceph
ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@alertmanager.mon1.service    loaded active running  Ceph alertmanager.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@crash.mon1.service           loaded active running  Ceph crash.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@grafana.mon1.service         loaded active running  Ceph grafana.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@mgr.mon1.peevkl.service      loaded active running  Ceph mgr.mon1.peevkl for e30397f0-cc32-11ea-8c8e-000c29469cd5
ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@mon.mon1.service             loaded active running  Ceph mon.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@node-exporter.mon1.service   loaded active running  Ceph node-exporter.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@prometheus.mon1.service      loaded active running  Ceph prometheus.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
system-ceph\x2de30397f0\x2dcc32\x2d11ea\x2d8c8e\x2d000c29469cd5.slice  loaded active active   system-ceph\x2de30397f0\x2dcc32\x2d11ea\x2d8c8e\x2d000c29469cd5.slice
ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5.target                       loaded active active   Ceph cluster e30397f0-cc32-11ea-8c8e-000c29469cd5
ceph.target                                                            loaded active active   All Ceph clusters and services
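In case it's useful, the per-daemon logs can be pulled straight from the
journal using the unit names above, e.g. for the mgr (plain journalctl,
nothing cephadm-specific assumed):

# journalctl -u ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@mgr.mon1.peevkl.service -n 50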
Here are the currently running docker containers:
# docker ps
CONTAINER ID  IMAGE                       COMMAND                  CREATED         STATUS         PORTS  NAMES
dfd8dbeccf1e  ceph/ceph:v15               "/usr/bin/ceph-mgr -…"   41 minutes ago  Up 41 minutes         ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-mgr.mon1.peevkl
9452d1db7ffb  ceph/ceph:v15               "/usr/bin/ceph-mon -…"   3 hours ago     Up 3 hours            ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-mon.mon1
703ec4a43824  prom/prometheus:v2.18.1     "/bin/prometheus --c…"   3 hours ago     Up 3 hours            ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-prometheus.mon1
d816ec5e645f  ceph/ceph:v15               "/usr/bin/ceph-crash…"   3 hours ago     Up 3 hours            ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-crash.mon1
38d283ba6424  ceph/ceph-grafana:latest    "/bin/sh -c 'grafana…"   3 hours ago     Up 3 hours            ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-grafana.mon1
cc119ec8f09a  prom/node-exporter:v0.18.1  "/bin/node_exporter …"   3 hours ago     Up 3 hours            ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-node-exporter.mon1
aa1d339c4100  prom/alertmanager:v0.20.0   "/bin/alertmanager -…"   3 hours ago     Up 3 hours            ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-alertmanager.mon1
iptables is active. I tried setting all chain policies to ACCEPT (it
didn't help); the relevant rules are:

    0     0 CEPH  tcp  --  *  *  0.0.0.0/0  0.0.0.0/0  tcp dpt:6789
 5060  303K CEPH  tcp  --  *  *  0.0.0.0/0  0.0.0.0/0  multiport dports 6800:7300

The CEPH chain contains the addresses of the monitors and OSDs.
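For reference, the counters above come from the usual verbose listing, and
the CEPH chain itself can be dumped the same way (standard iptables flags,
nothing exotic):

# iptables -L INPUT -v -n
# iptables -L CEPH -v -n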
Mon, 27 Jul 2020 at 17:07, Dino Godor <dg(a)terralink.de>:
> Hi,
>
> have you tried to locally connect to the ports with netcat (or telnet)?
>
> Is the process listening? (something like netstat -4ln or the current
> equivalent thereof)
>
> Is the old (new) firewall maybe still running?
>
>
> On 27.07.20 16:00, Илья Борисович Волошин wrote:
>> Hello,
>>
>> I've created an Octopus 15.2.4 cluster with 3 monitors and 3 OSDs (6 hosts
>> in total, all ESXi VMs). It lived through a couple of reboots without
>> problem, then I reconfigured the main host a bit: I set iptables-legacy as
>> the current option in update-alternatives (this is a Debian 10 system),
>> applied a basic iptables ruleset, and restarted docker.
>>
>> After that the cluster became unresponsive (any ceph command hangs
>> indefinitely). I can still use the admin socket to manipulate the config,
>> though. Setting debug_ms to 5, I see this in the logs (timestamps cut
>> for readability):
>>
>> 7f4096f41700  5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> [v2:<mon2_ip>:3300/0,v1:<mon2_ip>:6789/0] conn(0x55c21b975800 0x55c21ab45180 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=0 rx=0 tx=0).send_message enqueueing message m=0x55c21bd84a00 type=67 mon_probe(probe e30397f0-cc32-11ea-8c8e-000c29469cd5 name mon1 mon_release octopus) v7
>> 7f4098744700  1 -- >> [v2:<mon1_ip>:6800/561959008,v1:<mon1_ip>:6801/561959008] conn(0x55c21b974400 msgr2=0x55c21ab45600 unknown :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed to v2:81.200.2.152:6800/561959008
>> 7f4098744700  2 -- >> [v2:<mon1_ip>:6800/561959008,v1:<mon1_ip>:6801/561959008] conn(0x55c21b974400 msgr2=0x55c21ab45600 unknown :-1 s=STATE_CONNECTING_RE l=0).process connection refused!
>>
>> and this:
>>
>> 7f4098744700  2 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rx=0 tx=0)._fault on lossy channel, failing
>> 7f4098744700  1 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rx=0 tx=0).stop
>> 7f4098744700  5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rx=0 tx=0).reset_recv_state
>> 7f4098744700  5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rx=0 tx=0).reset_security
>> 7f409373a700  1 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 tx=0).accept
>> 7f4098744700  1 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
>> 7f4098744700  5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=HELLO_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello received hello: peer_type=8 peer_addr_for_me=v2:<mon1_ip>:3300/0
>> 7f4098744700  5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=HELLO_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello getsockname says I am <mon1_ip>:3300 when talking to v2:<mon1_ip>:49012/0
>> 7f4098744700  1 mon.mon1@0(probing) e5 handle_auth_request failed to assign global_id
>>
>> Config (the result of ceph --admin-daemon
>> /run/ceph/e30397f0-cc32-11ea-8c8e-000c29469cd5/ceph-mon.mon1.asok config show):
>>
>> https://pastebin.com/kifMXs9H
>>
>> I can connect to ports 3300 and 6789 with telnet; 6800 and 6801 return
>> 'process connection refused'
>>
>> Setting all iptables policies to ACCEPT didn't change anything.
>>
>> Where should I start digging to fix this problem? I'd like to at least
>> understand why this happened before putting the cluster into production.
>> Any help is appreciated.
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io