No, they are stored locally on ESXi datastores on top of hardware RAID5
built with SAS/SATA drives (the hardware differs between hosts).

I've also tried rolling back to the snapshot taken just after all monitors
and OSDs were added to the cluster. The host boots fine and works as it
should, but after the next reboot the problem reappears (no configuration
changes were made in between).

And another thing: even though the docker container for the mgr is running
and shows no errors in its logs, either inside the container or on the
parent host, the mgr doesn't bind to any of the ports it should: 6800,
6801, and 8443 for the dashboard. I'm not sure whether that is the cause
or a consequence of this problem.
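For what it's worth, the cephadm containers use host networking (the
daemons show up as plain host processes in the netstat output below), so
the mgr's listening sockets should be visible directly on the parent host.
A quick check (assuming ss is installed; netstat -npl shows the same
thing) comes back empty for the mgr:

# ss -tlnp | grep ceph-mgr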
Tue, 28 Jul 2020 at 11:37, Anthony D'Atri <anthony.datri(a)gmail.com>:
Are your mon DBs on SSDs?
On Jul 27, 2020, at 7:28 AM, Илья Борисович Волошин <i.voloshin(a)simplesolution.pro> wrote:
Here are all the active ports on mon1 (with the exception of sshd and
ntpd):
# netstat -npl
Proto Recv-Q Send-Q Local Address          Foreign Address  State   PID/Program name
tcp        0      0 <mon1_ip>:3300         0.0.0.0:*        LISTEN  1582/ceph-mon
tcp        0      0 <mon1_ip>:6789         0.0.0.0:*        LISTEN  1582/ceph-mon
tcp6       0      0 :::9093                :::*             LISTEN  908/alertmanager
tcp6       0      0 :::9094                :::*             LISTEN  908/alertmanager
tcp6       0      0 :::9095                :::*             LISTEN  896/prometheus
tcp6       0      0 :::9100                :::*             LISTEN  906/node_exporter
tcp6       0      0 :::3000                :::*             LISTEN  882/grafana-server
udp6       0      0 :::9094                :::*                     908/alertmanager
I've tried telnet from the mon1 host and can connect to 3300 and 6789:
# telnet <mon1_ip> 3300
Trying <mon1_ip>...
Connected to <mon1_ip>.
Escape character is '^]'.
ceph v2
# telnet <mon1_ip> 6789
Trying <mon1_ip>...
Connected to <mon1_ip>.
Escape character is '^]'.
ceph v027QQ
6800 and 6801 refuse connection:
# telnet <mon1_ip> 6800
Trying <mon1_ip>...
telnet: Unable to connect to remote host: Connection refused
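Worth noting: 6800 and 6801 are exactly the ports the mgr should be
binding, so the refusal lines up with the mgr not listening. The mgr can
also be queried locally through its admin socket; by analogy with the mon
socket path quoted further down, I'd expect something like this (the exact
.asok filename is my assumption):

# ceph --admin-daemon /run/ceph/e30397f0-cc32-11ea-8c8e-000c29469cd5/ceph-mgr.mon1.peevkl.asok version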
I don't see any errors in the logs related to failures to bind, and all
Ceph systemd services are running as far as I can tell:
# systemctl list-units -a | grep ceph
ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@alertmanager.mon1.service    loaded active running  Ceph alertmanager.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@crash.mon1.service           loaded active running  Ceph crash.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@grafana.mon1.service         loaded active running  Ceph grafana.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@mgr.mon1.peevkl.service      loaded active running  Ceph mgr.mon1.peevkl for e30397f0-cc32-11ea-8c8e-000c29469cd5
ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@mon.mon1.service             loaded active running  Ceph mon.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@node-exporter.mon1.service   loaded active running  Ceph node-exporter.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@prometheus.mon1.service      loaded active running  Ceph prometheus.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
system-ceph\x2de30397f0\x2dcc32\x2d11ea\x2d8c8e\x2d000c29469cd5.slice  loaded active active   system-ceph\x2de30397f0\x2dcc32\x2d11ea\x2d8c8e\x2d000c29469cd5.slice
ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5.target                       loaded active active   Ceph cluster e30397f0-cc32-11ea-8c8e-000c29469cd5
ceph.target                                                            loaded active active   All Ceph clusters and services
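In case it's useful, the per-daemon logs can be pulled straight from the
journal using the unit names above, e.g. for the mgr (plain journalctl,
nothing cephadm-specific assumed):

# journalctl -u ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@mgr.mon1.peevkl.service -n 50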
Here are the currently running docker containers:
# docker ps
CONTAINER ID  IMAGE                       COMMAND                  CREATED         STATUS         PORTS  NAMES
dfd8dbeccf1e  ceph/ceph:v15               "/usr/bin/ceph-mgr -…"   41 minutes ago  Up 41 minutes         ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-mgr.mon1.peevkl
9452d1db7ffb  ceph/ceph:v15               "/usr/bin/ceph-mon -…"   3 hours ago     Up 3 hours            ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-mon.mon1
703ec4a43824  prom/prometheus:v2.18.1     "/bin/prometheus --c…"   3 hours ago     Up 3 hours            ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-prometheus.mon1
d816ec5e645f  ceph/ceph:v15               "/usr/bin/ceph-crash…"   3 hours ago     Up 3 hours            ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-crash.mon1
38d283ba6424  ceph/ceph-grafana:latest    "/bin/sh -c 'grafana…"   3 hours ago     Up 3 hours            ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-grafana.mon1
cc119ec8f09a  prom/node-exporter:v0.18.1  "/bin/node_exporter …"   3 hours ago     Up 3 hours            ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-node-exporter.mon1
aa1d339c4100  prom/alertmanager:v0.20.0   "/bin/alertmanager -…"   3 hours ago     Up 3 hours            ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-alertmanager.mon1
iptables is active. I tried setting all chain policies to ACCEPT (it
didn't help); the relevant rules are:

    0     0 CEPH  tcp  --  *  *  0.0.0.0/0  0.0.0.0/0  tcp dpt:6789
 5060  303K CEPH  tcp  --  *  *  0.0.0.0/0  0.0.0.0/0  multiport dports 6800:7300

The CEPH chain contains the addresses of the monitors and OSDs.
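For reference, the counters above come from the usual verbose listing, and
the CEPH chain itself can be dumped the same way (standard iptables flags,
nothing exotic):

# iptables -L INPUT -v -n
# iptables -L CEPH -v -n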
Mon, 27 Jul 2020 at 17:07, Dino Godor <dg(a)terralink.de>:
> Hi,
>
> have you tried to locally connect to the ports with netcat (or telnet)?
>
> Is the process listening? (something like netstat -4ln or the current
> equivalent thereof)
>
> Is the old (new) firewall maybe still running?
>
>
> On 27.07.20 16:00, Илья Борисович Волошин wrote:
>> Hello,
>>
>> I've created an Octopus 15.2.4 cluster with 3 monitors and 3 OSDs (6 hosts
>> in total, all ESXi VMs). It lived through a couple of reboots without
>> problem, then I reconfigured the main host a bit: I set iptables-legacy as
>> the current option in update-alternatives (this is a Debian 10 system),
>> applied a basic iptables ruleset, and restarted docker.
>>
>> After that the cluster became unresponsive (any ceph command hangs
>> indefinitely). I can still use the admin socket to manipulate the config,
>> though. Setting debug_ms to 5, I see this in the logs (timestamps cut
>> for readability):
>>
>> 7f4096f41700  5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> [v2:<mon2_ip>:3300/0,v1:<mon2_ip>:6789/0] conn(0x55c21b975800 0x55c21ab45180 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=0 rx=0 tx=0).send_message enqueueing message m=0x55c21bd84a00 type=67 mon_probe(probe e30397f0-cc32-11ea-8c8e-000c29469cd5 name mon1 mon_release octopus) v7
>> 7f4098744700  1 -- >> [v2:<mon1_ip>:6800/561959008,v1:<mon1_ip>:6801/561959008] conn(0x55c21b974400 msgr2=0x55c21ab45600 unknown :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed to v2:81.200.2.152:6800/561959008
>> 7f4098744700  2 -- >> [v2:<mon1_ip>:6800/561959008,v1:<mon1_ip>:6801/561959008] conn(0x55c21b974400 msgr2=0x55c21ab45600 unknown :-1 s=STATE_CONNECTING_RE l=0).process connection refused!
>>
>> and this:
>>
>> 7f4098744700  2 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rx=0 tx=0)._fault on lossy channel, failing
>> 7f4098744700  1 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rx=0 tx=0).stop
>> 7f4098744700  5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rx=0 tx=0).reset_recv_state
>> 7f4098744700  5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rx=0 tx=0).reset_security
>> 7f409373a700  1 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 tx=0).accept
>> 7f4098744700  1 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
>> 7f4098744700  5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=HELLO_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello received hello: peer_type=8 peer_addr_for_me=v2:<mon1_ip>:3300/0
>> 7f4098744700  5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=HELLO_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello getsockname says I am <mon1_ip>:3300 when talking to v2:<mon1_ip>:49012/0
>> 7f4098744700  1 mon.mon1@0(probing) e5 handle_auth_request failed to assign global_id
>>
>> Config (the result of ceph --admin-daemon
>> /run/ceph/e30397f0-cc32-11ea-8c8e-000c29469cd5/ceph-mon.mon1.asok config show):
>>
>> https://pastebin.com/kifMXs9H
>>
>> I can connect to ports 3300 and 6789 with telnet; 6800 and 6801 return
>> 'process connection refused'
>>
>> Setting all iptables policies to ACCEPT didn't change anything.
>>
>> Where should I start digging to fix this problem? I'd like to at least
>> understand why this happened before putting the cluster into production.
>> Any help is appreciated.
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io