Dear gents,
to get familiar with cephadm and its upgrade path (we heavily use old-style
"ceph-deploy" Octopus-based production clusters), we decided to do some
tests with a vanilla cluster running 15.2.11 on CentOS 8 on top of vSphere.
Deployment of the Octopus cluster went very well and we are excited about
this new technique and all its possibilities. No errors, no complaints... :-)
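For reference, the deployment itself was just the standard procedure,
roughly like this (<mon-ip> is a placeholder; the exact OSD spec may have
differed slightly):

  cephadm bootstrap --mon-ip <mon-ip>
  ceph orch host add c0n01      # likewise for c0n02 and c0n03
  ceph orch apply osd --all-available-devices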
Unfortunately, the upgrade to Pacific (16.2.0 or 16.2.1) fails every time,
with both the original Docker images and the quay.ceph.io/ceph-ci/ceph:pacific
image. We use a small setup (3 MONs, 2 MGRs, a few OSDs).
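For completeness, we start the upgrade the usual way, e.g.:

  ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:pacific

(or with --ceph-version 16.2.0 / 16.2.1 for the stock images).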
This is the upgrade behaviour: the upgrade of both MGRs seems to be OK, but
then we get this:
2021-04-29T15:35:19.903111+0200 mgr.c0n00.vnxaqu [DBG] daemon mgr.c0n00.vnxaqu container digest correct
2021-04-29T15:35:19.903206+0200 mgr.c0n00.vnxaqu [DBG] daemon mgr.c0n00.vnxaqu deployed by correct version
2021-04-29T15:35:19.903298+0200 mgr.c0n00.vnxaqu [DBG] daemon mgr.c0n01.gstlmw container digest correct
2021-04-29T15:35:19.903378+0200 mgr.c0n00.vnxaqu [DBG] daemon mgr.c0n01.gstlmw *not deployed by correct version*
After this, the upgrade process gets stuck completely, although the cluster
itself keeps running (minus one monitor daemon):
[root@c0n00 ~]# ceph -s
  cluster:
    id:     5541c866-a8fe-11eb-b604-005056b8f1bf
    health: HEALTH_WARN
            *3 hosts fail cephadm check*

  services:
    mon: 2 daemons, quorum c0n00,c0n02 (age 68m)
    mgr: c0n00.bmtvpr(active, since 68m), standbys: c0n01.jwfuca
    osd: 4 osds: 4 up (since 63m), 4 in (since 62m)
[..]
  progress:
    Upgrade to 16.2.1-257-g717ce59b (0s)
      [=...........................]
[root@c0n00 ~]# ceph orch upgrade status
{
    "target_image": "quay.ceph.io/ceph-ci/ceph@sha256:d0f624287378fe63fc4c30bccc9f82bfe0e42e62381c0a3d0d3d86d985f5d788",
    "in_progress": true,
    "services_complete": [
        "mgr"
    ],
    "progress": "2/19 ceph daemons upgraded",
    "message": "Error: UPGRADE_EXCEPTION: Upgrade: failed due to an unexpected exception"
}
[root@c0n00 ~]# ceph orch ps
NAME                 HOST   PORTS        STATUS           REFRESHED  AGE  VERSION               IMAGE ID      CONTAINER ID
alertmanager.c0n00   c0n00               running (56m)    4m ago     16h  0.20.0                0881eb8f169f  30d9eff06ce2
crash.c0n00          c0n00               running (56m)    4m ago     16h  15.2.11               9d01da634b8f  91d3e4d0e14d
crash.c0n01          c0n01               host is offline  16h ago    16h  15.2.11               9d01da634b8f  0ff4a20021df
crash.c0n02          c0n02               host is offline  16h ago    16h  15.2.11               9d01da634b8f  0253e6bb29a0
crash.c0n03          c0n03               host is offline  16h ago    16h  15.2.11               9d01da634b8f  291ce4f8b854
grafana.c0n00        c0n00               running (56m)    4m ago     16h  6.7.4                 80728b29ad3f  46d77b695da5
mgr.c0n00.bmtvpr     c0n00  *:8443,9283  running (56m)    4m ago     16h  16.2.1-257-g717ce59b  3be927f015dd  94a7008ccb4f
mgr.c0n01.jwfuca     c0n01               host is offline  16h ago    16h  16.2.1-257-g717ce59b  3be927f015dd  766ada65efa9
mon.c0n00            c0n00               running (56m)    4m ago     16h  15.2.11               9d01da634b8f  b9f270cd99e2
mon.c0n02            c0n02               host is offline  16h ago    16h  15.2.11               9d01da634b8f  a90c21bfd49e
node-exporter.c0n00  c0n00               running (56m)    4m ago     16h  0.18.1                e5a616e4b9cf  eb1306811c6c
node-exporter.c0n01  c0n01               host is offline  16h ago    16h  0.18.1                e5a616e4b9cf  093a72542d3e
node-exporter.c0n02  c0n02               host is offline  16h ago    16h  0.18.1                e5a616e4b9cf  785531f5d6cf
node-exporter.c0n03  c0n03               host is offline  16h ago    16h  0.18.1                e5a616e4b9cf  074fac77e17c
osd.0                c0n02               host is offline  16h ago    16h  15.2.11               9d01da634b8f  c075bd047c0a
osd.1                c0n01               host is offline  16h ago    16h  15.2.11               9d01da634b8f  616aeda28504
osd.2                c0n03               host is offline  16h ago    16h  15.2.11               9d01da634b8f  b36453730c83
osd.3                c0n00               running (56m)    4m ago     16h  15.2.11               9d01da634b8f  e043abf53206
prometheus.c0n00     c0n00               running (56m)    4m ago     16h  2.18.1                de242295e225  7cb50c04e26a
After some digging into the daemon logs we found tracebacks (please see
below). We also noticed that we can successfully reach each host via
ssh -F ...!!! We ran tcpdumps while upgrading, and every SYN gets its
SYN/ACK... ;-)
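Concretely, the very check suggested in the error message below works fine
from the active MGR host:

  ceph cephadm get-ssh-config > ssh_config
  ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
  chmod 0600 ~/cephadm_private_key
  ssh -F ssh_config -i ~/cephadm_private_key root@c0n02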
Because we get no errors while deploying a fresh Octopus cluster via cephadm
(fetched from
https://github.com/ceph/ceph/raw/octopus/src/cephadm/cephadm, and "cephadm
prepare-host" is always OK), could it be a missing Python library or
something else that cephadm itself does not check?
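If it helps, this is roughly how we would verify the Python side on each
host (a sketch, assuming the cephadm binary is in $PATH there; hostnames
are from our setup):

  for h in c0n01 c0n02 c0n03; do
      ssh root@$h 'python3 -V; cephadm check-host'   # interpreter version + cephadm host sanity check
  done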
Thank you for any hint.
Christoph Ackermann
Traceback:
Traceback (most recent call last):
  File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line 48, in bootstrap_exec
    s = io.read(1)
  File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 402, in read
    raise EOFError("expected %d bytes, got %d" % (numbytes, len(buf)))
EOFError: expected 1 bytes, got 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1166, in _remote_connection
    conn, connr = self.mgr._get_connection(addr)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1202, in _get_connection
    sudo=True if self.ssh_user != 'root' else False)
  File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 34, in __init__
    self.gateway = self._make_gateway(hostname)
  File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 44, in _make_gateway
    self._make_connection_string(hostname)
  File "/lib/python3.6/site-packages/execnet/multi.py", line 134, in makegateway
    gw = gateway_bootstrap.bootstrap(io, spec)
  File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line 102, in bootstrap
    bootstrap_exec(io, spec)
  File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line 53, in bootstrap_exec
    raise HostNotFound(io.remoteaddress)
execnet.gateway_bootstrap.HostNotFound: -F /tmp/cephadm-conf-61otabz_ -i /tmp/cephadm-identity-rt2nm0t4 root@c0n02

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 73, in do_work
    return f(*arg)
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 60, in create_from_spec_one
    replace_osd_ids=osd_id_claims.get(host, []), env_vars=env_vars
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 75, in create_single_host
    out, err, code = self._run_ceph_volume_command(host, cmd, env_vars=env_vars)
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 295, in _run_ceph_volume_command
    error_ok=True)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1003, in _run_cephadm
    with self._remote_connection(host, addr) as tpl:
  File "/lib64/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1197, in _remote_connection
    raise OrchestratorError(msg) from e
orchestrator._interface.OrchestratorError: Failed to connect to c0n02 (c0n02).
Please make sure that the host is reachable and accepts connections using the cephadm SSH key

To add the cephadm SSH key to the host:
  ceph cephadm get-pub-key > ~/ceph.pub
  ssh-copy-id -f -i ~/ceph.pub root@c0n02

To check that the host is reachable:
  ceph cephadm get-ssh-config > ssh_config
  ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
  chmod 0600 ~/cephadm_private_key
  ssh -F ssh_config -i ~/cephadm_private_key root@c0n02