Dear gents,
to get familiar with cephadm and its upgrade path (we heavily use old-style
"ceph-deploy" Octopus-based production clusters), we decided to do some
tests with a vanilla cluster running 15.2.11 on CentOS 8 on top of vSphere.
Deployment of the Octopus cluster went very well and we are excited about
this new technique and all its possibilities. No errors, no complaints... :-)
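For reference, the deployment itself was just the standard procedure,
roughly like this (<mon-ip> is a placeholder; the exact OSD spec may have
differed slightly):

  cephadm bootstrap --mon-ip <mon-ip>
  ceph orch host add c0n01      # likewise for c0n02 and c0n03
  ceph orch apply osd --all-available-devices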
Unfortunately, the upgrade to Pacific (16.2.0 or 16.2.1) fails every time,
with both the original Docker images and the quay.ceph.io/ceph-ci/ceph:pacific
image. We use a small setup (3 MONs, 2 MGRs, a few OSDs).
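For completeness, we start the upgrade the usual way, e.g.:

  ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:pacific

(or with --ceph-version 16.2.0 / 16.2.1 for the stock images).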
This is the upgrade behaviour: the upgrade of both MGRs seems to be OK, but
then we get this:
2021-04-29T15:35:19.903111+0200 mgr.c0n00.vnxaqu [DBG] daemon mgr.c0n00.vnxaqu container digest correct
2021-04-29T15:35:19.903206+0200 mgr.c0n00.vnxaqu [DBG] daemon mgr.c0n00.vnxaqu deployed by correct version
2021-04-29T15:35:19.903298+0200 mgr.c0n00.vnxaqu [DBG] daemon mgr.c0n01.gstlmw container digest correct
2021-04-29T15:35:19.903378+0200 mgr.c0n00.vnxaqu [DBG] daemon mgr.c0n01.gstlmw *not deployed by correct version*
After this, the upgrade process gets stuck completely, although the cluster
itself keeps running (minus one monitor daemon):
[root@c0n00 ~]# ceph -s
  cluster:
    id:     5541c866-a8fe-11eb-b604-005056b8f1bf
    health: HEALTH_WARN
            *3 hosts fail cephadm check*

  services:
    mon: 2 daemons, quorum c0n00,c0n02 (age 68m)
    mgr: c0n00.bmtvpr(active, since 68m), standbys: c0n01.jwfuca
    osd: 4 osds: 4 up (since 63m), 4 in (since 62m)
[..]
  progress:
    Upgrade to 16.2.1-257-g717ce59b (0s)
      [=...........................]
[root@c0n00 ~]# ceph orch upgrade status
{
    "target_image": "quay.ceph.io/ceph-ci/ceph@sha256:d0f624287378fe63fc4c30bccc9f82bfe0e42e62381c0a3d0d3d86d985f5d788",
    "in_progress": true,
    "services_complete": [
        "mgr"
    ],
    "progress": "2/19 ceph daemons upgraded",
    "message": "Error: UPGRADE_EXCEPTION: Upgrade: failed due to an unexpected exception"
}
[root@c0n00 ~]# ceph orch ps
NAME                 HOST   PORTS        STATUS           REFRESHED  AGE  VERSION               IMAGE ID      CONTAINER ID
alertmanager.c0n00   c0n00               running (56m)    4m ago     16h  0.20.0                0881eb8f169f  30d9eff06ce2
crash.c0n00          c0n00               running (56m)    4m ago     16h  15.2.11               9d01da634b8f  91d3e4d0e14d
crash.c0n01          c0n01               host is offline  16h ago    16h  15.2.11               9d01da634b8f  0ff4a20021df
crash.c0n02          c0n02               host is offline  16h ago    16h  15.2.11               9d01da634b8f  0253e6bb29a0
crash.c0n03          c0n03               host is offline  16h ago    16h  15.2.11               9d01da634b8f  291ce4f8b854
grafana.c0n00        c0n00               running (56m)    4m ago     16h  6.7.4                 80728b29ad3f  46d77b695da5
mgr.c0n00.bmtvpr     c0n00  *:8443,9283  running (56m)    4m ago     16h  16.2.1-257-g717ce59b  3be927f015dd  94a7008ccb4f
mgr.c0n01.jwfuca     c0n01               host is offline  16h ago    16h  16.2.1-257-g717ce59b  3be927f015dd  766ada65efa9
mon.c0n00            c0n00               running (56m)    4m ago     16h  15.2.11               9d01da634b8f  b9f270cd99e2
mon.c0n02            c0n02               host is offline  16h ago    16h  15.2.11               9d01da634b8f  a90c21bfd49e
node-exporter.c0n00  c0n00               running (56m)    4m ago     16h  0.18.1                e5a616e4b9cf  eb1306811c6c
node-exporter.c0n01  c0n01               host is offline  16h ago    16h  0.18.1                e5a616e4b9cf  093a72542d3e
node-exporter.c0n02  c0n02               host is offline  16h ago    16h  0.18.1                e5a616e4b9cf  785531f5d6cf
node-exporter.c0n03  c0n03               host is offline  16h ago    16h  0.18.1                e5a616e4b9cf  074fac77e17c
osd.0                c0n02               host is offline  16h ago    16h  15.2.11               9d01da634b8f  c075bd047c0a
osd.1                c0n01               host is offline  16h ago    16h  15.2.11               9d01da634b8f  616aeda28504
osd.2                c0n03               host is offline  16h ago    16h  15.2.11               9d01da634b8f  b36453730c83
osd.3                c0n00               running (56m)    4m ago     16h  15.2.11               9d01da634b8f  e043abf53206
prometheus.c0n00     c0n00               running (56m)    4m ago     16h  2.18.1                de242295e225  7cb50c04e26a
After some digging into the daemon logs we found tracebacks (please see
below). We also noticed that we can successfully reach each host via
ssh -F ...!!! We ran tcpdumps while upgrading, and every SYN gets its
SYN/ACK... ;-)
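Concretely, the very check suggested in the error message below works fine
from the active MGR host:

  ceph cephadm get-ssh-config > ssh_config
  ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
  chmod 0600 ~/cephadm_private_key
  ssh -F ssh_config -i ~/cephadm_private_key root@c0n02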
Because we get no errors while deploying a fresh Octopus cluster via cephadm
(fetched from
https://github.com/ceph/ceph/raw/octopus/src/cephadm/cephadm, and "cephadm
prepare-host" is always OK), could it be a missing Python library or
something else that cephadm itself does not check?
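If it helps, this is roughly how we would verify the Python side on each
host (a sketch, assuming the cephadm binary is in $PATH there; hostnames
are from our setup):

  for h in c0n01 c0n02 c0n03; do
      ssh root@$h 'python3 -V; cephadm check-host'   # interpreter version + cephadm host sanity check
  done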
Thank you for any hint.
Christoph Ackermann
Traceback:
Traceback (most recent call last):
  File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line 48, in bootstrap_exec
    s = io.read(1)
  File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 402, in read
    raise EOFError("expected %d bytes, got %d" % (numbytes, len(buf)))
EOFError: expected 1 bytes, got 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1166, in _remote_connection
    conn, connr = self.mgr._get_connection(addr)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1202, in _get_connection
    sudo=True if self.ssh_user != 'root' else False)
  File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 34, in __init__
    self.gateway = self._make_gateway(hostname)
  File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 44, in _make_gateway
    self._make_connection_string(hostname)
  File "/lib/python3.6/site-packages/execnet/multi.py", line 134, in makegateway
    gw = gateway_bootstrap.bootstrap(io, spec)
  File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line 102, in bootstrap
    bootstrap_exec(io, spec)
  File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line 53, in bootstrap_exec
    raise HostNotFound(io.remoteaddress)
execnet.gateway_bootstrap.HostNotFound: -F /tmp/cephadm-conf-61otabz_ -i /tmp/cephadm-identity-rt2nm0t4 root@c0n02

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 73, in do_work
    return f(*arg)
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 60, in create_from_spec_one
    replace_osd_ids=osd_id_claims.get(host, []), env_vars=env_vars
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 75, in create_single_host
    out, err, code = self._run_ceph_volume_command(host, cmd, env_vars=env_vars)
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 295, in _run_ceph_volume_command
    error_ok=True)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1003, in _run_cephadm
    with self._remote_connection(host, addr) as tpl:
  File "/lib64/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1197, in _remote_connection
    raise OrchestratorError(msg) from e
orchestrator._interface.OrchestratorError: Failed to connect to c0n02 (c0n02).
Please make sure that the host is reachable and accepts connections using the cephadm SSH key

To add the cephadm SSH key to the host:
  ceph cephadm get-pub-key > ~/ceph.pub
  ssh-copy-id -f -i ~/ceph.pub root@c0n02

To check that the host is reachable:
  ceph cephadm get-ssh-config > ssh_config
  ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
  chmod 0600 ~/cephadm_private_key
  ssh -F ssh_config -i ~/cephadm_private_key root@c0n02