For setting the user, the `ceph cephadm set-user` command should do it. I'm a
bit surprised by the second part of that, though. With passwordless sudo access
I would have expected that to start working.
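A minimal sketch of that switch, assuming you want cephadm connecting as root going forward (host1 stands in for each managed host):

```shell
# Point cephadm at root so no sudo step is needed on the remote side.
ceph cephadm set-user root

# Install the cluster's public key for root on every managed host.
ceph cephadm get-pub-key > ~/ceph.pub
ssh-copy-id -f -i ~/ceph.pub root@host1    # repeat for each host

# Verify cephadm can now reach the host under the new user.
ceph cephadm check-host host1
```

These commands need a live cluster and reachable hosts; run the check against every host cephadm manages before resuming the upgrade.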
On Thu, May 4, 2023 at 11:27 AM Reza Bakhshayeshi <reza.b2008(a)gmail.com>
wrote:
Thank you.
I don't see any errors other than:
2023-05-04T15:07:38.003+0000 7ff96cbe0700 0 log_channel(cephadm) log
[DBG] : Running command: sudo which python3
2023-05-04T15:07:38.025+0000 7ff96cbe0700 0 log_channel(cephadm) log
[DBG] : Connection to host1 failed. Process exited with non-zero exit
status 3
2023-05-04T15:07:38.025+0000 7ff96cbe0700 0 log_channel(cephadm) log
[DBG] : _reset_con close host1
What is the best way to safely change the cephadm user to root for the
existing cluster? It seems "ceph cephadm set-ssh-config" is not effective.
(BTW, my cephadmin user can now run "sudo which python3" on the other hosts
without being prompted for a password, but nothing has been solved.)
Best regards,
Reza
On Tue, 2 May 2023 at 19:00, Adam King <adking(a)redhat.com> wrote:
> The number of mgr daemons thing is expected. The way it works is it first
> upgrades all the standby mgrs (which will be all but one) and then fails
> over so the previously active mgr can be upgraded as well. After that
> failover is when it's first actually running the newer cephadm code, which
> is when you're hitting this issue. Are the logs still saying something
> similar about how "sudo which python3" is failing? I'm thinking this
> might just be a general issue with the user being used not having
> passwordless sudo access, which sort of accidentally worked in pacific but
> no longer works in quincy. If the log lines confirm the same, we
> might have to work on something in order to handle this case (making the
> sudo optional somehow). As mentioned in the previous email, that setup
> wasn't intended to be supported even in pacific, although if it did work,
> we could bring something in to make it usable in quincy onward as well.
>
> On Tue, May 2, 2023 at 10:58 AM Reza Bakhshayeshi <reza.b2008(a)gmail.com>
> wrote:
>
>> Hi Adam,
>>
>> I'm still struggling with this issue. I also checked it one more time
>> with newer versions, upgrading the cluster from 16.2.11 to 16.2.12 was
>> successful but from 16.2.12 to 17.2.6 failed again with the same ssh errors
>> (I checked
>> https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#ssh-errors a
>> couple of times and all keys/access are fine).
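Following that troubleshooting page, the mgr's exact connection can be replayed by hand to see the same failure outside cephadm; a sketch (host2 and the cephadmin user are stand-ins for your own values):

```shell
# Grab the ssh config and identity key the active mgr actually uses.
ceph cephadm get-ssh-config > /tmp/cephadm-ssh-config
ceph config-key get mgr/cephadm/ssh_identity_key > /tmp/cephadm-key
chmod 0600 /tmp/cephadm-key

# Replay the failing step, including the sudo invocation quincy adds.
ssh -F /tmp/cephadm-ssh-config -i /tmp/cephadm-key cephadmin@host2 sudo which python3
echo "exit status: $?"
```

If the plain ssh succeeds but the `sudo which python3` step returns non-zero, that points at sudo policy rather than keys or networking.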
>>
>> [root@host1 ~]# ceph health detail
>> HEALTH_ERR Upgrade: Failed to connect to host host2 at addr (x.x.x.x)
>> [ERR] UPGRADE_OFFLINE_HOST: Upgrade: Failed to connect to host host2 at
>> addr (x.x.x.x)
>> SSH connection failed to host2 at addr (x.x.x.x): Host(s) were
>> marked offline: {'host2', 'host6', 'host9', 'host4', 'host3', 'host5',
>> 'host1', 'host7', 'host8'}
>>
>> The interesting thing is that always (total number of mgrs) - 1 get
>> upgraded: if I provision 5 MGRs then 4 of them, and with 3 MGRs, 2 of them!
>>
>> Since I'm in an internal environment, I also checked the process with
>> the Quincy cephadm binary. FYI, I'm using stretch mode on this cluster.
>>
>> I don't understand why Quincy MGRs cannot ssh into Pacific nodes, if you
>> have any more hints I would be really glad to hear.
>>
>> Best regards,
>> Reza
>>
>>
>>
>> On Wed, 12 Apr 2023 at 17:18, Adam King <adking(a)redhat.com> wrote:
>>
>>> Ah, okay. Someone else had opened an issue about the same thing after
>>> the 17.2.5 release I believe. It's changed in 17.2.6 at least to only use
>>> sudo for non-root users
>>> https://github.com/ceph/ceph/blob/v17.2.6/src/pybind/mgr/cephadm/ssh.py#L14….
>>> But it looks like you're also using a non-root user anyway. We've required
>>> passwordless sudo access for custom ssh users for a long time I think (e.g.
>>> it's in pacific docs
>>> https://docs.ceph.com/en/pacific/cephadm/install/#further-information-about…,
>>> see the point on "--ssh-user"). Did this actually work for you before in
>>> pacific with a non-root user that doesn't have sudo privileges? I had
>>> assumed that had never worked.
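For reference, the passwordless sudo requirement described in those docs usually comes down to a sudoers fragment like the following (the cephadmin user name and the blanket NOPASSWD scope are illustrative; tighten to your policy):

```
# /etc/sudoers.d/cephadmin -- illustrative fragment; validate with visudo -cf before installing
cephadmin ALL=(ALL) NOPASSWD: ALL
```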
>>>
>>> On Wed, Apr 12, 2023 at 10:38 AM Reza Bakhshayeshi <
>>> reza.b2008(a)gmail.com> wrote:
>>>
>>>> Thank you Adam for your response,
>>>>
>>>> I tried all your comments and the troubleshooting link you sent. From
>>>> the Quincy mgrs containers, they can ssh into all other Pacific nodes
>>>> successfully by running the exact command in the log output and vice versa.
>>>>
>>>> Here are some debug logs from the cephadm while updating:
>>>>
>>>> 2023-04-12T11:35:56.260958+0000 mgr.host8.jukgqm (mgr.4468627) 103 :
>>>> cephadm [DBG] Opening connection to cephadmin(a)x.x.x.x with ssh
>>>> options '-F /tmp/cephadm-conf-2bbfubub -i /tmp/cephadm-identity-7x2m8gvr'
>>>> 2023-04-12T11:35:56.525091+0000 mgr.host8.jukgqm (mgr.4468627) 144 :
>>>> cephadm [DBG] _run_cephadm : command = ls
>>>> 2023-04-12T11:35:56.525406+0000 mgr.host8.jukgqm (mgr.4468627) 145 :
>>>> cephadm [DBG] _run_cephadm : args = []
>>>> 2023-04-12T11:35:56.525571+0000 mgr.host8.jukgqm (mgr.4468627) 146 :
>>>> cephadm [DBG] mon container image my-private-repo/quay-io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add
>>>> 2023-04-12T11:35:56.525619+0000 mgr.host8.jukgqm (mgr.4468627) 147 :
>>>> cephadm [DBG] args: --image my-private-repo/quay-io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add ls
>>>> 2023-04-12T11:35:56.525738+0000 mgr.host8.jukgqm (mgr.4468627) 148 :
>>>> cephadm [DBG] Running command: sudo which python3
>>>> 2023-04-12T11:35:56.534227+0000 mgr.host8.jukgqm (mgr.4468627) 149 :
>>>> cephadm [DBG] Connection to host1 failed. Process exited with non-zero
>>>> exit status 3
>>>> 2023-04-12T11:35:56.534275+0000 mgr.host8.jukgqm (mgr.4468627) 150 :
>>>> cephadm [DBG] _reset_con close host1
>>>> 2023-04-12T11:35:56.540135+0000 mgr.host8.jukgqm (mgr.4468627) 158 :
>>>> cephadm [DBG] Host "host1" marked as offline. Skipping gather facts refresh
>>>> 2023-04-12T11:35:56.540178+0000 mgr.host8.jukgqm (mgr.4468627) 159 :
>>>> cephadm [DBG] Host "host1" marked as offline. Skipping network refresh
>>>> 2023-04-12T11:35:56.540408+0000 mgr.host8.jukgqm (mgr.4468627) 160 :
>>>> cephadm [DBG] Host "host1" marked as offline. Skipping device refresh
>>>> 2023-04-12T11:35:56.540490+0000 mgr.host8.jukgqm (mgr.4468627) 161 :
>>>> cephadm [DBG] Host "host1" marked as offline. Skipping osdspec preview refresh
>>>> 2023-04-12T11:35:56.540527+0000 mgr.host8.jukgqm (mgr.4468627) 162 :
>>>> cephadm [DBG] Host "host1" marked as offline. Skipping autotune
>>>> 2023-04-12T11:35:56.540978+0000 mgr.host8.jukgqm (mgr.4468627) 163 :
>>>> cephadm [DBG] Connection to host1 failed. Process exited with non-zero
>>>> exit status 3
>>>> 2023-04-12T11:35:56.796966+0000 mgr.host8.jukgqm (mgr.4468627) 728 :
>>>> cephadm [ERR] Upgrade: Paused due to UPGRADE_OFFLINE_HOST: Upgrade: Failed
>>>> to connect to host host1 at addr (x.x.x.x)
>>>>
>>>> As I can see here, it turns out sudo is added to the code to be able
>>>> to continue:
>>>>
>>>> https://github.com/ceph/ceph/blob/v17.2.5/src/pybind/mgr/cephadm/ssh.py#L143
>>>>
>>>> I cannot grant the cephadmin user sudo privileges for policy reasons;
>>>> could this be the root cause of the issue?
>>>>
>>>> Best regards,
>>>> Reza
>>>>
>>>> On Thu, 6 Apr 2023 at 14:59, Adam King <adking(a)redhat.com> wrote:
>>>>
>>>>> Does "ceph health detail" give any insight into what the unexpected
>>>>> exception was? If not, I'm pretty confident some traceback would end up
>>>>> being logged. Could maybe still grab it with "ceph log last 200 info
>>>>> cephadm" if not a lot else has happened. Also, probably need to find out if
>>>>> the check-host is failing due to the check on the host actually failing or
>>>>> failing to connect to the host. Could try putting a copy of the cephadm
>>>>> binary on one and running "cephadm check-host --expect-hostname <hostname>"
>>>>> where the hostname is the name cephadm knows the host by. If that's not an
>>>>> issue I'd expect it's a connection thing. Could maybe try going through
>>>>> https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#ssh-errors.
>>>>> Cephadm changed the backend ssh library from pacific to quincy due to the
>>>>> one used in pacific no longer being supported so it's possible some general
>>>>> ssh error has popped up in your env as a result.
>>>>>
>>>>> On Thu, Apr 6, 2023 at 8:38 AM Reza Bakhshayeshi <
>>>>> reza.b2008(a)gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have a problem regarding upgrading a Ceph cluster from Pacific to
>>>>>> Quincy with cephadm. I have successfully upgraded the cluster to the
>>>>>> latest Pacific (16.2.11). But when I run the following command to
>>>>>> upgrade the cluster to 17.2.5, after upgrading 3/4 mgrs, the upgrade
>>>>>> process stops with "Unexpected error". (Everything is on a private
>>>>>> network.)
>>>>>>
>>>>>> ceph orch upgrade start my-private-repo/quay-io/ceph/ceph:v17.2.5
>>>>>>
>>>>>> I also tried the 17.2.4 version.
>>>>>>
>>>>>> cephadm fails to check the hosts' status and marks them as offline:
>>>>>>
>>>>>> cephadm 2023-04-06T10:19:59.998510+0000 mgr.host9.arhpnd (mgr.4516356) 5782
>>>>>> : cephadm [DBG] host host4 (x.x.x.x) failed check
>>>>>> cephadm 2023-04-06T10:19:59.998553+0000 mgr.host9.arhpnd (mgr.4516356) 5783
>>>>>> : cephadm [DBG] Host "host4" marked as offline. Skipping daemon refresh
>>>>>> cephadm 2023-04-06T10:19:59.998581+0000 mgr.host9.arhpnd (mgr.4516356) 5784
>>>>>> : cephadm [DBG] Host "host4" marked as offline. Skipping gather facts refresh
>>>>>> cephadm 2023-04-06T10:19:59.998609+0000 mgr.host9.arhpnd (mgr.4516356) 5785
>>>>>> : cephadm [DBG] Host "host4" marked as offline. Skipping network refresh
>>>>>> cephadm 2023-04-06T10:19:59.998633+0000 mgr.host9.arhpnd (mgr.4516356) 5786
>>>>>> : cephadm [DBG] Host "host4" marked as offline. Skipping device refresh
>>>>>> cephadm 2023-04-06T10:19:59.998659+0000 mgr.host9.arhpnd (mgr.4516356) 5787
>>>>>> : cephadm [DBG] Host "host4" marked as offline. Skipping osdspec preview refresh
>>>>>> cephadm 2023-04-06T10:19:59.998682+0000 mgr.host9.arhpnd (mgr.4516356) 5788
>>>>>> : cephadm [DBG] Host "host4" marked as offline. Skipping autotune
>>>>>> cluster 2023-04-06T10:20:00.000151+0000 mon.host8 (mon.0) 158587 : cluster
>>>>>> [ERR] Health detail: HEALTH_ERR 9 hosts fail cephadm check; Upgrade: failed
>>>>>> due to an unexpected exception
>>>>>> cluster 2023-04-06T10:20:00.000191+0000 mon.host8 (mon.0) 158588 : cluster
>>>>>> [ERR] [WRN] CEPHADM_HOST_CHECK_FAILED: 9 hosts fail cephadm check
>>>>>> cluster 2023-04-06T10:20:00.000202+0000 mon.host8 (mon.0) 158589 : cluster
>>>>>> [ERR] host host7 (x.x.x.x) failed check: Unable to reach remote host
>>>>>> host7. Process exited with non-zero exit status 3
>>>>>> cluster 2023-04-06T10:20:00.000213+0000 mon.host8 (mon.0) 158590 : cluster
>>>>>> [ERR] host host2 (x.x.x.x) failed check: Unable to reach remote host
>>>>>> host2. Process exited with non-zero exit status 3
>>>>>> cluster 2023-04-06T10:20:00.000220+0000 mon.host8 (mon.0) 158591 : cluster
>>>>>> [ERR] host host8 (x.x.x.x) failed check: Unable to reach remote host
>>>>>> host8. Process exited with non-zero exit status 3
>>>>>> cluster 2023-04-06T10:20:00.000228+0000 mon.host8 (mon.0) 158592 : cluster
>>>>>> [ERR] host host4 (x.x.x.x) failed check: Unable to reach remote host
>>>>>> host4. Process exited with non-zero exit status 3
>>>>>> cluster 2023-04-06T10:20:00.000240+0000 mon.host8 (mon.0) 158593 : cluster
>>>>>> [ERR] host host3 (x.x.x.x) failed check: Unable to reach remote host
>>>>>> host3. Process exited with non-zero exit status 3
>>>>>>
>>>>>> and here are some outputs of the commands:
>>>>>>
>>>>>> [root@host8 ~]# ceph -s
>>>>>> cluster:
>>>>>> id: xxx
>>>>>> health: HEALTH_ERR
>>>>>> 9 hosts fail cephadm check
>>>>>> Upgrade: failed due to an unexpected exception
>>>>>>
>>>>>> services:
>>>>>> mon: 5 daemons, quorum host8,host1,host7,host2,host9 (age 2w)
>>>>>> mgr: host9.arhpnd(active, since 105m), standbys: host8.jowfih,
>>>>>> host1.warjsr, host2.qyavjj
>>>>>> mds: 1/1 daemons up, 3 standby
>>>>>> osd: 37 osds: 37 up (since 8h), 37 in (since 3w)
>>>>>>
>>>>>> data:
>>>>>>
>>>>>>
>>>>>> io:
>>>>>> client:
>>>>>>
>>>>>> progress:
>>>>>> Upgrade to 17.2.5 (0s)
>>>>>> [............................]
>>>>>>
>>>>>> [root@host8 ~]# ceph orch upgrade status
>>>>>> {
>>>>>> "target_image": "my-private-repo/quay-io/ceph/ceph@sha256
>>>>>> :34c763383e3323c6bb35f3f2229af9f466518d9db926111277f5e27ed543c427",
>>>>>> "in_progress": true,
>>>>>> "which": "Upgrading all daemon types on all hosts",
>>>>>> "services_complete": [],
>>>>>> "progress": "3/59 daemons upgraded",
>>>>>> "message": "Error: UPGRADE_EXCEPTION: Upgrade: failed due to an
>>>>>> unexpected exception",
>>>>>> "is_paused": true
>>>>>> }
>>>>>> [root@host8 ~]# ceph cephadm check-host host7
>>>>>> check-host failed:
>>>>>> Host 'host7' not found. Use 'ceph orch host ls' to see all managed
>>>>>> hosts.
>>>>>> [root@host8 ~]# ceph versions
>>>>>> {
>>>>>> "mon": {
>>>>>> "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 5
>>>>>> },
>>>>>> "mgr": {
>>>>>> "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 1,
>>>>>> "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
>>>>>> },
>>>>>> "osd": {
>>>>>> "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 37
>>>>>> },
>>>>>> "mds": {
>>>>>> "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 4
>>>>>> },
>>>>>> "overall": {
>>>>>> "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 47,
>>>>>> "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> The strange thing is I can roll back the cluster status by failing
>>>>>> over to a not-yet-upgraded mgr like this:
>>>>>>
>>>>>> ceph mgr fail
>>>>>> ceph orch upgrade start my-private-repo/quay-io/ceph/ceph:v16.2.11
>>>>>>
>>>>>> Would you happen to have any idea about this?
>>>>>>
>>>>>> Best regards,
>>>>>> Reza
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>>>>>
>>>>>>