'ceph orch upgrade...' causes an rbd outage on a proxmox cluster - ceph-users

2 Feb 2023

Hi everyone,

I have a ceph test cluster and a proxmox test cluster (to try upgrades in test before
doing them in prod).
My ceph cluster is made up of three servers running debian 11, with two separate networks
(cluster_network and public_network, in VLANs).
It runs ceph version 16.2.10 (cephadm with docker).
Each server has one MGR, one MON and 8 OSDs.
  cluster:
    id:     xxx
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph01,ceph03,ceph02 (age 2h)
    mgr: ceph03(active, since 77m), standbys: ceph01, ceph02
    osd: 24 osds: 24 up (since 7w), 24 in (since 6M)

  data:
    pools:   3 pools, 65 pgs
    objects: 29.13k objects, 113 GiB
    usage:   344 GiB used, 52 TiB / 52 TiB avail
    pgs:     65 active+clean

  io:
    client:   1.3 KiB/s wr, 0 op/s rd, 0 op/s wr

The proxmox cluster is also made up of 3 servers running proxmox 7.2-7. The ceph storage
used is RBD (on the ceph public_network). I added the RBD datastores simply via the GUI.

So far so good. I have several VMs on each of the proxmox servers.

When I upgrade ceph to 16.2.11, that's where things go wrong.
I don't like it when an upgrade does everything for me without any control, so I did a
"staggered upgrade", following the official procedure
(https://docs.ceph.com/en/pacific/cephadm/upgrade/#staggered-upgrade). As the version
I'm starting from doesn't support staggered upgrades, I followed this procedure
(https://docs.ceph.com/en/pacific/cephadm/upgrade/#upgrading-to-a-version-th…).
When I do the "ceph orch redeploy" of the two standby MGRs, everything is fine.
When I do the "sudo ceph mgr fail", everything is fine (it fails over cleanly to an MGR
that was standby, so I get an active MGR on 16.2.11).
However, when I do the "sudo ceph orch upgrade start --image
quay.io/ceph/ceph:v16.2.11 --daemon-types mgr", it upgrades the last MGR that had
not been upgraded yet (so far everything is still fine), but then it does a final restart
of all the MGRs to finish, and at that point proxmox apparently loses access to the RBD
storage and all my VMs are shut off.
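
The sequence above can be sketched roughly as the shell script below. It is only an
illustration, not a verified procedure: the daemon names passed to upgrade_mgrs are
hypothetical placeholders (the real ones are listed by "ceph orch ps"), the redeploy step
uses the "ceph orch daemon redeploy ... --image" form given in the linked documentation,
and the run/upgrade_mgrs helpers and the DRY_RUN variable are made up for this sketch.
By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the MGR-only upgrade sequence described above.
# DRY_RUN defaults to 1, so the commands are printed instead of executed.
IMAGE="quay.io/ceph/ceph:v16.2.11"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, run it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Arguments: names of the standby MGR daemons (placeholders in this sketch).
upgrade_mgrs() {
    # Step 1: redeploy each standby MGR on the new image.
    for daemon in "$@"; do
        run ceph orch daemon redeploy "$daemon" --image "$IMAGE"
    done
    # Step 2: fail the active MGR so that an upgraded standby takes over.
    run ceph mgr fail
    # Step 3: let the orchestrator upgrade the MGR daemons only.
    run ceph orch upgrade start --image "$IMAGE" --daemon-types mgr
}
```

For example, "upgrade_mgrs mgr.ceph01.aaaaaa mgr.ceph02.bbbbbb" (made-up names) prints the
four commands in order. A real run would also need to wait for each step to settle
before starting the next one, which this sketch does not do.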
Here is the message in the proxmox syslog:
Feb  2 16:20:52 pmox01 QEMU[436706]: terminate called after throwing an instance of 'std::system_error'
Feb  2 16:20:52 pmox01 QEMU[436706]:   what():  Resource deadlock avoided
Feb  2 16:20:52 pmox01 kernel: [17038607.686686] vmbr0: port 2(tap102i0) entered disabled state
Feb  2 16:20:52 pmox01 kernel: [17038607.779049] vmbr0: port 2(tap102i0) entered disabled state
Feb  2 16:20:52 pmox01 systemd[1]: 102.scope: Succeeded.
Feb  2 16:20:52 pmox01 systemd[1]: 102.scope: Consumed 43.136s CPU time.
Feb  2 16:20:53 pmox01 qmeventd[446872]: Starting cleanup for 102
Feb  2 16:20:53 pmox01 qmeventd[446872]: Finished cleanup for 102
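
To see how many QEMU processes were hit on a given proxmox node, the syslog can be
searched for the same error. The function below is a minimal sketch: its name is made up,
and the default log path /var/log/syslog is an assumption about the node's syslog setup.

```shell
#!/bin/sh
# Sketch: print the PIDs of QEMU processes that aborted with
# "Resource deadlock avoided", one per line.
# Argument 1: log file to scan (default path is an assumption).
crashed_qemu_pids() {
    logfile="${1:-/var/log/syslog}"
    grep 'QEMU\[[0-9]*]:.*Resource deadlock avoided' "$logfile" \
        | sed 's/.*QEMU\[\([0-9]*\)].*/\1/' \
        | sort -u
}
```

Calling "crashed_qemu_pids" on each node right after the MGR restart gives a quick list
to compare against the VMs that stopped.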

On the ceph side, everything is fine: it does the upgrade and tells me everything is OK
at the end.
Ceph is now on 16.2.11 and the health is OK.

When I downgrade the MGRs again, the problem occurs again, and when I restart the
procedure, it still occurs. It's very reproducible.
According to my tests, the "sudo ceph orch upgrade" command always causes the
problem, even when trying a real staggered upgrade from and to version 16.2.11 with the
command:
sudo ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.11 --daemon-types mgr --hosts ceph01 --limit 1

Does anyone have an idea?

Thank you, everyone!
Pierre.