Cephadm - Error ENOENT: Module not found - ceph-users

28 Mar 2023

Hello, 
After a successful upgrade of a Ceph cluster from 16.2.7 to 16.2.11, I needed to downgrade
it back to 16.2.7 as I found an issue with the new version. 

I expected that running the downgrade with `ceph orch upgrade start --ceph-version 16.2.7`
would work fine. However, it stalled right after the downgrade of the first MGR
daemon: the downgraded daemon is no longer able to load the cephadm module. Any
`ceph orch` command fails with the following error:

```
$ ceph orch ps
Error ENOENT: Module not found
```
The downgrade process is therefore blocked.

These are the logs of the MGR when issuing the command:

```
Mar 28 12:13:15 astano03
ceph-c57586c4-8e44-11eb-a116-248a07aa8d2e-mgr-astano03-qtzccn[2232770]: debug
2023-03-28T10:13:15.557+0000 7f828fe8c700  0 log_channel(audit) log [DBG] :
from='client.3136173 -' entity='client.admin' cmd=[{"prefix":
"orch ps", "target": ["mon-mgr", ""]}]: dispatch
Mar 28 12:13:15 astano03
ceph-c57586c4-8e44-11eb-a116-248a07aa8d2e-mgr-astano03-qtzccn[2232770]: debug
2023-03-28T10:13:15.558+0000 7f829068d700  0 [orchestrator DEBUG root] _oremote
orchestrator -> cephadm.list_daemons(*(None, None), **{'daemon_id': None,
'host': None, 'refresh': False})
Mar 28 12:13:15 astano03
ceph-c57586c4-8e44-11eb-a116-248a07aa8d2e-mgr-astano03-qtzccn[2232770]: debug
2023-03-28T10:13:15.558+0000 7f829068d700 -1 no module 'cephadm'
Mar 28 12:13:15 astano03
ceph-c57586c4-8e44-11eb-a116-248a07aa8d2e-mgr-astano03-qtzccn[2232770]: debug
2023-03-28T10:13:15.558+0000 7f829068d700  0 [orchestrator DEBUG root] _oremote
orchestrator -> cephadm.get_feature_set(*(), **{})
Mar 28 12:13:15 astano03
ceph-c57586c4-8e44-11eb-a116-248a07aa8d2e-mgr-astano03-qtzccn[2232770]: debug
2023-03-28T10:13:15.558+0000 7f829068d700 -1 no module 'cephadm'
Mar 28 12:13:15 astano03
ceph-c57586c4-8e44-11eb-a116-248a07aa8d2e-mgr-astano03-qtzccn[2232770]: debug
2023-03-28T10:13:15.558+0000 7f829068d700 -1 mgr.server reply reply (2) No such file or
directory Module not found
```

Other interesting MGR logs are:
```
 2023-03-28T11:05:59.519+0000 7fcd16314700  4 mgr get_store get_store key:
mgr/cephadm/upgrade_state
 2023-03-28T11:05:59.519+0000 7fcd16314700 -1 mgr load Failed to construct class in
'cephadm'
 2023-03-28T11:05:59.519+0000 7fcd16314700 -1 mgr load Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 450, in __init__
    self.upgrade = CephadmUpgrade(self)
  File "/usr/share/ceph/mgr/cephadm/upgrade.py", line 111, in __init__
    self.upgrade_state: Optional[UpgradeState] = UpgradeState.from_json(json.loads(t))
  File "/usr/share/ceph/mgr/cephadm/upgrade.py", line 92, in from_json
    return cls(**c)
TypeError: __init__() got an unexpected keyword argument 'daemon_types'

 2023-03-28T11:05:59.521+0000 7fcd16314700 -1 mgr operator() Failed to run module in
active mode ('cephadm')
```
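These errors seem to be related to the new staggered upgrade feature: the stored
`mgr/cephadm/upgrade_state` contains a `daemon_types` field that the 16.2.7 code does not accept.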
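
To illustrate what I think is happening, here is a minimal Python sketch of the failure pattern in the traceback above. It is not the actual cephadm code: the class `OldUpgradeState`, its fields other than `daemon_types`, the stored values, and the `from_json_tolerant` helper are all hypothetical. It only shows how a state written with an extra field breaks a constructor that is called with `cls(**c)`, and how filtering unknown keys would avoid the error.

```python
import inspect
import json
from typing import Any, Dict


class OldUpgradeState:
    """Hypothetical stand-in for the state class shipped in the older release."""

    def __init__(self, target_name: str, progress_id: str) -> None:
        self.target_name = target_name
        self.progress_id = progress_id

    @classmethod
    def from_json(cls, data: Dict[str, Any]) -> "OldUpgradeState":
        # Same pattern as in the traceback: every stored key becomes a keyword argument.
        return cls(**data)

    @classmethod
    def from_json_tolerant(cls, data: Dict[str, Any]) -> "OldUpgradeState":
        # Hypothetical alternative: keep only the keys that __init__ accepts.
        accepted = set(inspect.signature(cls.__init__).parameters) - {"self"}
        return cls(**{k: v for k, v in data.items() if k in accepted})


# State as a newer release might have persisted it (values are made up).
stored = json.dumps(
    {"target_name": "example-image", "progress_id": "abc", "daemon_types": None}
)

try:
    OldUpgradeState.from_json(json.loads(stored))
    error = None
except TypeError as exc:
    # The older class rejects the field it does not know about.
    error = str(exc)

print(error)

# Dropping the unknown key lets the older class load the same stored state.
state = OldUpgradeState.from_json_tolerant(json.loads(stored))
print(state.target_name)
```

If this is indeed the cause, the 16.2.7 MGR cannot parse the upgrade state that 16.2.11 wrote, which would explain why the module fails at construction time rather than during the downgrade itself.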

Please note that before, everything was working fine with version 16.2.7.

I am currently stuck in this situation: the only MGR daemon still working is the one
that remains on version 16.2.11:

```
[root@astano01 ~]# ceph orch ps | grep mgr
mgr.astano02.mzmewn                    astano02  *:8443,9283  running (5d)     43s ago  2y     455M        -  16.2.11  7a63bce27215  e2d7806acf16
mgr.astano03.qtzccn                    astano03  *:8443,9283  running (3m)     22s ago  95m    383M        -  16.2.7   463ec4b1fdc0  cc0d88864fa1
```

Has anyone already faced this issue, or does anyone know how I can make the 16.2.7 MGR load the
cephadm module correctly?

Thanks in advance for any help!