ceph orch cannot refresh - ceph-users

8 Jan 2023

Dear Ceph users,

After a host failure in my cluster (Quincy 17.2.3, managed by cephadm), 
ceph orch seems to be stuck and cannot operate. For example, it has not 
refreshed the status of several services for about 20 hours:

# ceph orch ls
NAME                       PORTS        RUNNING  REFRESHED   AGE  PLACEMENT
alertmanager               ?:9093,9094      1/1  3m ago      3M   count:1
crash                                      9/10  20h ago     3M   *
grafana                    ?:3000           1/1  3m ago      3M   count:1
mds.wizard_fs                               0/3  <deleting>  13h  bofur;balin;aka;count:3
mds.wizardfs                                2/3  20h ago     70m  bofur;balin;aka;count:3
mgr                                         2/2  20h ago     15m  bofur;balin;count:2
mon                                         4/5  20h ago     93m  bofur;balin;aka;romolo;dwalin;count:5
node-exporter              ?:9100          9/10  20h ago     3M   *
osd                                          24  3m ago      -    <unmanaged>
osd.all-available-devices                    72  20h ago     4w   *
prometheus                 ?:9095           1/1  3m ago      3M   count:1

The failed machine (named bifur) is offline but still in the cluster 
since I'm planning to restore it:

# ceph orch host ls
HOST     ADDR           LABELS               STATUS
aka      172.16.253.7   _admin
balin    172.16.253.3
bifur    172.16.253.5   _admin               Offline
bofur    172.16.253.2   _admin
dwalin   172.16.253.10
ogion    172.16.253.6   _no_autotune_memory
prestno  172.16.253.9
remolo   172.16.253.1
rokanan  172.16.253.8
romolo   172.16.253.4
10 hosts in cluster

Since this machine hosted a mon, I tried to redeploy the mons with a 
placement that excludes it:

# ceph orch apply mon --placement="5 bofur balin aka romolo dwalin"

but even though ceph orch ls shows that the mons should now be on the 
machines specified by --placement (see above), the mon on bifur somehow 
still seems to be present in the orchestrator's state, e.g.

# ceph orch restart mon
Scheduled to restart mon.aka on host 'aka'
Scheduled to restart mon.balin on host 'balin'
Scheduled to restart mon.bifur on host 'bifur'
Scheduled to restart mon.bofur on host 'bofur'
Scheduled to restart mon.romolo on host 'romolo'
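
To make the mismatch explicit, here is a minimal sketch that compares the 
hosts in the restart output with the hosts in the placement. Both lists 
are pasted in by hand from the commands above, and the variable names are 
only for illustration; nothing here talks to the cluster.

```shell
#!/bin/sh
# Hosts requested in the placement spec (copied from the apply command above).
placement="bofur balin aka romolo dwalin"

# Output of "ceph orch restart mon", pasted verbatim from above.
restart_output="Scheduled to restart mon.aka on host 'aka'
Scheduled to restart mon.balin on host 'balin'
Scheduled to restart mon.bifur on host 'bifur'
Scheduled to restart mon.bofur on host 'bofur'
Scheduled to restart mon.romolo on host 'romolo'"

# Extract the host name between the single quotes on each line,
# then flatten the result into one space-separated list.
scheduled=$(printf '%s\n' "$restart_output" \
    | sed -n "s/.*on host '\([^']*\)'.*/\1/p" | tr '\n' ' ')

# Hosts with a mon scheduled for restart that are not in the placement.
stale=""
for h in $scheduled; do
    case " $placement " in
        *" $h "*) ;;
        *) stale="$stale $h" ;;
    esac
done

# Hosts in the placement for which no mon was scheduled.
missing=""
for h in $placement; do
    case " $scheduled " in
        *" $h "*) ;;
        *) missing="$missing $h" ;;
    esac
done

echo "stale:$stale"
echo "missing:$missing"
```

With the data above this reports bifur as stale and dwalin as missing, so 
the orchestrator still tracks the old mon and has not deployed the new one.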

I manually restarted all the mon and mgr daemons on online hosts to no 
avail. At this point I am clueless, so any help is greatly appreciated.

Nicola