Have you tried a mgr failover? 'ceph mgr fail' should do the trick, since
restarting a mgr daemon won't fail it over. You should be able to see
hints about what is failing in the active mgr logs, e.g. the cephadm logs.
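A minimal sketch of that check, assuming a cephadm-managed cluster (no daemon names are needed; 'ceph mgr fail' with no argument fails the current active mgr):

```shell
# Fail over to a standby mgr; restarting the active mgr daemon alone
# does not trigger a failover.
ceph mgr fail

# Confirm which mgr is active now and that a standby took over.
ceph mgr stat

# Review recent cephadm log entries for hints about what the
# orchestrator is stuck on.
ceph log last cephadm
```

If the orchestrator recovers after the failover, the REFRESHED column in 'ceph orch ls' should start updating again within a few minutes.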
Quoting Nicola Mori <mori(a)fi.infn.it>:
Dear Ceph users,
after a host failure in my cluster (quincy 17.2.3 managed by
cephadm) it seems that ceph orch got somehow stuck and it cannot
operate. For example, it seems that it cannot refresh the status of
several services since about 20 hours:
# ceph orch ls
NAME                       PORTS        RUNNING  REFRESHED   AGE  PLACEMENT
alertmanager               ?:9093,9094  1/1      3m ago      3M   count:1
crash                                   9/10     20h ago     3M   *
grafana                    ?:3000       1/1      3m ago      3M
mds.wizard_fs                           0/3      <deleting>  13h
mds.wizardfs                            2/3      20h ago     70m
mgr                                     2/2      20h ago     15m
mon                                     4/5      20h ago     93m
node-exporter              ?:9100       9/10     20h ago     3M   *
osd                                     24       3m ago      -
osd.all-available-devices               72       20h ago     4w   *
prometheus                 ?:9095       1/1      3m ago      3M
The failed machine (named bifur) is offline but still in the cluster
since I'm planning to restore it:
# ceph orch host ls
HOST   ADDR          LABELS               STATUS
aka    172.16.253.7  _admin
bifur  172.16.253.5  _admin               Offline
bofur  172.16.253.2  _admin
ogion  172.16.253.6  _no_autotune_memory
10 hosts in cluster
Since this machine hosted a mon I tried to redeploy it with:
# ceph orch apply mon --placement="5 bofur balin aka romolo dwalin"
but even though ceph orch ls shows that the mons should currently be on
the machines specified by --placement (see above), the mon on bifur
seems to be somehow still present in ceph orch:
# ceph orch restart mon
Scheduled to restart mon.aka on host 'aka'
Scheduled to restart mon.balin on host 'balin'
Scheduled to restart mon.bifur on host 'bifur'
Scheduled to restart mon.bofur on host 'bofur'
Scheduled to restart mon.romolo on host 'romolo'
I manually restarted all the mon and mgr daemons on the online hosts, to
no avail. At this point I am clueless, so any help is greatly appreciated.