Hey folks,
I'm working through some basic ops drills and noticed what I think is an
inconsistency in the cephadm docs. Some Googling suggests this is a known
thing, but I haven't found clear direction on cooking up a solution yet.
On a cluster with 5 mons, 2 were abruptly removed when their host OS
decided to do scheduled maintenance without asking first. Those hosts only
ran mons (plus mds/crash/node-exporter), so I still have a 3-mon quorum
and the cluster is happy.
It's not clear to me how to add these hosts back in as mons, though. The
troubleshooting docs describe bringing all mons down and then extracting a
monmap. I tried various iterations of this: bringing them all down, then
bringing one back up and entering its container; bringing them all down and
trying to use ceph-mon from a cephadm shell; and so on. I either hit
rocksdb lock errors (presumably because a mon was still running) or an
error that the path to the mon data didn't exist (presumably for the
opposite reason).
Is there guidance on the container-friendly way to perform the monmap
maintenance?
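For reference, my best reconstruction of the docs' procedure adapted for containers is below. I may well be misreading things, and the mon name and fsid are placeholders for my real values, but this is the shape of what I was attempting:

```shell
# Stop the mon first so rocksdb isn't locked (run on the mon's own host;
# <fsid> and <host> are placeholders)
systemctl stop ceph-<fsid>@mon.<host>.service

# "cephadm shell --name" mounts that daemon's data dir into the container
cephadm shell --name mon.<host>

# Then, inside that container shell:
ceph-mon -i <host> --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap
```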
Since I still have quorum, I did wonder whether I could simply do ceph orch
apply mon label:mon instead, but I'm nervous this might upset my remaining
mons. Looking at the ceph orch ls output I see:
root@kida:/# ceph orch ls
NAME                       PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager                      1/1      7m ago     2h   count:1
crash                             5/5      9m ago     2h   *
grafana                           1/1      7m ago     2h   count:1
mds.media                         3/3      9m ago     2h   thebends;okcomputer;amnesiac
mgr                               2/2      9m ago     2h   count:2
mon                               3/5      9m ago     2h   label:mon
node-exporter                     5/5      9m ago     2h   *
osd.all-available-devices         5/10     9m ago     2h   *
prometheus                        1/1      7m ago     2h   count:1
root@kida:/#
So is it expecting 2 more mons, or has it autoscaled down cleverly?
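One option I'm considering, on the assumption that the orchestrator would reconcile back to the label:mon placement on its own, is just removing the stale daemon records. These are standard orch commands, but I haven't dared run them yet:

```shell
# Remove the stale daemon entries and let the scheduler redeploy
ceph orch daemon rm mon.amnesiac --force
ceph orch daemon rm mon.kingoflimbs --force

# Check the hosts still carry the mon label
ceph orch host ls
```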
Looking at ceph orch ps I see:
root@kida:/# ceph orch ps
NAME                         HOST         PORTS        STATUS         REFRESHED  AGE  VERSION    IMAGE ID      CONTAINER ID
alertmanager.kida            kida         *:9093,9094  running (2h)   8m ago     2h   0.20.0     0881eb8f169f  89c604455194
crash.amnesiac               amnesiac                  running (11h)  8m ago     11h  16.2.4     8d91d370c2b8  bff086c930db
crash.kida                   kida                      running (2h)   8m ago     2h   16.2.4     8d91d370c2b8  b0ac059be109
crash.kingoflimbs            kingoflimbs               running (13h)  8m ago     13h  16.2.4     8d91d370c2b8  b0955309a8b9
crash.okcomputer             okcomputer                running (2h)   10m ago    2h   16.2.4     8d91d370c2b8  a75cf65ef235
crash.thebends               thebends                  running (2h)   8m ago     2h   16.2.4     8d91d370c2b8  befe9c1015f3
grafana.kida                 kida         *:3000       running (2h)   8m ago     2h   6.7.4      ae5c36c3d3cd  f85747138299
mds.media.amnesiac.uujwlk    amnesiac                  running (11h)  8m ago     2h   16.2.4     8d91d370c2b8  512a2fcc0f97
mds.media.okcomputer.nednib  okcomputer                running (2h)   10m ago    2h   16.2.4     8d91d370c2b8  10c6244a9308
mds.media.thebends.pqsfeb    thebends                  running (2h)   8m ago     2h   16.2.4     8d91d370c2b8  c1b75831a973
mgr.kida.kchysa              kida         *:9283       running (2h)   8m ago     2h   16.2.4     8d91d370c2b8  602acc0d8df3
mgr.okcomputer.rjtrqw        okcomputer   *:8443,9283  running (2h)   10m ago    2h   16.2.4     8d91d370c2b8  605a8a25a604
mon.amnesiac                 amnesiac                  stopped        8m ago     2h   <unknown>  <unknown>     <unknown>
mon.kida                     kida                      running (2h)   8m ago     2h   16.2.4     8d91d370c2b8  a441563a978d
mon.kingoflimbs              kingoflimbs               stopped        8m ago     2h   <unknown>  <unknown>     <unknown>
mon.okcomputer               okcomputer                running (2h)   10m ago    2h   16.2.4     8d91d370c2b8  c4297efafe27
mon.thebends                 thebends                  running (2h)   8m ago     2h   16.2.4     8d91d370c2b8  e2394d5f152b
node-exporter.amnesiac       amnesiac     *:9100       running (11h)  8m ago     2h   0.18.1     e5a616e4b9cf  da3c69057c4f
node-exporter.kida           kida         *:9100       running (2h)   8m ago     2h   0.18.1     e5a616e4b9cf  5c9219a29257
node-exporter.kingoflimbs    kingoflimbs  *:9100       running (13h)  8m ago     2h   0.18.1     e5a616e4b9cf  c2236491fb6e
node-exporter.okcomputer     okcomputer   *:9100       running (2h)   10m ago    2h   0.18.1     e5a616e4b9cf  2e53a82eed32
node-exporter.thebends       thebends     *:9100       running (2h)   8m ago     2h   0.18.1     e5a616e4b9cf  def6bdd359d6
osd.0                        kida                      running (2h)   8m ago     2h   16.2.4     8d91d370c2b8  c1419a29ddd8
osd.1                        kida                      running (85m)  8m ago     2h   16.2.4     8d91d370c2b8  dcb172c628ec
osd.2                        thebends                  running (2h)   8m ago     2h   16.2.4     8d91d370c2b8  4826e3da8d14
osd.3                        okcomputer                running (2h)   10m ago    2h   16.2.4     8d91d370c2b8  5424d437c270
osd.4                        thebends                  running (2h)   8m ago     2h   16.2.4     8d91d370c2b8  47e682c3727d
prometheus.kida              kida         *:9095       running (2h)   8m ago     2h   2.18.1     de242295e225  4c8e7fdd89a8
root@kida:/#
So those mon containers are still there, stopped. ceph orch daemon restart
mon.amnesiac reports that a restart is scheduled for that mon, and its
status in ceph orch ps flips to running, but version, image ID and
container ID remain <unknown>, and I don't see that mon in any status
output or log. cephadm unit --name mon.amnesiac restart --fsid
yadda-yadda-yadda errors with "daemon not found"; it seems the cephadm CLI
is scoped to daemons running on the host it's executed on, rather than
cluster-wide like ceph orch.
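If the cephadm CLI really is host-scoped, then presumably I'd need to run it on the affected host directly, something along these lines (fsid elided as before):

```shell
# On (or via ssh to) the affected host itself
ssh amnesiac cephadm ls                                  # what cephadm sees locally
ssh amnesiac cephadm unit --fsid <fsid> --name mon.amnesiac start
```

but I'd rather understand what state the orchestrator thinks these daemons are in before poking further.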
Any clues offered to further investigation are welcomed.
Best regards
Phil