From the `ceph versions` output I can see:
    "osd": {
        "ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)": 160
    },
It seems like all the OSD daemons on this cluster are using that
16.2.10-160 image, and I'm guessing most of them are running, so it must
have existed at some point. I'm curious whether `ceph config dump | grep
container_image` will show a different image setting for the OSDs. Anyway,
in terms of moving forward it might be best to get all the daemons onto an
image you know works. I also see both 16.2.10-208 and 16.2.10-248 listed
as versions, which implies two different images are in use even among the
other daemons. Unless there's a reason for all these different images, I'd
just pick the most up-to-date one that you know can be pulled on all hosts
and do a `ceph orch upgrade start --image <image-name>`. That would get
all the daemons onto that single image, and might fix the broken OSDs that
are failing to pull the 16.2.10-160 image.
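Something like this, roughly (the tag in the upgrade command is just a
placeholder; substitute whichever tag you know can be pulled on every
host):

$ ceph config dump | grep container_image    # see which image each daemon type is configured with
$ ceph orch upgrade start --image registry.redhat.io/rhceph/rhceph-5-rhel8:<known-good-tag>
$ ceph orch upgrade status                   # check progress until all daemons converge

Once that finishes, `ceph versions` should report a single version across
mon/mgr/osd/rgw.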
On Wed, Mar 27, 2024 at 8:56 PM Alex <mr.alexey(a)gmail.com> wrote:
Hello.
We're rebuilding our OSD nodes.
One cluster worked without any issues, but this one is being stubborn.
I attempted to add one back to the cluster and am seeing the error below
in our logs:
cephadm ['--image',
'registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160', 'pull']
2024-03-27 19:30:53,901 7f49792ed740 DEBUG /bin/podman: 4.6.1
2024-03-27 19:30:53,905 7f49792ed740 INFO Pulling container image
registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160...
2024-03-27 19:30:54,045 7f49792ed740 DEBUG /bin/podman: Trying to pull
registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160...
2024-03-27 19:30:54,266 7f49792ed740 DEBUG /bin/podman: Error:
initializing source
docker://registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160: reading
manifest 16.2.10-160 in registry.redhat.io/rhceph/rhceph-5-rhel8:
manifest unknown
2024-03-27 19:30:54,270 7f49792ed740 INFO Non-zero exit code 125 from
/bin/podman pull registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160
2024-03-27 19:30:54,270 7f49792ed740 INFO /bin/podman: stderr Trying
to pull registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160...
2024-03-27 19:30:54,270 7f49792ed740 INFO /bin/podman: stderr Error:
initializing source
docker://registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160: reading
manifest 16.2.10-160 in registry.redhat.io/rhceph/rhceph-5-rhel8:
manifest unknown
2024-03-27 19:30:54,270 7f49792ed740 ERROR ERROR: Failed command:
/bin/podman pull registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160
$ ceph versions
{
    "mon": {
        "ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 1,
        "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 2
    },
    "mgr": {
        "ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 1,
        "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 2
    },
    "osd": {
        "ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)": 160
    },
    "mds": {},
    "rgw": {
        "ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 3
    },
    "overall": {
        "ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)": 160,
        "ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 5,
        "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 4
    }
}
I don't understand why it's trying to pull 16.2.10-160, which doesn't
exist. The images currently on the host are:
registry.redhat.io/rhceph/rhceph-5-dashboard-rhel8   5      93b3137e7a65   11 months ago   696 MB
registry.redhat.io/rhceph/rhceph-5-rhel8             5-416  838cea16e15c   11 months ago   1.02 GB
registry.redhat.io/openshift4/ose-prometheus         v4.6   ec2d358ca73c   17 months ago   397 MB
This happens when using cephadm-ansible as well as with:
$ ceph orch ls --export --service_name xxx > xxx.yml
$ sudo ceph orch apply -i xxx.yml
I tried `ceph orch daemon add osd host:/dev/sda`, which surprisingly
created a volume on host:/dev/sda and created an OSD I can see in
$ ceph osd tree
but it did not get added to the host, I suspect because of the same
podman error, and now I'm unable to remove it.
$ ceph orch osd rm
does not work even with the --force flag.
After 10+ minutes I stopped the removal with
$ ceph orch osd rm stop
I'm considering running $ ceph osd purge osd# --force but am worried it
may only make things worse.
`ceph -s` shows that OSD, but not as up or in.
Thanks, and looking forward to any advice!