From the `ceph versions` output I can see:
    "osd": {
        "ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)": 160
    },
It seems like all the OSD daemons on this cluster are using that
16.2.10-160 image, and I'm guessing most of them are running, so it must
have existed at some point. I'm curious whether `ceph config dump | grep
container_image` will show a different image setting for the OSDs. Anyway,
in terms of moving forward it might be best to get all the daemons onto an
image you know works. I also see both 16.2.10-208 and 16.2.10-248 listed
as versions, which implies two different images are in use even among the
other daemons. Unless there's a reason for all these different images, I'd
just pick the most up-to-date one that you know can be pulled on all hosts
and do a `ceph orch upgrade start --image <image-name>`. That would get
all the daemons onto that single image, and might fix the broken OSDs that
are failing to pull the 16.2.10-160 image.
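Something like this, roughly (the tag in the upgrade command is just a
placeholder; substitute whichever tag you know can be pulled on every
host):

$ ceph config dump | grep container_image    # see which image each daemon type is configured with
$ ceph orch upgrade start --image registry.redhat.io/rhceph/rhceph-5-rhel8:<known-good-tag>
$ ceph orch upgrade status                   # check progress until all daemons converge

Once that finishes, `ceph versions` should report a single version across
mon/mgr/osd/rgw.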
On Wed, Mar 27, 2024 at 8:56 PM Alex <mr.alexey(a)gmail.com> wrote:
Hello.
We're rebuilding our OSD nodes.
One cluster worked without any issues, but this one is being stubborn.
I attempted to add one back to the cluster and am seeing the error below
in our logs:
cephadm ['--image',
'registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160', 'pull']
2024-03-27 19:30:53,901 7f49792ed740 DEBUG /bin/podman: 4.6.1
2024-03-27 19:30:53,905 7f49792ed740 INFO Pulling container image
registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160...
2024-03-27 19:30:54,045 7f49792ed740 DEBUG /bin/podman: Trying to pull
registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160...
2024-03-27 19:30:54,266 7f49792ed740 DEBUG /bin/podman: Error:
initializing source
docker://registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160: reading
manifest 16.2.10-160 in registry.redhat.io/rhceph/rhceph-5-rhel8:
manifest unknown
2024-03-27 19:30:54,270 7f49792ed740 INFO Non-zero exit code 125 from
/bin/podman pull registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160
2024-03-27 19:30:54,270 7f49792ed740 INFO /bin/podman: stderr Trying
to pull registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160...
2024-03-27 19:30:54,270 7f49792ed740 INFO /bin/podman: stderr Error:
initializing source
docker://registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160: reading
manifest 16.2.10-160 in registry.redhat.io/rhceph/rhceph-5-rhel8:
manifest unknown
2024-03-27 19:30:54,270 7f49792ed740 ERROR ERROR: Failed command:
/bin/podman pull registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160
$ ceph versions
{
    "mon": {
        "ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 1,
        "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 2
    },
    "mgr": {
        "ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 1,
        "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 2
    },
    "osd": {
        "ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)": 160
    },
    "mds": {},
    "rgw": {
        "ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 3
    },
    "overall": {
        "ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)": 160,
        "ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 5,
        "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 4
    }
}
I don't understand why it's trying to pull 16.2.10-160, which doesn't
exist. The images currently on the host are:
registry.redhat.io/rhceph/rhceph-5-dashboard-rhel8   5      93b3137e7a65   11 months ago   696 MB
registry.redhat.io/rhceph/rhceph-5-rhel8             5-416  838cea16e15c   11 months ago   1.02 GB
registry.redhat.io/openshift4/ose-prometheus         v4.6   ec2d358ca73c   17 months ago   397 MB
This happens when using cephadm-ansible as well as with:
$ ceph orch ls --export --service_name xxx > xxx.yml
$ sudo ceph orch apply -i xxx.yml
I tried `ceph orch daemon add osd host:/dev/sda`, which surprisingly
created a volume on host:/dev/sda and created an OSD I can see in
$ ceph osd tree
but it did not get added to the host, I suspect because of the same
podman error, and now I'm unable to remove it.
$ ceph orch osd rm
does not work even with the --force flag.
After 10+ minutes I stopped the removal with
$ ceph orch osd rm stop
I'm considering running $ ceph osd purge osd# --force but am worried it
may only make things worse.
`ceph -s` shows that OSD, but not as up or in.
Thanks, and looking forward to any advice!