It seems like it maybe didn't actually do the redeploy, as it should log
something saying it's actually doing it on top of the line saying it
scheduled it. To confirm, is the upgrade paused ("ceph orch upgrade status"
should report is_paused as true)? If so, maybe try a mgr failover ("ceph
mgr fail"), then check "ceph orch ps" and "ceph orch device ls" a few
minutes later and look at the REFRESHED column. If any of those show
refresh times from before you did the failover, there's probably something
going on on the host(s) that haven't refreshed recently that's holding
things up (you'd have to go onto those hosts and look for hanging cephadm
commands).
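Something like this, roughly (run from a node with the admin keyring; the
ps/grep check at the end is just a generic way to spot stuck processes):

    # confirm whether the upgrade is actually paused
    ceph orch upgrade status

    # fail over to a standby mgr
    ceph mgr fail

    # a few minutes later, compare the REFRESHED column across hosts
    ceph orch ps
    ceph orch device ls

    # on a host that isn't refreshing, look for hung cephadm processes
    ps aux | grep cephadm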
Lastly, you could look at the
/var/lib/ceph/<fsid>/<mds-daemon-name>/unit.run file on the hosts where the
mds daemons are deployed. The (very long) last podman/docker run line in
that file should have the name of the image the daemon is being deployed
with, so you could use that to confirm whether cephadm ever actually tried
a redeploy of the mds with the new image.
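For example, something along these lines (daemon name taken from your
output; fill in your fsid, and adjust the grep pattern if you pull from a
different registry):

    # on ceph06: the last line of unit.run is the podman/docker run command
    tail -n 1 /var/lib/ceph/<fsid>/mds.mds01.ceph06.rrxmks/unit.run \
        | grep -o 'quay.io/ceph/ceph[^ ]*'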
You could also check the journal logs for the mds. Cephadm reports the
systemd unit name for the daemon as part of "cephadm ls" output. If you put
a copy of the cephadm binary on the host, run "cephadm ls" with it, and
grab the systemd unit name for the mds daemon from that output, you can use
that to check the journal logs, which should tell you the last restart time
and why it went down.
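Roughly (the unit name below is illustrative; use whatever "cephadm ls"
actually reports):

    # on the host, as root, with a copy of the cephadm binary
    ./cephadm ls
    # take the "systemd_unit" value for the mds daemon from the JSON
    # output, then check its journal
    journalctl -u ceph-<fsid>@mds.mds01.ceph06.rrxmks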
On Mon, Apr 10, 2023 at 4:25 PM Thomas Widhalm <widhalmt(a)widhalm.or.at>
wrote:
I did what you told me.
I also see in the log that the command went through:
2023-04-10T19:58:46.522477+0000 mgr.ceph04.qaexpv [INF] Schedule redeploy daemon mds.mds01.ceph06.rrxmks
2023-04-10T20:01:03.360559+0000 mgr.ceph04.qaexpv [INF] Schedule redeploy daemon mds.mds01.ceph05.pqxmvt
2023-04-10T20:01:21.787635+0000 mgr.ceph04.qaexpv [INF] Schedule redeploy daemon mds.mds01.ceph07.omdisd
But the MDS daemons never start. They stay in error state. I tried to
redeploy and start them a few times, and even restarted one host where an
MDS should run.
mds.mds01.ceph03.xqwdjy  ceph03  error  32m ago  2M   -  -  <unknown>  <unknown>  <unknown>
mds.mds01.ceph04.hcmvae  ceph04  error  31m ago  2h   -  -  <unknown>  <unknown>  <unknown>
mds.mds01.ceph05.pqxmvt  ceph05  error  32m ago  9M   -  -  <unknown>  <unknown>  <unknown>
mds.mds01.ceph06.rrxmks  ceph06  error  32m ago  10w  -  -  <unknown>  <unknown>  <unknown>
mds.mds01.ceph07.omdisd  ceph07  error  32m ago  2M   -  -  <unknown>  <unknown>  <unknown>
Any other ideas? Or am I missing something?
Cheers,
Thomas
On 10.04.23 21:53, Adam King wrote:
Will also note that the normal upgrade process scales down the mds
service to have only 1 mds per fs before upgrading it, so that's maybe
something you'd want to do as well if the upgrade didn't do it already.
It does so by setting max_mds to 1 for the fs.
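Assuming your fs is named mds01 (which is what the daemon names suggest),
that would be something like:

    ceph fs set mds01 max_mds 1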
On Mon, Apr 10, 2023 at 3:51 PM Adam King <adking(a)redhat.com> wrote:
You could try pausing the upgrade and manually "upgrading" the mds
daemons by redeploying them on the new image. Something like "ceph
orch daemon redeploy <mds-daemon-name> --image <17.2.6 image>"
(daemon names should match those in "ceph orch ps" output). If you
do that for all of them and then get them into an up state you
should be able to resume the upgrade and have it complete.
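For example, using one of your mds daemon names and assuming
quay.io/ceph/ceph:v17.2.6 is the 17.2.6 image you're upgrading to:

    ceph orch upgrade pause
    ceph orch daemon redeploy mds.mds01.ceph06.rrxmks --image quay.io/ceph/ceph:v17.2.6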
On Mon, Apr 10, 2023 at 3:25 PM Thomas Widhalm <widhalmt(a)widhalm.or.at> wrote:
Hi,
If you remember, I hit bug
https://tracker.ceph.com/issues/58489 so I
was very relieved when 17.2.6 was released and started to update
immediately.
But now I'm stuck again with my broken MDS. The MDS won't get into
up:active without the update, but the update waits for them to get into
up:active state. Seems like a deadlock / chicken-and-egg problem to me.
Since I'm still relatively new to Ceph, could you help me?
What I see when watching the update status:
{
    "target_image": "quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635",
    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [
        "crash",
        "mgr",
        "mon",
        "osd"
    ],
    "progress": "18/40 daemons upgraded",
    "message": "Error: UPGRADE_OFFLINE_HOST: Upgrade: Failed to connect to host ceph01 at addr (192.168.23.61)",
    "is_paused": false
}
(The offline host was one host that broke during the upgrade. I
fixed
that in the meantime and the update went on.)
And in the log:
2023-04-10T19:23:48.750129+0000 mgr.ceph04.qaexpv [INF] Upgrade: Waiting for mds.mds01.ceph04.hcmvae to be up:active (currently up:replay)
2023-04-10T19:23:58.758141+0000 mgr.ceph04.qaexpv [WRN] Upgrade: No mds is up; continuing upgrade procedure to poke things in the right direction
Please give me a hint as to what I can do.
Cheers,
Thomas
--
http://www.widhalm.or.at
GnuPG : 6265BAE6 , A84CB603
Threema: H7AV7D33
Telegram, Signal: widhalmt(a)widhalm.or.at