[ceph-users] Re: Orchestration seems not to work

7 Jun 2023

It looks like an old answer from the list just solved my problem!

I found https://www.mail-archive.com/ceph-users@ceph.io/msg14418.html .

So I tried

ceph config rm mds.mds01.ceph03.xqwdjy container_image
ceph config rm mgr.ceph06.xbduuf container_image

And BOOM. It worked.

Thanks for all the help you got me!

Cheers,
Thomas

On 07.06.23 15:32, Thomas Widhalm wrote:
...
  I found something else, that might help with
identifying the problem. 
 When I look into which containers are used I see the following:

 global: 

quay.io/ceph/ceph@sha256:0560b16bec6e84345f29fb6693cd2430884e6efff16a95d5bdd0bb06d7661c45,
 mon: 

quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635,
 mgr: 

quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635,
 mgr.ceph06.xbduuf: 9cea3956c04b,
 osd: 

quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635,
 mds.mds01.ceph03.xqwdjy: 2abc8cd2afe6,
 mds.mds01.ceph04.hcmvae: 

quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635,
 mds.mds01.ceph04.krxszj: 

quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635,
 mds.mds01.ceph05.pqxmvt: 

quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635,
 mds.mds01.ceph05.szzppy: 

quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635,
 mds.mds01.ceph06.huavsw: 

quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635,
 mds.mds01.ceph06.rrxmks: 

quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635,
 mds.mds01.ceph07.omdisd: 

quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635,
 client.crash: 

quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635

 What puzzles me is that global has another container ID than the others. 
 The services mgr.ceph06.xbduuf and mds.mds01.ceph03.xqwdjy which are 
 listed with other IDs don't even exist any more. I even removed the 
 remaining mgr on ceph06 and deployed a new one.

 I tried changing deployment rules to default (host, no hosts listed, no 
 count set) and shut down the cluster yet again. Still I have 2 months 
 old data in "ceph orch ps". Same in the Dashboard.

 Any other ideas I could check for?

 On 25.05.23 15:04, Thomas Widhalm wrote:
  What caught my eye is that this is also true for
Disks on Hosts.

 I added another disk to an OSD host. I can zap it with cephadm, I can 
 even make it an OSD with "ceph orch daemon add osd ceph06:/dev/sdb" 
 and it will be listed as new OSD in Ceph Dashboard.

 But, when I look at the "Physical Disks" part of Ceph dashboard or run 
 "ceph orch device ls --refresh" I still don't see the new disk. Even 
 more, I see the already added disk listed as /dev/sdb even when it's 
 now /dev/sdd after adding the other one. "Refreshed" is the same 
 timeframe I see in "ceph orch ps".

 I assume there's a global problem not letting me refresh information 
 about services and not change anything.

 I'm still stuck and wouldn't know where to look. I tried to verify if 
 all hosts are available by ssh from the others and that all daemons 
 have their keyrings. As far as I can tell, this part works, even when 
 I'm not yet a specialist in debugging Ceph.

 On 04.05.23 16:55, Adam King wrote:
  what does specifically `ceph log last 200 debug
cephadm` spit out? 
 The log lines you've posted so far I don't think are generated by the 
 orchestrator so curious what the last actions it took was (and how 
 long ago).

 On Thu, May 4, 2023 at 10:35 AM Thomas Widhalm 
 &lt;widhalmt(a)widhalm.or.at <mailto:widhalmt@widhalm.or.at>> wrote:

     To completely rule out hung processes, I managed to get another 
 short
     shutdown.

     Now I'm seeing lots of:

     mgr.server handle_open ignoring open from mds.mds01.ceph01.usujbi
     v2:192.168.23.61:6800/2922006253
     <http://192.168.23.61:6800/2922006253>; not ready for session
     (expect reconnect)
     mgr finish mon failed to return metadata for 
 mds.mds01.ceph02.otvipq:
     (2) No such file or directory

     log lines. Seems like it now realises that some of these 
 informations
     are stale. But it looks like it's just waiting for it to come 
 back and
     not do anything about it.

     On 04.05.23 14:48, Eugen Block wrote:
  Hi,

 try setting debug logs for the mgr:

 ceph config set mgr mgr/cephadm/log_level debug

 This should provide more details what the mgr is trying and where
      it's
  failing, hopefully. Last week this helped to
identify an issue
      between a
  lower pacific issue for me.
 Do you see anything in the cephadm.log pointing to the mgr 
  actually
  trying something?

 Zitat von Thomas Widhalm &lt;widhalmt(a)widhalm.or.at
      <mailto:widhalmt@widhalm.or.at>>:

> Hi,
>
> I'm in the process of upgrading my cluster from 17.2.5 to 17.2.6
      but
 > the following problem existed when I was
still everywhere on
      17.2.5 .
 >
> I had a major issue in my cluster which could be solved with a
      lot of
 > your help and even more trial and error.
Right now it seems that
      most
 > is already fixed but I can't rule out
that there's still some
      problem
 > hidden. The very issue I'm asking about
started during the 
  repair.
 >
> When I want to orchestrate the cluster, it logs the command 
  but it
 > doesn't do anything. No matter if I use
ceph dashboard or "ceph
      orch"
 > in "cephadm shell". I don't get
any error message when I try to
      deploy
 > new services, redeploy them etc. The log only
says 
  "scheduled" and
 > that's it. Same when I change placement
rules. Usually I use
      tags. But
 > since they don't work anymore, too, I
tried host and 
  umanaged. No
 > success. The only way I can actually start
and stop containers
      is via
 > systemctl from the host itself.
>
> When I run "ceph orch ls" or "ceph orch ps" I see services I
      deployed
 > for testing being deleted (for weeks now).
Ans especially a lot
      of old
 > MDS are listed as "error" or
"starting". The list doesn't match
> reality at all because I had to start them by hand.
>
> I tried "ceph mgr fail" and even a complete shutdown of the 
  whole
 > cluster with all nodes including all mgs, mds
even osd - 
  everything
 > during a maintenance window. Didn't
change anything.
>
> Could you help me? To be honest I'm still rather new to Ceph and
      since
 > I didn't find anything in the logs that
caught my eye I would be
> thankful for hints how to debug.
>
> Cheers,
> Thomas
> --
> http://www.widhalm.or.at <http://www.widhalm.or.at>
> GnuPG : 6265BAE6 , A84CB603
> Threema: H7AV7D33
> Telegram, Signal: widhalmt(a)widhalm.or.at
      <mailto:widhalmt@widhalm.or.at>

 _______________________________________________
 ceph-users mailing list -- ceph-users(a)ceph.io
      <mailto:ceph-users@ceph.io>
  To unsubscribe send an email to
ceph-users-leave(a)ceph.io
      <mailto:ceph-users-leave@ceph.io>
     _______________________________________________
     ceph-users mailing list -- ceph-users(a)ceph.io
     <mailto:ceph-users@ceph.io>
     To unsubscribe send an email to ceph-users-leave(a)ceph.io
     <mailto:ceph-users-leave@ceph.io>

 _______________________________________________
 ceph-users mailing list -- ceph-users(a)ceph.io
 To unsubscribe send an email to ceph-users-leave(a)ceph.io

 _______________________________________________
 ceph-users mailing list -- ceph-users(a)ceph.io
 To unsubscribe send an email to ceph-users-leave(a)ceph.io

2024

2023

2022

2021

2020

2019

[ceph-users] Re: Orchestration seems not to work