[ceph-users] Re: Orchestration seems not to work

15 May 2023

This is why I even tried a full cluster shutdown. All hosts were down, so
there's no possibility that a process was still hanging. After I
started the nodes, it's just the same as before. All refresh times show
"4 weeks", as if refreshing stopped simultaneously on all nodes.

Some time ago we had a small change in name resolution, so I thought
maybe the orchestrator can't connect via ssh anymore. But I tried all
the steps in
https://docs.ceph.com/docs/master/cephadm/troubleshooting/#ssh-errors .
The only thing that's slightly suspicious is that it said it added the
host key to known hosts. But since I tried via "cephadm shell", I guess
the known hosts are just not replicated to these containers. ssh works,
too. (And I would have expected to get a warning if that failed.)
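
For readers following along, the SSH check from the linked troubleshooting page can be wrapped roughly as below. This is only a sketch: the function name `check_cephadm_ssh` is made up for illustration, and it assumes cephadm connects as `root` and that the commands run where the `ceph` CLI has admin access (for example inside "cephadm shell").

```shell
# Hypothetical helper: test ssh to one host using the config and key that the
# cephadm mgr module itself uses. "host" is the name or address stored for the node.
check_cephadm_ssh() {
    host="$1"
    workdir="$(mktemp -d)" || return 1
    # Export the ssh client config and private key from the cluster, then try
    # a no-op remote command. Assumption: the ssh user is root.
    ceph cephadm get-ssh-config > "$workdir/ssh_config" &&
    ceph config-key get mgr/cephadm/ssh_identity_key > "$workdir/key" &&
    chmod 0600 "$workdir/key" &&
    ssh -F "$workdir/ssh_config" -i "$workdir/key" "root@$host" true
    rc=$?
    # Do not leave a copy of the private key behind.
    rm -rf "$workdir"
    return "$rc"
}

# Example: check_cephadm_ssh ceph06 && echo "ssh ok"
```

Running it once per host, with the exact host names shown by "ceph orch host ls", would show whether the name-resolution change affects any of them.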

I don't see any information about the orchestrator module having 
crashed. It's running as always.

From the prior problem I had some issues in my cephfs pools. So
maybe there's something broken in the .mgr pool? Could that be a reason
for this behaviour? I googled for a while but didn't find any way to
check that explicitly.

On 15.05.23 19:15, Adam King wrote:
...
  This is sort of similar to what I said in a previous email, but the only
 way I've seen this happen in other setups is through hanging cephadm
 commands. The debug process has been, do a mgr failover, wait a few
 minutes, see in "ceph orch ps" and "ceph orch device ls" which hosts have
 and have not been refreshed (the REFRESHED column should be some lower
 value on the hosts where it refreshed), go to the hosts where it did not
 refresh and check "ps aux | grep cephadm" looking for long running (and
 therefore most likely hung) processes. I would still expect that's the most
 likely thing you're experiencing here. I haven't seen any other causes for
 cephadm to not refresh unless the module crashed, but that would be
 explicitly stated in the cluster health.
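
 The per-host check described above could be sketched as a small shell filter. This is an illustration only: the function name `long_running_cephadm` and the 600-second threshold are invented for the example, and it assumes a Linux `ps` that supports the `etimes` (elapsed seconds) field.

```shell
# Hypothetical helper: read "PID ELAPSED_SECONDS COMMAND..." lines on stdin and
# print those that mention cephadm and have run longer than the threshold.
long_running_cephadm() {
    threshold="${1:-600}"   # seconds; arbitrary default chosen for this sketch
    # The numeric test on field 2 also skips the ps header line.
    awk -v t="$threshold" '$2 ~ /^[0-9]+$/ && $2 + 0 > t && /cephadm/ { print }'
}

# After "ceph mgr fail" and a few minutes of waiting, on a host whose
# REFRESHED value did not drop:
#   ps -eo pid,etimes,args | long_running_cephadm 600
```

 Anything it prints is only a candidate for a hung process and still needs to be inspected by hand.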

 On Mon, May 15, 2023 at 11:44 AM Thomas Widhalm <widhalmt(a)widhalm.or.at>
 wrote:

  Hi,

 I tried a lot of different approaches but I didn't have any success so far.

 "ceph orch ps" still doesn't get refreshed.

 Some examples:

 mds.mds01.ceph06.huavsw  ceph06               starting              -    -       -        -  <unknown>  <unknown>     <unknown>
 mds.mds01.ceph06.rrxmks  ceph06               error            4w ago   3M       -        -  <unknown>  <unknown>     <unknown>
 mds.mds01.ceph07.omdisd  ceph07               error            4w ago   4M       -        -  <unknown>  <unknown>     <unknown>
 mds.mds01.ceph07.vvqyma  ceph07               starting              -    -       -        -  <unknown>  <unknown>     <unknown>
 mgr.ceph04.qaexpv        ceph04  *:8443,9283  running (4w)     4w ago  10M    551M        -  17.2.6     9cea3956c04b  33df84e346a0
 mgr.ceph05.jcmkbb        ceph05  *:8443,9283  running (4w)     4w ago   4M    441M        -  17.2.6     9cea3956c04b  1ad485df4399
 mgr.ceph06.xbduuf        ceph06  *:8443,9283  running (4w)     4w ago   4M    432M        -  17.2.6     9cea3956c04b  5ba5fd95dc48
 mon.ceph04               ceph04               running (4w)     4w ago   4M    223M    2048M  17.2.6     9cea3956c04b  8b6116dd216f
 mon.ceph05               ceph05               running (4w)     4w ago   4M    326M    2048M  17.2.6     9cea3956c04b  70520d737f29

 Debug Log doesn't show anything that could help me, either.

 2023-05-15T14:48:40.852088+0000 mgr.ceph05.jcmkbb (mgr.83897390) 1376 : cephadm [INF] Schedule start daemon mds.mds01.ceph04.hcmvae
 2023-05-15T14:48:43.620700+0000 mgr.ceph05.jcmkbb (mgr.83897390) 1380 : cephadm [INF] Schedule redeploy daemon mds.mds01.ceph04.hcmvae
 2023-05-15T14:48:45.124822+0000 mgr.ceph05.jcmkbb (mgr.83897390) 1392 : cephadm [INF] Schedule start daemon mds.mds01.ceph04.krxszj
 2023-05-15T14:48:46.493902+0000 mgr.ceph05.jcmkbb (mgr.83897390) 1394 : cephadm [INF] Schedule redeploy daemon mds.mds01.ceph04.krxszj
 2023-05-15T15:05:25.637079+0000 mgr.ceph05.jcmkbb (mgr.83897390) 2629 : cephadm [INF] Saving service mds.mds01 spec with placement count:2
 2023-05-15T15:07:27.625773+0000 mgr.ceph05.jcmkbb (mgr.83897390) 2780 : cephadm [INF] Saving service mds.fs_name spec with placement count:3
 2023-05-15T15:07:42.120912+0000 mgr.ceph05.jcmkbb (mgr.83897390) 2795 : cephadm [INF] Saving service mds.mds01 spec with placement count:3

 I'm seeing all the commands I give but I don't get any more information
 on why it's not actually happening.

 I tried switching between different scheduling mechanisms: host, tag, unmanaged
 and back again. I turned off orchestration and resumed it. I failed the mgr. I
 even had full cluster stops (in the past). I made sure all daemons run
 the same version. (If you remember, the upgrade failed midway.)

 So the only way I can get daemons running is manually. I added two more
 hosts and tagged them, but not a single daemon has been started there.

 Could you help me again with how to debug orchestration not working?

 On 04.05.23 15:12, Thomas Widhalm wrote:
  Thanks.

 I'll set the log level to debug, try a few steps and then come back.

 On 04.05.23 14:48, Eugen Block wrote:
  Hi,

 try setting debug logs for the mgr:

 ceph config set mgr mgr/cephadm/log_level debug

 This should provide more details about what the mgr is trying and where it's
 failing, hopefully. Last week this helped me to identify an issue on
 a lower Pacific version.
 Do you see anything in the cephadm.log pointing to the mgr actually
 trying something?
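
 A possible round trip for the debug logging suggested above, as a sketch only. The `ceph config set` line is the one quoted in this message. `ceph log last cephadm` and `ceph config rm` are standard ceph CLI commands but are not mentioned in this thread, and the function names are hypothetical.

```shell
# Hypothetical wrappers around the ceph CLI; run them inside "cephadm shell"
# or on a node with an admin keyring.
cephadm_debug_on() {
    # Command quoted in the thread: raise the cephadm mgr module log level.
    ceph config set mgr mgr/cephadm/log_level debug
}

cephadm_debug_show() {
    # Assumption: print the most recent entries of the cephadm cluster log channel.
    ceph log last cephadm
}

cephadm_debug_off() {
    # Remove the override again so the module falls back to its default level.
    ceph config rm mgr mgr/cephadm/log_level
}

# Example session:
#   cephadm_debug_on; ceph mgr fail; sleep 300; cephadm_debug_show; cephadm_debug_off
```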

 Quoting Thomas Widhalm <widhalmt(a)widhalm.or.at>:

> Hi,
>
> I'm in the process of upgrading my cluster from 17.2.5 to 17.2.6 but
> the following problem already existed when everything was still on 17.2.5.
>
> I had a major issue in my cluster which could be solved with a lot of
> your help and even more trial and error. Right now it seems that most
> is already fixed but I can't rule out that there's still some problem
> hidden. The very issue I'm asking about started during the repair.
>
> When I want to orchestrate the cluster, it logs the command but it
> doesn't do anything. No matter if I use ceph dashboard or "ceph orch"
> in "cephadm shell". I don't get any error message when I try to
> deploy new services, redeploy them etc. The log only says "scheduled"
> and that's it. Same when I change placement rules. Usually I use
> tags. But since they don't work anymore either, I tried host and
> unmanaged. No success. The only way I can actually start and stop
> containers is via systemctl from the host itself.
>
> When I run "ceph orch ls" or "ceph orch ps" I see services I deployed
> for testing being deleted (for weeks now). And especially a lot of
> old MDS are listed as "error" or "starting". The list doesn't match
> reality at all because I had to start them by hand.
>
> I tried "ceph mgr fail" and even a complete shutdown of the whole
> cluster with all nodes including all mgrs, mds, even osds - everything
> during a maintenance window. Didn't change anything.
>
> Could you help me? To be honest I'm still rather new to Ceph and
> since I didn't find anything in the logs that caught my eye I would
> be thankful for hints how to debug.
>
> Cheers,
> Thomas
> --
> http://www.widhalm.or.at
> GnuPG : 6265BAE6 , A84CB603
> Threema: H7AV7D33
> Telegram, Signal: widhalmt(a)widhalm.or.at

 _______________________________________________
 ceph-users mailing list -- ceph-users(a)ceph.io
 To unsubscribe send an email to ceph-users-leave(a)ceph.io
