[ceph-users] Re: Orchestration seems not to work

25 May 2023

Hi,

So sorry I didn't see your reply. I had some tough weeks (my father-in-law
died, which caused us some turmoil). I just came back to debugging and
didn't realize until now that you did in fact answer my e-mail.

I just ran your script on the host that is running the active manager. 
Thanks a lot for the excellent howto (and the script).

Connecting to each and every host (including itself) I get the "ok" reply:

[ceph: root@ceph06 /]# python3 /mnt/connect.py connect --address 
192.168.23.65 --priv-key-file identity_key --pub-key-file identity_pub 
--ssh-config-file ssh_config
return code: 0

stdout:

stderr:

Same with hostname.

As I wrote earlier today, the problem is not just with the
services/daemons but also with the physical disks. Maybe that helps with
finding the problem.

I'll run the script on all the other hosts, too. I just wanted to reply
as soon as I had the first reliable answer.

On 15.05.23 22:55, Adam King wrote:
...
 Okay, thanks for verifying that bit, and sorry to have gone on about it
 for so long. I guess we could look at connection issues next. I wrote a
 short python script that tries to connect to hosts using asyncssh, closely
 mirroring how cephadm does it (
 https://github.com/adk3798/testing_scripts/blob/main/asyncssh-connect.py).
 Maybe if you try that within a shell it will give some insight into whether
 these connections, specifically through asyncssh using the same
 keys/settings as cephadm, are working alright. I used it in a shell with
 --no-hosts from the host with the active mgr to try and be as close to
 internal cephadm operations as possible. connect.py here is the same as the
 linked file on github.

 [root@vm-00 ~]# cephadm shell --no-hosts --mount connect.py
 Inferring fsid 9dcee730-f32f-11ed-ba89-52540033ae03
 Using recent ceph image quay.io/adk3798/ceph@sha256:7168a3334fa16cd9d091462327c73f20548f113f9f404a316322aa4779a7639c
 [ceph: root@vm-00 /]#
 [ceph: root@vm-00 /]#
 [ceph: root@vm-00 /]# python3 -m pip install asyncssh
 WARNING: Running pip install with root privileges is generally not a good
 idea. Try `__main__.py install --user` instead.
 Collecting asyncssh
    Using cached https://files.pythonhosted.org/packages/1e/9f/ad61867b12823f6e2c0ef2b80a704…
 Requirement already satisfied: cryptography>=3.1 in
 /usr/lib64/python3.6/site-packages (from asyncssh)
 Collecting typing-extensions>=3.6 (from asyncssh)
    Using cached https://files.pythonhosted.org/packages/45/6b/44f7f8f1e110027cf88956b59f2fa…
 Requirement already satisfied: six>=1.4.1 in
 /usr/lib/python3.6/site-packages (from cryptography>=3.1->asyncssh)
 Requirement already satisfied: cffi!=1.11.3,>=1.8 in
 /usr/lib64/python3.6/site-packages (from cryptography>=3.1->asyncssh)
 Requirement already satisfied: pycparser in
 /usr/lib/python3.6/site-packages (from
 cffi!=1.11.3,>=1.8->cryptography>=3.1->asyncssh)
 Installing collected packages: typing-extensions, asyncssh
 Successfully installed asyncssh-2.13.1 typing-extensions-4.1.1
 [ceph: root@vm-00 /]# ceph config-key get mgr/cephadm/ssh_identity_key > identity_key
 [ceph: root@vm-00 /]# ceph config-key get mgr/cephadm/ssh_identity_pub > identity_pub
 [ceph: root@vm-00 /]# ceph cephadm get-ssh-config > ssh_config
 [ceph: root@vm-00 /]# ceph orch host ls
 HOST   ADDR             LABELS  STATUS
 vm-00  192.168.122.121  _admin
 vm-01  192.168.122.209
 vm-02  192.168.122.128
 3 hosts in cluster
 [ceph: root@vm-00 /]# python3 /mnt/connect.py connect --address
 192.168.122.209 --priv-key-file identity_key --pub-key-file identity_pub
 --ssh-config-file ssh_config
 return code: 0

 stdout:

 stderr:

 [ceph: root@vm-00 /]#

 The output won't look nice in an email, but basically it was just
 installing the asyncssh library, gathering the ssh keys/config from
 cephadm, picking an IP from the "ceph orch host ls" output and then running
 the script. By default it just runs "true" on the host, so in the success
 case it's just return code 0 and no output, but that could obviously be
 changed by modifying the python script to run something else as the cmd. If
 it fails, you get a traceback:

 [ceph: root@vm-00 /]# python3 /mnt/connect.py connect --address
 192.168.122.201 --priv-key-file identity_key --pub-key-file identity_pub
 --ssh-config-file ssh_config
 Traceback (most recent call last):
    File "/mnt/connect.py", line 100, in <module>
      main()
    File "/mnt/connect.py", line 93, in main
      r = args.func(args)
    File "/mnt/connect.py", line 8, in try_connection
      async_run(_connect(args))
    File "/mnt/connect.py", line 14, in async_run
      return loop.run_until_complete(coro)
    File "/usr/lib64/python3.6/asyncio/base_events.py", line 484, in
 run_until_complete
      return future.result()
    File "/mnt/connect.py", line 27, in _connect
      preferred_auth=['publickey'], options=ssh_options)
    File "/usr/local/lib/python3.6/site-packages/asyncssh/connection.py",
 line 8045, in connect
      timeout=new_options.connect_timeout)
    File "/usr/lib64/python3.6/asyncio/tasks.py", line 358, in wait_for
      return fut.result()
    File "/usr/local/lib/python3.6/site-packages/asyncssh/connection.py",
 line 432, in _connect
      flags=flags, local_addr=local_addr)
    File "/usr/lib64/python3.6/asyncio/base_events.py", line 794, in
 create_connection
      raise exceptions[0]
    File "/usr/lib64/python3.6/asyncio/base_events.py", line 781, in
 create_connection
      yield from self.sock_connect(sock, address)
    File "/usr/lib64/python3.6/asyncio/selector_events.py", line 439, in
 sock_connect
      return (yield from fut)
    File "/usr/lib64/python3.6/asyncio/selector_events.py", line 469, in
 _sock_connect_cb
      raise OSError(err, 'Connect call failed %s' % (address,))
 OSError: [Errno 113] Connect call failed ('192.168.122.201', 22)

 By trying this for connecting to each host in the cluster, given how close
 it is to how cephadm is operating, it should help verify with relative
 certainty whether this is connection related or not. I'll add the important
 bit that I'm using the user "root". If you're using a non-root user, the
 script takes a "--user" option.
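
 The traceback above ends in Errno 113, which on Linux is EHOSTUNREACH
 ("No route to host"), so that failure happened at the TCP level before any
 SSH authentication took place. A minimal sketch for separating plain
 network reachability from key or config problems could look like the
 following. It uses only the Python standard library; the function names
 check_ssh_port and summarize are hypothetical and not part of the linked
 script or of cephadm, and port 22 is an assumption.

```python
import errno
import socket


def check_ssh_port(address, port=22, timeout=5.0):
    """Try a plain TCP connect and return (reachable, detail).

    This only tests that something accepts connections on the port;
    it does not perform any SSH handshake or authentication.
    """
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True, "tcp connect ok"
    except OSError as exc:
        # Timeouts, refused connections and name resolution failures
        # are all OSError subclasses; exc.errno may be None for timeouts.
        if exc.errno:
            name = errno.errorcode.get(exc.errno, "unknown")
        else:
            name = "unknown"
        return False, "{}: {}".format(name, exc)


def summarize(addresses, port=22, timeout=5.0):
    """Map each address to its (reachable, detail) result."""
    return {addr: check_ssh_port(addr, port, timeout) for addr in addresses}
```

 Calling summarize() with the addresses from "ceph orch host ls" returns
 one (reachable, detail) pair per address. A host that passes here but
 fails in connect.py would point towards the keys or ssh_config rather than
 the network.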

 On Mon, May 15, 2023 at 3:36 PM Thomas Widhalm <widhalmt(a)widhalm.or.at>
 wrote:

 I just checked every single host. The only cephadm processes running
 were "cephadm shell" sessions from debugging. I closed all of them, so now
 I can verify that there's not a single cephadm process running on any of my
 ceph hosts. (And since I found the shell processes, I can verify I didn't
 have a typo ;-) )

 Regarding the broken record: I'm extremely thankful for your support. And I
 should have checked that earlier. We all know that sometimes it's the
 least probable things that go sideways. So checking the things you're
 sure are ok is always a good idea. Thanks for being adamant about
 that. But now we can be sure, at least.

 On 15.05.23 21:27, Adam King wrote:
 If it persisted through a full restart, it's possible the conditions
 that caused the hang are still present after the fact. The two known
 causes I'm aware of are lack of space in the root partition and hanging
 mount points. Both would show up as processes in "ps aux | grep cephadm"
 though. The latter could possibly be related to cephfs pool issues if
 you have something mounted on one of the hosts. Still hard to say
 without knowing what exactly got stuck. For clarity, without restarting
 or changing anything else, can you verify whether "ps aux | grep cephadm"
 shows anything on the nodes? I know I'm a bit of a broken record on
 mentioning the hanging processes stuff, but outside of module crashes,
 which don't appear to be present here, 100% of the other cases of this
 type of thing I've looked at before have had those processes
 sitting around.
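
 The check for hung processes can also be scripted. The following is a
 rough sketch, assuming a Linux host with procps-ng (for the "etimes"
 column of ps) and Python 3.6 or later. The function names and the 600
 second threshold are arbitrary choices for illustration, not anything
 cephadm defines.

```python
import subprocess


def parse_ps(text):
    """Parse lines of 'PID ELAPSED_SECONDS COMMAND' into tuples."""
    procs = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        # Skip blank or malformed lines instead of failing on them.
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        procs.append((int(parts[0]), int(parts[1]), parts[2]))
    return procs


def long_running_cephadm(procs, min_seconds=600):
    """Keep processes whose command mentions cephadm and that are old."""
    return [p for p in procs if "cephadm" in p[2] and p[1] >= min_seconds]


def scan(min_seconds=600):
    """Run ps on the local host and return long-running cephadm processes."""
    # Separate -o options with empty headers suppress the header line.
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "etimes=", "-o", "args="],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    ).stdout
    return long_running_cephadm(parse_ps(out), min_seconds)
```

 Running scan() on each host that did not refresh lists candidate
 processes as (pid, elapsed seconds, command) tuples. An open
 "cephadm shell" session will show up as well, so the result still needs a
 look by hand.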

 On Mon, May 15, 2023 at 3:10 PM Thomas Widhalm <widhalmt(a)widhalm.or.at>
 wrote:

      This is why I even tried a full cluster shutdown. All hosts were
      shut down, so there's no possibility that any process is still
      hanging. After I started the nodes, it's just the same as before. All
      refresh times show "4 weeks". Like it stopped simultaneously on all
      nodes.

      Some time ago we had a small change in name resolution so I thought,
      maybe the orchestrator can't connect via ssh anymore. But I tried all
      the steps in
      https://docs.ceph.com/docs/master/cephadm/troubleshooting/#ssh-errors .
      The only thing that's slightly suspicious is that it said it added
      the host key to known hosts. But since I tried via "cephadm shell", I
      guess the known hosts are just not replicated to these containers. ssh
      works, too. (And I would have expected a warning if that had failed.)

      I don't see any information about the orchestrator module having
      crashed. It's running as always.

      From the prior problem I had some issues in my cephfs pools. So,
      maybe there's something broken in the .mgr pool? Could that be a
      reason for this behaviour? I googled a while but didn't find any way
      to check that explicitly.

      On 15.05.23 19:15, Adam King wrote:
> This is sort of similar to what I said in a previous email, but the only
> way I've seen this happen in other setups is through hanging cephadm
> commands. The debug process has been: do a mgr failover, wait a few
> minutes, see in "ceph orch ps" and "ceph orch device ls" which hosts have
> and have not been refreshed (the REFRESHED column should be some lower
> value on the hosts where it refreshed), go to the hosts where it did not
> refresh and check "ps aux | grep cephadm" looking for long running (and
> therefore most likely hung) processes. I would still expect that's the
> most likely thing you're experiencing here. I haven't seen any other
> causes for cephadm to not refresh unless the module crashed, but that
> would be explicitly stated in the cluster health.
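
 The "which hosts have not been refreshed" step can be sketched in Python
 as below. This assumes the REFRESHED values look like the ones in this
 thread ("4w ago", "-") and that the unit suffixes are s, m, h, d, w, M and
 y, with months and years approximated as 30 and 365 days. The function
 names and the 1800 second threshold are hypothetical, and the (host,
 refreshed) pairs have to be taken from the "ceph orch ps" output by hand
 or by other means.

```python
import re

# Assumed unit suffixes of the age strings; months and years are rough.
_UNIT_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 3600,
    "d": 86400,
    "w": 7 * 86400,
    "M": 30 * 86400,
    "y": 365 * 86400,
}


def age_to_seconds(age):
    """Convert strings such as '4w ago' or '10M' to seconds, else None."""
    match = re.match(r"^(\d+)([smhdwMy])(?: ago)?$", age.strip())
    if match is None:
        return None
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def stale_hosts(rows, max_age_seconds=1800):
    """Return sorted hosts having at least one stale or unknown refresh.

    rows is an iterable of (host, refreshed) pairs. A value that cannot
    be parsed (for example '-') is treated as never refreshed.
    """
    stale = set()
    for host, refreshed in rows:
        seconds = age_to_seconds(refreshed)
        if seconds is None or seconds > max_age_seconds:
            stale.add(host)
    return sorted(stale)
```

 The hosts returned by stale_hosts() after a mgr failover are the ones
 where it is worth looking for long running cephadm processes.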

> On Mon, May 15, 2023 at 11:44 AM Thomas Widhalm <widhalmt(a)widhalm.or.at> wrote:

>> Hi,
>>
>> I tried a lot of different approaches but I didn't have any success so far.
>>
>> "ceph orch ps" still doesn't get refreshed.
>>
>> Some examples:
>>
>> mds.mds01.ceph06.huavsw  ceph06               starting      -       -    -     -      <unknown>  <unknown>     <unknown>
>> mds.mds01.ceph06.rrxmks  ceph06               error         4w ago  3M   -     -      <unknown>  <unknown>     <unknown>
>> mds.mds01.ceph07.omdisd  ceph07               error         4w ago  4M   -     -      <unknown>  <unknown>     <unknown>
>> mds.mds01.ceph07.vvqyma  ceph07               starting      -       -    -     -      <unknown>  <unknown>     <unknown>
>> mgr.ceph04.qaexpv        ceph04  *:8443,9283  running (4w)  4w ago  10M  551M  -      17.2.6     9cea3956c04b  33df84e346a0
>> mgr.ceph05.jcmkbb        ceph05  *:8443,9283  running (4w)  4w ago  4M   441M  -      17.2.6     9cea3956c04b  1ad485df4399
>> mgr.ceph06.xbduuf        ceph06  *:8443,9283  running (4w)  4w ago  4M   432M  -      17.2.6     9cea3956c04b  5ba5fd95dc48
>> mon.ceph04               ceph04               running (4w)  4w ago  4M   223M  2048M  17.2.6     9cea3956c04b  8b6116dd216f
>> mon.ceph05               ceph05               running (4w)  4w ago  4M   326M  2048M  17.2.6     9cea3956c04b  70520d737f29
>>
>> Debug Log doesn't show anything that could help me, either.
>>
>> 2023-05-15T14:48:40.852088+0000 mgr.ceph05.jcmkbb (mgr.83897390) 1376 : cephadm [INF] Schedule start daemon mds.mds01.ceph04.hcmvae
>> 2023-05-15T14:48:43.620700+0000 mgr.ceph05.jcmkbb (mgr.83897390) 1380 : cephadm [INF] Schedule redeploy daemon mds.mds01.ceph04.hcmvae
>> 2023-05-15T14:48:45.124822+0000 mgr.ceph05.jcmkbb (mgr.83897390) 1392 : cephadm [INF] Schedule start daemon mds.mds01.ceph04.krxszj
>> 2023-05-15T14:48:46.493902+0000 mgr.ceph05.jcmkbb (mgr.83897390) 1394 : cephadm [INF] Schedule redeploy daemon mds.mds01.ceph04.krxszj
>> 2023-05-15T15:05:25.637079+0000 mgr.ceph05.jcmkbb (mgr.83897390) 2629 : cephadm [INF] Saving service mds.mds01 spec with placement count:2
>> 2023-05-15T15:07:27.625773+0000 mgr.ceph05.jcmkbb (mgr.83897390) 2780 : cephadm [INF] Saving service mds.fs_name spec with placement count:3
>> 2023-05-15T15:07:42.120912+0000 mgr.ceph05.jcmkbb (mgr.83897390) 2795 : cephadm [INF] Saving service mds.mds01 spec with placement count:3
>>
>> I'm seeing all the commands I give but I don't get any more information
>> on why it's not actually happening.
>
>> I tried switching between different scheduling mechanisms: host, tag,
>> unmanaged and back again. I turned off orchestration and resumed it. I
>> failed the mgr. I even had full cluster stops (in the past). I made sure
>> all daemons run the same version. (If you remember, the upgrade failed
>> midway.)
>>
>> So my only way of getting daemons running is manually. I added two more
>> hosts and tagged them. But there isn't a single daemon started there.
>>
>> Could you help me again with how to debug orchestration not working?
>>
>>
>> On 04.05.23 15:12, Thomas Widhalm wrote:
>>> Thanks.
>>>
>>> I'll set the log level to debug, try a few steps and then come back.
>>>
>> On 04.05.23 14:48, Eugen Block wrote:
>>> Hi,
>>>
>>> try setting debug logs for the mgr:
>>>
>>> ceph config set mgr mgr/cephadm/log_level debug
>>>
>>> This should provide more details about what the mgr is trying and
>>> where it's failing, hopefully. Last week this helped to identify an
>>> issue between a lower pacific issue for me.
>>> Do you see anything in the cephadm.log pointing to the mgr actually
>>> trying something?
>>>
>>>
>>> Quoting Thomas Widhalm <widhalmt(a)widhalm.or.at>:
>>>
>>>> Hi,
>>>>
>>>> I'm in the process of upgrading my cluster from 17.2.5 to 17.2.6, but
>>>> the following problem already existed when I was still on 17.2.5
>>>> everywhere.
>>>>
>>>> I had a major issue in my cluster which could be solved with a lot of
>>>> your help and even more trial and error. Right now it seems that most
>>>> of it is already fixed, but I can't rule out that there's still some
>>>> problem hidden. The very issue I'm asking about started during the
>>>> repair.
>>>>
>>>> When I want to orchestrate the cluster, it logs the command but it
>>>> doesn't do anything, no matter if I use the ceph dashboard or "ceph orch"
>>>> in "cephadm shell". I don't get any error message when I try to
>>>> deploy new services, redeploy them etc. The log only says "scheduled"
>>>> and that's it. Same when I change placement rules. Usually I use
>>>> tags. But since they don't work anymore either, I tried host and
>>>> unmanaged. No success. The only way I can actually start and stop
>>>> containers is via systemctl from the host itself.
>>>>
>>>> When I run "ceph orch ls" or "ceph orch ps" I see services I deployed
>>>> for testing being deleted (for weeks now). And especially a lot of
>>>> old MDS are listed as "error" or "starting". The list doesn't match
>>>> reality at all because I had to start them by hand.
>>>>
>>>> I tried "ceph mgr fail" and even a complete shutdown of the whole
>>>> cluster with all nodes including all mgrs, mds, even osds - everything
>>>> during a maintenance window. Didn't change anything.
>>>>
>>>> Could you help me? To be honest I'm still rather new to Ceph and
>>>> since I didn't find anything in the logs that caught my eye I would
>>>> be thankful for hints on how to debug.
>>>>
>>>> Cheers,
>>>> Thomas
>>>> --
>>>> http://www.widhalm.or.at
>>>> GnuPG : 6265BAE6 , A84CB603
>>>> Threema: H7AV7D33
>>>> Telegram, Signal: widhalmt(a)widhalm.or.at
 >>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
