I just checked every single host. The only cephadm processes running
were "cephadm shell" sessions from debugging. I closed all of them, so
now I can verify there's not a single cephadm process running on any of
my Ceph hosts. (And since I found the shell processes, I can verify I
didn't have a typo ;-) )
Regarding the broken record: I'm extremely thankful for your support.
And I should have checked that earlier. We all know that sometimes it's
the least probable things that go sideways, so checking the things
you're sure are ok is always a good idea. Thanks for being adamant about
that. But now we can be sure, at least.
On 15.05.23 21:27, Adam King wrote:
If it persisted through a full restart, it's possible the conditions
that caused the hang are still present after the fact. The two known
causes I'm aware of are lack of space in the root partition and hanging
mount points. Both would show up as processes in "ps aux | grep cephadm"
though. The latter could possibly be related to cephfs pool issues if
you have something mounted on one of the hosts. Still hard to say
without knowing what exactly got stuck. For clarity, without restarting
or changing anything else, can you verify whether "ps aux | grep cephadm"
shows anything on the nodes? I know I'm a bit of a broken record on
mentioning the hanging processes stuff, but outside of module crashes,
which don't appear to be present here, 100% of the other cases of this
type of thing I've looked at before have had those processes sitting
around.
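The two failure modes named here (full root partition, hanging mounts)
plus long-running cephadm processes can be checked with plain shell. A
minimal sketch using only generic tools; the one-hour threshold is an
arbitrary choice, not something cephadm itself uses:

```shell
# Free space on the root partition - cephadm can hang when / fills up.
df -h /

# cephadm processes with their elapsed runtime in seconds; anything
# that has been sitting for over an hour is a candidate for a hung call.
ps -eo etimes,pid,cmd | awk '$1 > 3600 && /cephadm/ && !/awk/ {print}'

# A hanging mount point usually blocks even a simple stat; wrapping it
# in timeout keeps the check itself from hanging.
for m in $(awk '{print $2}' /proc/mounts); do
    timeout 5 stat -f "$m" >/dev/null 2>&1 || echo "possibly hung: $m"
done
```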
On Mon, May 15, 2023 at 3:10 PM Thomas Widhalm <widhalmt(a)widhalm.or.at
<mailto:widhalmt@widhalm.or.at>> wrote:
This is why I even tried a full cluster shutdown. All hosts were out,
so there's no possibility that there's any process hanging. After I
started the nodes, it's just the same as before. All refresh times show
"4 weeks", like it stopped simultaneously on all nodes.
Some time ago we had a small change in name resolution so I thought,
maybe the orchestrator can't connect via ssh anymore. But I tried all
the steps in
https://docs.ceph.com/docs/master/cephadm/troubleshooting/#ssh-errors
<https://docs.ceph.com/docs/master/cephadm/troubleshooting/#ssh-errors> .
The only thing that's slightly suspicious is that it said it added the
host key to known hosts. But since I tried via "cephadm shell", I guess
the known hosts are just not replicated to these containers. ssh works,
too. (And I would have expected a warning if that had failed.)
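For completeness, the manual connection test from that troubleshooting
page looks like this (a sketch of the documented steps; "ceph06" stands
in for any affected host):

```shell
# Grab the exact SSH config and identity key the orchestrator uses.
ceph cephadm get-ssh-config > ssh_config
ceph config-key get mgr/cephadm/ssh_identity_key > cephadm_key
chmod 0600 cephadm_key

# Connect the same way the mgr would; replace ceph06 with a real host.
ssh -F ssh_config -i cephadm_key root@ceph06
```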
I don't see any information about the orchestrator module having
crashed. It's running as always.
From the prior problem I had some issues in my cephfs pools. So maybe
there's something broken in the .mgr pool? Could that be a reason for
this behaviour? I googled a while but didn't find any way to check that
explicitly.
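As for checking the .mgr pool: a few read-only commands can at least
confirm the pool exists and its PGs are healthy (a sketch; nothing here
modifies state):

```shell
# Does the .mgr pool show up at all, and how much does it hold?
ceph df

# Replication, pg_num and flags for the pool.
ceph osd pool ls detail | grep '\.mgr'

# Per-PG state; anything other than active+clean deserves a look.
ceph pg ls-by-pool .mgr
```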
On 15.05.23 19:15, Adam King wrote:
This is sort of similar to what I said in a previous email, but the
only way I've seen this happen in other setups is through hanging
cephadm commands. The debug process has been: do a mgr failover, wait a
few minutes, see in "ceph orch ps" and "ceph orch device ls" which
hosts have and have not been refreshed (the REFRESHED column should be
some lower value on the hosts where it refreshed), go to the hosts
where it did not refresh and check "ps aux | grep cephadm", looking for
long running (and therefore most likely hung) processes. I would still
expect that's the most likely thing you're experiencing here. I haven't
seen any other causes for cephadm to not refresh unless the module
crashed, but that would be explicitly stated in the cluster health.
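Spelled out as commands, the debug loop described above would look
roughly like this (a sketch; run the last step on each host whose
REFRESHED value stayed high):

```shell
ceph mgr fail            # force a failover to a standby mgr
sleep 300                # give the new mgr a few minutes to refresh
ceph orch ps             # compare the REFRESHED column per host
ceph orch device ls      # same idea for the device inventory

# On any host that did not refresh:
ps aux | grep cephadm    # long-running entries are likely hung
```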
On Mon, May 15, 2023 at 11:44 AM Thomas Widhalm
<widhalmt(a)widhalm.or.at <mailto:widhalmt@widhalm.or.at>>
wrote:
> Hi,
>
> I tried a lot of different approaches but I didn't have any
success so far.
>
> "ceph orch ps" still doesn't get refreshed.
>
> Some examples:
>
> mds.mds01.ceph06.huavsw  ceph06               starting      -       -    -     -      <unknown>  <unknown>     <unknown>
> mds.mds01.ceph06.rrxmks  ceph06               error         4w ago  3M   -     -      <unknown>  <unknown>     <unknown>
> mds.mds01.ceph07.omdisd  ceph07               error         4w ago  4M   -     -      <unknown>  <unknown>     <unknown>
> mds.mds01.ceph07.vvqyma  ceph07               starting      -       -    -     -      <unknown>  <unknown>     <unknown>
> mgr.ceph04.qaexpv        ceph04  *:8443,9283  running (4w)  4w ago  10M  551M  -      17.2.6     9cea3956c04b  33df84e346a0
> mgr.ceph05.jcmkbb        ceph05  *:8443,9283  running (4w)  4w ago  4M   441M  -      17.2.6     9cea3956c04b  1ad485df4399
> mgr.ceph06.xbduuf        ceph06  *:8443,9283  running (4w)  4w ago  4M   432M  -      17.2.6     9cea3956c04b  5ba5fd95dc48
> mon.ceph04               ceph04               running (4w)  4w ago  4M   223M  2048M  17.2.6     9cea3956c04b  8b6116dd216f
> mon.ceph05               ceph05               running (4w)  4w ago  4M   326M  2048M  17.2.6     9cea3956c04b  70520d737f29
>
> Debug Log doesn't show anything that could help me, either.
>
> 2023-05-15T14:48:40.852088+0000 mgr.ceph05.jcmkbb (mgr.83897390) 1376 : cephadm [INF] Schedule start daemon mds.mds01.ceph04.hcmvae
> 2023-05-15T14:48:43.620700+0000 mgr.ceph05.jcmkbb (mgr.83897390) 1380 : cephadm [INF] Schedule redeploy daemon mds.mds01.ceph04.hcmvae
> 2023-05-15T14:48:45.124822+0000 mgr.ceph05.jcmkbb (mgr.83897390) 1392 : cephadm [INF] Schedule start daemon mds.mds01.ceph04.krxszj
> 2023-05-15T14:48:46.493902+0000 mgr.ceph05.jcmkbb (mgr.83897390) 1394 : cephadm [INF] Schedule redeploy daemon mds.mds01.ceph04.krxszj
> 2023-05-15T15:05:25.637079+0000 mgr.ceph05.jcmkbb (mgr.83897390) 2629 : cephadm [INF] Saving service mds.mds01 spec with placement count:2
> 2023-05-15T15:07:27.625773+0000 mgr.ceph05.jcmkbb (mgr.83897390) 2780 : cephadm [INF] Saving service mds.fs_name spec with placement count:3
> 2023-05-15T15:07:42.120912+0000 mgr.ceph05.jcmkbb (mgr.83897390) 2795 : cephadm [INF] Saving service mds.mds01 spec with placement count:3
>
> I'm seeing all the commands I give but I don't get any more
> information on why it's not actually happening.
>
> I tried different scheduling mechanisms: host, tag, unmanaged and
> back again. I turned off orchestration and resumed. I failed the mgr.
> I even had full cluster stops (in the past). I made sure all daemons
> run the same version. (If you remember, the upgrade failed underway.)
>
> So my only way of getting daemons running is manually. I added two
> more hosts and tagged them. But there isn't a single daemon started
> there.
>
> Could you help me again with how to debug orchestration not working?
>
>
> On 04.05.23 15:12, Thomas Widhalm wrote:
>> Thanks.
>>
>> I set the log level to debug, try a few steps and then come back.
>>
>> On 04.05.23 14:48, Eugen Block wrote:
>>> Hi,
>>>
>>> try setting debug logs for the mgr:
>>>
>>> ceph config set mgr mgr/cephadm/log_level debug
>>>
>>> This should provide more details on what the mgr is trying and
>>> where it's failing, hopefully. Last week this helped me identify an
>>> issue on a lower Pacific version.
>>> Do you see anything in the cephadm.log pointing to the mgr actually
>>> trying something?
>>>
>>>
>>> Zitat von Thomas Widhalm <widhalmt(a)widhalm.or.at
<mailto:widhalmt@widhalm.or.at>>:
>>>
>>>> Hi,
>>>>
>>>> I'm in the process of upgrading my cluster from 17.2.5 to 17.2.6,
>>>> but the following problem already existed when I was still on
>>>> 17.2.5 everywhere.
>>>>
>>>> I had a major issue in my cluster which could be solved with a lot
>>>> of your help and even more trial and error. Right now it seems
>>>> that most is already fixed, but I can't rule out that there's
>>>> still some problem hidden. The very issue I'm asking about started
>>>> during the repair.
>>>>
>>>> When I want to orchestrate the cluster, it logs the command but it
>>>> doesn't do anything. No matter if I use the Ceph dashboard or
>>>> "ceph orch" in "cephadm shell", I don't get any error message when
>>>> I try to deploy new services, redeploy them etc. The log only says
>>>> "scheduled" and that's it. Same when I change placement rules.
>>>> Usually I use tags. But since they don't work anymore either, I
>>>> tried host and unmanaged. No success. The only way I can actually
>>>> start and stop containers is via systemctl from the host itself.
>>>>
>>>> When I run "ceph orch ls" or "ceph orch ps" I see services I
>>>> deployed for testing being deleted (for weeks now). And especially
>>>> a lot of old MDS are listed as "error" or "starting". The list
>>>> doesn't match reality at all because I had to start them by hand.
>>>>
>>>> I tried "ceph mgr fail" and even a complete shutdown of the whole
>>>> cluster with all nodes, including all mgr, mds, even osd -
>>>> everything during a maintenance window. Didn't change anything.
>>>>
>>>> Could you help me? To be honest I'm still rather new to Ceph, and
>>>> since I didn't find anything in the logs that caught my eye, I
>>>> would be thankful for hints on how to debug.
>>>>
>>>> Cheers,
>>>> Thomas
>>>> --
>>>> http://www.widhalm.or.at
>>>> GnuPG : 6265BAE6 , A84CB603
>>>> Threema: H7AV7D33
>>>> Telegram, Signal: widhalmt(a)widhalm.or.at <mailto:widhalmt@widhalm.or.at>
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users(a)ceph.io <mailto:ceph-users@ceph.io>
>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io>