I just checked every single host. The only cephadm processes running
were "cephadm shell" sessions from debugging. I closed all of them, so
now I can verify there's not a single cephadm process running on any of
my Ceph hosts. (And since I found the shell processes, I can verify I
didn't have a typo ;-) )
Regarding the broken record: I'm extremely thankful for your support.
And I should have checked that earlier. We all know that sometimes it's
the least probable things that go sideways, so checking the things
you're sure are ok is always a good idea. Thanks for being adamant about
that. But now we can be sure, at least.
On 15.05.23 21:27, Adam King wrote:
If it persisted through a full restart, it's possible the conditions
that caused the hang are still present after the fact. The two known
causes I'm aware of are lack of space in the root partition and hanging
mount points. Both would show up as processes in "ps aux | grep cephadm"
though. The latter could possibly be related to cephfs pool issues if
you have something mounted on one of the hosts. Still hard to say
without knowing what exactly got stuck. For clarity, without restarting
or changing anything else, can you verify whether "ps aux | grep cephadm"
shows anything on the nodes? I know I'm a bit of a broken record on
mentioning the hanging processes stuff, but outside of module crashes,
which don't appear to be present here, 100% of the other cases of this
type of thing I've looked at before have had those processes sitting
around.
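The two failure modes named here (full root partition, hanging mounts)
plus long-running cephadm processes can be checked with plain shell. A
minimal sketch using only generic tools; the one-hour threshold is an
arbitrary choice, not something cephadm itself uses:

```shell
# Free space on the root partition - cephadm can hang when / fills up.
df -h /

# cephadm processes with their elapsed runtime in seconds; anything
# that has been sitting for over an hour is a candidate for a hung call.
ps -eo etimes,pid,cmd | awk '$1 > 3600 && /cephadm/ && !/awk/ {print}'

# A hanging mount point usually blocks even a simple stat; wrapping it
# in timeout keeps the check itself from hanging.
for m in $(awk '{print $2}' /proc/mounts); do
    timeout 5 stat -f "$m" >/dev/null 2>&1 || echo "possibly hung: $m"
done
```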
On Mon, May 15, 2023 at 3:10 PM Thomas Widhalm <widhalmt(a)widhalm.or.at
<mailto:widhalmt@widhalm.or.at>> wrote:
This is why I even tried a full cluster shutdown. All hosts were out,
so there's no possibility that there's any process hanging. After I
started the nodes, it's just the same as before. All refresh times show
"4 weeks", like it stopped simultaneously on all nodes.
Some time ago we had a small change in name resolution so I thought,
maybe the orchestrator can't connect via ssh anymore. But I tried all
the steps in
https://docs.ceph.com/docs/master/cephadm/troubleshooting/#ssh-errors
<https://docs.ceph.com/docs/master/cephadm/troubleshooting/#ssh-errors> .
The only thing that's slightly suspicious is that it said it added the
host key to known hosts. But since I tried via "cephadm shell", I guess
the known hosts are just not replicated to these containers. ssh works,
too. (And I would have expected a warning if that had failed.)
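For completeness, the manual connection test from that troubleshooting
page looks like this (a sketch of the documented steps; "ceph06" stands
in for any affected host):

```shell
# Grab the exact SSH config and identity key the orchestrator uses.
ceph cephadm get-ssh-config > ssh_config
ceph config-key get mgr/cephadm/ssh_identity_key > cephadm_key
chmod 0600 cephadm_key

# Connect the same way the mgr would; replace ceph06 with a real host.
ssh -F ssh_config -i cephadm_key root@ceph06
```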
I don't see any information about the orchestrator module having
crashed. It's running as always.
From the prior problem I had some issues in my cephfs pools. So maybe
there's something broken in the .mgr pool? Could that be a reason for
this behaviour? I googled a while but didn't find any way to check that
explicitly.
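As for checking the .mgr pool: a few read-only commands can at least
confirm the pool exists and its PGs are healthy (a sketch; nothing here
modifies state):

```shell
# Does the .mgr pool show up at all, and how much does it hold?
ceph df

# Replication, pg_num and flags for the pool.
ceph osd pool ls detail | grep '\.mgr'

# Per-PG state; anything other than active+clean deserves a look.
ceph pg ls-by-pool .mgr
```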
On 15.05.23 19:15, Adam King wrote:
This is sort of similar to what I said in a previous email, but the
only way I've seen this happen in other setups is through hanging
cephadm commands. The debug process has been: do a mgr failover, wait a
few minutes, see in "ceph orch ps" and "ceph orch device ls" which
hosts have and have not been refreshed (the REFRESHED column should be
some lower value on the hosts where it refreshed), go to the hosts
where it did not refresh and check "ps aux | grep cephadm", looking for
long running (and therefore most likely hung) processes. I would still
expect that's the most likely thing you're experiencing here. I haven't
seen any other causes for cephadm to not refresh unless the module
crashed, but that would be explicitly stated in the cluster health.
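Spelled out as commands, the debug loop described above would look
roughly like this (a sketch; run the last step on each host whose
REFRESHED value stayed high):

```shell
ceph mgr fail            # force a failover to a standby mgr
sleep 300                # give the new mgr a few minutes to refresh
ceph orch ps             # compare the REFRESHED column per host
ceph orch device ls      # same idea for the device inventory

# On any host that did not refresh:
ps aux | grep cephadm    # long-running entries are likely hung
```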
On Mon, May 15, 2023 at 11:44 AM Thomas Widhalm
<widhalmt(a)widhalm.or.at <mailto:widhalmt@widhalm.or.at>>
wrote:
> Hi,
>
> I tried a lot of different approaches but I didn't have any
success so far.
>
> "ceph orch ps" still doesn't get refreshed.
>
> Some examples:
>
> mds.mds01.ceph06.huavsw  ceph06               starting      -       -    -     -      <unknown>  <unknown>     <unknown>
> mds.mds01.ceph06.rrxmks  ceph06               error         4w ago  3M   -     -      <unknown>  <unknown>     <unknown>
> mds.mds01.ceph07.omdisd  ceph07               error         4w ago  4M   -     -      <unknown>  <unknown>     <unknown>
> mds.mds01.ceph07.vvqyma  ceph07               starting      -       -    -     -      <unknown>  <unknown>     <unknown>
> mgr.ceph04.qaexpv        ceph04  *:8443,9283  running (4w)  4w ago  10M  551M  -      17.2.6     9cea3956c04b  33df84e346a0
> mgr.ceph05.jcmkbb        ceph05  *:8443,9283  running (4w)  4w ago  4M   441M  -      17.2.6     9cea3956c04b  1ad485df4399
> mgr.ceph06.xbduuf        ceph06  *:8443,9283  running (4w)  4w ago  4M   432M  -      17.2.6     9cea3956c04b  5ba5fd95dc48
> mon.ceph04               ceph04               running (4w)  4w ago  4M   223M  2048M  17.2.6     9cea3956c04b  8b6116dd216f
> mon.ceph05               ceph05               running (4w)  4w ago  4M   326M  2048M  17.2.6     9cea3956c04b  70520d737f29
>
> Debug Log doesn't show anything that could help me, either.
>
> 2023-05-15T14:48:40.852088+0000 mgr.ceph05.jcmkbb (mgr.83897390) 1376 : cephadm [INF] Schedule start daemon mds.mds01.ceph04.hcmvae
> 2023-05-15T14:48:43.620700+0000 mgr.ceph05.jcmkbb (mgr.83897390) 1380 : cephadm [INF] Schedule redeploy daemon mds.mds01.ceph04.hcmvae
> 2023-05-15T14:48:45.124822+0000 mgr.ceph05.jcmkbb (mgr.83897390) 1392 : cephadm [INF] Schedule start daemon mds.mds01.ceph04.krxszj
> 2023-05-15T14:48:46.493902+0000 mgr.ceph05.jcmkbb (mgr.83897390) 1394 : cephadm [INF] Schedule redeploy daemon mds.mds01.ceph04.krxszj
> 2023-05-15T15:05:25.637079+0000 mgr.ceph05.jcmkbb (mgr.83897390) 2629 : cephadm [INF] Saving service mds.mds01 spec with placement count:2
> 2023-05-15T15:07:27.625773+0000 mgr.ceph05.jcmkbb (mgr.83897390) 2780 : cephadm [INF] Saving service mds.fs_name spec with placement count:3
> 2023-05-15T15:07:42.120912+0000 mgr.ceph05.jcmkbb (mgr.83897390) 2795 : cephadm [INF] Saving service mds.mds01 spec with placement count:3
>
> I'm seeing all the commands I give but I don't get any more
> information on why it's not actually happening.
>
> I tried different scheduling mechanisms: host, tag, unmanaged and
> back again. I turned off orchestration and resumed. I failed the mgr.
> I even had full cluster stops (in the past). I made sure all daemons
> run the same version. (If you remember, the upgrade failed underway.)
>
> So my only way of getting daemons running is manually. I added two
> more hosts and tagged them. But there isn't a single daemon started
> there.
>
> Could you help me again with how to debug orchestration not working?
>
>
> On 04.05.23 15:12, Thomas Widhalm wrote:
>> Thanks.
>>
>> I set the log level to debug, try a few steps and then come back.
>>
>> On 04.05.23 14:48, Eugen Block wrote:
>>> Hi,
>>>
>>> try setting debug logs for the mgr:
>>>
>>> ceph config set mgr mgr/cephadm/log_level debug
>>>
>>> This should provide more details on what the mgr is trying and
>>> where it's failing, hopefully. Last week this helped me identify an
>>> issue on a lower Pacific version.
>>> Do you see anything in the cephadm.log pointing to the mgr actually
>>> trying something?
>>>
>>>
>>> Zitat von Thomas Widhalm <widhalmt(a)widhalm.or.at
<mailto:widhalmt@widhalm.or.at>>:
>>>
>>>> Hi,
>>>>
>>>> I'm in the process of upgrading my cluster from 17.2.5 to 17.2.6,
>>>> but the following problem already existed when I was still on
>>>> 17.2.5 everywhere.
>>>>
>>>> I had a major issue in my cluster which could be solved with a lot
>>>> of your help and even more trial and error. Right now it seems
>>>> that most is already fixed, but I can't rule out that there's
>>>> still some problem hidden. The very issue I'm asking about started
>>>> during the repair.
>>>>
>>>> When I want to orchestrate the cluster, it logs the command but it
>>>> doesn't do anything. No matter if I use the Ceph dashboard or
>>>> "ceph orch" in "cephadm shell", I don't get any error message when
>>>> I try to deploy new services, redeploy them etc. The log only says
>>>> "scheduled" and that's it. Same when I change placement rules.
>>>> Usually I use tags. But since they don't work anymore either, I
>>>> tried host and unmanaged. No success. The only way I can actually
>>>> start and stop containers is via systemctl from the host itself.
>>>>
>>>> When I run "ceph orch ls" or "ceph orch ps" I see services I
>>>> deployed for testing being deleted (for weeks now). And especially
>>>> a lot of old MDS are listed as "error" or "starting". The list
>>>> doesn't match reality at all because I had to start them by hand.
>>>>
>>>> I tried "ceph mgr fail" and even a complete shutdown of the whole
>>>> cluster with all nodes, including all mgr, mds, even osd -
>>>> everything during a maintenance window. Didn't change anything.
>>>>
>>>> Could you help me? To be honest I'm still rather new to Ceph, and
>>>> since I didn't find anything in the logs that caught my eye, I
>>>> would be thankful for hints on how to debug.
>>>>
>>>> Cheers,
>>>> Thomas
>>>> --
>>>> http://www.widhalm.or.at
>>>> GnuPG : 6265BAE6 , A84CB603
>>>> Threema: H7AV7D33
>>>> Telegram, Signal: widhalmt(a)widhalm.or.at <mailto:widhalmt@widhalm.or.at>
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users(a)ceph.io <mailto:ceph-users@ceph.io>
>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io>