For setting the user, the `ceph cephadm set-user` command should do it. I'm a
bit surprised by the second part of that, though. With passwordless sudo access
I would have expected that to start working.
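A minimal sketch of that switch, assuming you want cephadm connecting as root going forward (host1 stands in for each managed host):

```shell
# Point cephadm at root so no sudo step is needed on the remote side.
ceph cephadm set-user root

# Install the cluster's public key for root on every managed host.
ceph cephadm get-pub-key > ~/ceph.pub
ssh-copy-id -f -i ~/ceph.pub root@host1    # repeat for each host

# Verify cephadm can now reach the host under the new user.
ceph cephadm check-host host1
```

These commands need a live cluster and reachable hosts; run the check against every host cephadm manages before resuming the upgrade.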
On Thu, May 4, 2023 at 11:27 AM Reza Bakhshayeshi <reza.b2008(a)gmail.com>
wrote:
Thank you.
I don't see any errors other than:
2023-05-04T15:07:38.003+0000 7ff96cbe0700 0 log_channel(cephadm) log
[DBG] : Running command: sudo which python3
2023-05-04T15:07:38.025+0000 7ff96cbe0700 0 log_channel(cephadm) log
[DBG] : Connection to host1 failed. Process exited with non-zero exit
status 3
2023-05-04T15:07:38.025+0000 7ff96cbe0700 0 log_channel(cephadm) log
[DBG] : _reset_con close host1
What is the best way to safely change the cephadm user to root for the
existing cluster? It seems "ceph cephadm set-ssh-config" is not effective.
(BTW, my cephadmin user can now run "sudo which python3" on the other hosts
without being prompted for a password, but nothing has been solved.)
Best regards,
Reza
On Tue, 2 May 2023 at 19:00, Adam King <adking(a)redhat.com> wrote:
> The number of mgr daemons thing is expected. The way it works is it first
> upgrades all the standby mgrs (which will be all but one) and then fails
> over so the previously active mgr can be upgraded as well. After that
> failover is when it's first actually running the newer cephadm code, which
> is when you're hitting this issue. Are the logs still saying something
> similar about how "sudo which python3" is failing? I'm thinking this
> might just be a general issue with the user being used not having
> passwordless sudo access, which sort of accidentally worked in pacific but
> no longer works in quincy. If the log lines confirm the same, we
> might have to work on something in order to handle this case (making the
> sudo optional somehow). As mentioned in the previous email, that setup
> wasn't intended to be supported even in pacific, although if it did work,
> we could bring something in to make it usable in quincy onward as well.
>
> On Tue, May 2, 2023 at 10:58 AM Reza Bakhshayeshi <reza.b2008(a)gmail.com>
> wrote:
>
>> Hi Adam,
>>
>> I'm still struggling with this issue. I also checked it one more time
>> with newer versions, upgrading the cluster from 16.2.11 to 16.2.12 was
>> successful but from 16.2.12 to 17.2.6 failed again with the same ssh errors
>> (I checked
>> https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#ssh-errors a
>> couple of times and all keys/access are fine).
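Following that troubleshooting page, the mgr's exact connection can be replayed by hand to see the same failure outside cephadm; a sketch (host2 and the cephadmin user are stand-ins for your own values):

```shell
# Grab the ssh config and identity key the active mgr actually uses.
ceph cephadm get-ssh-config > /tmp/cephadm-ssh-config
ceph config-key get mgr/cephadm/ssh_identity_key > /tmp/cephadm-key
chmod 0600 /tmp/cephadm-key

# Replay the failing step, including the sudo invocation quincy adds.
ssh -F /tmp/cephadm-ssh-config -i /tmp/cephadm-key cephadmin@host2 sudo which python3
echo "exit status: $?"
```

If the plain ssh succeeds but the `sudo which python3` step returns non-zero, that points at sudo policy rather than keys or networking.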
>>
>> [root@host1 ~]# ceph health detail
>> HEALTH_ERR Upgrade: Failed to connect to host host2 at addr (x.x.x.x)
>> [ERR] UPGRADE_OFFLINE_HOST: Upgrade: Failed to connect to host host2 at
>> addr (x.x.x.x)
>> SSH connection failed to host2 at addr (x.x.x.x): Host(s) were
>> marked offline: {'host2', 'host6', 'host9', 'host4', 'host3', 'host5',
>> 'host1', 'host7', 'host8'}
>>
>> The interesting thing is that always (total number of mgrs) - 1 get
>> upgraded: if I provision 5 MGRs then 4 of them, and with 3 MGRs, 2 of them!
>>
>> Since I'm in an internal environment, I also checked the process with
>> the Quincy cephadm binary. FYI, I'm using stretch mode on this cluster.
>>
>> I don't understand why Quincy MGRs cannot ssh into Pacific nodes, if you
>> have any more hints I would be really glad to hear.
>>
>> Best regards,
>> Reza
>>
>>
>>
>> On Wed, 12 Apr 2023 at 17:18, Adam King <adking(a)redhat.com> wrote:
>>
>>> Ah, okay. Someone else had opened an issue about the same thing after
>>> the 17.2.5 release I believe. It's changed in 17.2.6 at least to only use
>>> sudo for non-root users
>>> https://github.com/ceph/ceph/blob/v17.2.6/src/pybind/mgr/cephadm/ssh.py#L14….
>>> But it looks like you're also using a non-root user anyway. We've required
>>> passwordless sudo access for custom ssh users for a long time I think (e.g.
>>> it's in pacific docs
>>> https://docs.ceph.com/en/pacific/cephadm/install/#further-information-about…,
>>> see the point on "--ssh-user"). Did this actually work for you before in
>>> pacific with a non-root user that doesn't have sudo privileges? I had
>>> assumed that had never worked.
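For reference, the passwordless sudo requirement described in those docs usually comes down to a sudoers fragment like the following (the cephadmin user name and the blanket NOPASSWD scope are illustrative; tighten to your policy):

```
# /etc/sudoers.d/cephadmin -- illustrative fragment; validate with visudo -cf before installing
cephadmin ALL=(ALL) NOPASSWD: ALL
```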
>>>
>>> On Wed, Apr 12, 2023 at 10:38 AM Reza Bakhshayeshi <
>>> reza.b2008(a)gmail.com> wrote:
>>>
>>>> Thank you Adam for your response,
>>>>
>>>> I tried all your comments and the troubleshooting link you sent. From
>>>> the Quincy mgrs containers, they can ssh into all other Pacific nodes
>>>> successfully by running the exact command in the log output and vice versa.
>>>>
>>>> Here are some debug logs from the cephadm while updating:
>>>>
>>>> 2023-04-12T11:35:56.260958+0000 mgr.host8.jukgqm (mgr.4468627) 103 :
>>>> cephadm [DBG] Opening connection to cephadmin(a)x.x.x.x with ssh
>>>> options '-F /tmp/cephadm-conf-2bbfubub -i /tmp/cephadm-identity-7x2m8gvr'
>>>> 2023-04-12T11:35:56.525091+0000 mgr.host8.jukgqm (mgr.4468627) 144 :
>>>> cephadm [DBG] _run_cephadm : command = ls
>>>> 2023-04-12T11:35:56.525406+0000 mgr.host8.jukgqm (mgr.4468627) 145 :
>>>> cephadm [DBG] _run_cephadm : args = []
>>>> 2023-04-12T11:35:56.525571+0000 mgr.host8.jukgqm (mgr.4468627) 146 :
>>>> cephadm [DBG] mon container image my-private-repo/quay-io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add
>>>> 2023-04-12T11:35:56.525619+0000 mgr.host8.jukgqm (mgr.4468627) 147 :
>>>> cephadm [DBG] args: --image my-private-repo/quay-io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add ls
>>>> 2023-04-12T11:35:56.525738+0000 mgr.host8.jukgqm (mgr.4468627) 148 :
>>>> cephadm [DBG] Running command: sudo which python3
>>>> 2023-04-12T11:35:56.534227+0000 mgr.host8.jukgqm (mgr.4468627) 149 :
>>>> cephadm [DBG] Connection to host1 failed. Process exited with non-zero
>>>> exit status 3
>>>> 2023-04-12T11:35:56.534275+0000 mgr.host8.jukgqm (mgr.4468627) 150 :
>>>> cephadm [DBG] _reset_con close host1
>>>> 2023-04-12T11:35:56.540135+0000 mgr.host8.jukgqm (mgr.4468627) 158 :
>>>> cephadm [DBG] Host "host1" marked as offline. Skipping gather facts refresh
>>>> 2023-04-12T11:35:56.540178+0000 mgr.host8.jukgqm (mgr.4468627) 159 :
>>>> cephadm [DBG] Host "host1" marked as offline. Skipping network refresh
>>>> 2023-04-12T11:35:56.540408+0000 mgr.host8.jukgqm (mgr.4468627) 160 :
>>>> cephadm [DBG] Host "host1" marked as offline. Skipping device refresh
>>>> 2023-04-12T11:35:56.540490+0000 mgr.host8.jukgqm (mgr.4468627) 161 :
>>>> cephadm [DBG] Host "host1" marked as offline. Skipping osdspec preview refresh
>>>> 2023-04-12T11:35:56.540527+0000 mgr.host8.jukgqm (mgr.4468627) 162 :
>>>> cephadm [DBG] Host "host1" marked as offline. Skipping autotune
>>>> 2023-04-12T11:35:56.540978+0000 mgr.host8.jukgqm (mgr.4468627) 163 :
>>>> cephadm [DBG] Connection to host1 failed. Process exited with non-zero
>>>> exit status 3
>>>> 2023-04-12T11:35:56.796966+0000 mgr.host8.jukgqm (mgr.4468627) 728 :
>>>> cephadm [ERR] Upgrade: Paused due to UPGRADE_OFFLINE_HOST: Upgrade: Failed
>>>> to connect to host host1 at addr (x.x.x.x)
>>>>
>>>> As I can see here, it turns out sudo is added to the code to be able
>>>> to continue:
>>>>
>>>> https://github.com/ceph/ceph/blob/v17.2.5/src/pybind/mgr/cephadm/ssh.py#L143
>>>>
>>>> I cannot grant the cephadmin user sudo privileges for policy reasons;
>>>> could this be the root cause of the issue?
>>>>
>>>> Best regards,
>>>> Reza
>>>>
>>>> On Thu, 6 Apr 2023 at 14:59, Adam King <adking(a)redhat.com> wrote:
>>>>
>>>>> Does "ceph health detail" give any insight into what the unexpected
>>>>> exception was? If not, I'm pretty confident some traceback would end up
>>>>> being logged. Could maybe still grab it with "ceph log last 200 info
>>>>> cephadm" if not a lot else has happened. Also, probably need to find out if
>>>>> the check-host is failing due to the check on the host actually failing or
>>>>> failing to connect to the host. Could try putting a copy of the cephadm
>>>>> binary on one and running "cephadm check-host --expect-hostname <hostname>"
>>>>> where the hostname is the name cephadm knows the host by. If that's not an
>>>>> issue I'd expect it's a connection thing. Could maybe try going through
>>>>> https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#ssh-errors.
>>>>> Cephadm changed the backend ssh library from pacific to quincy due to the
>>>>> one used in pacific no longer being supported so it's possible some general
>>>>> ssh error has popped up in your env as a result.
>>>>>
>>>>> On Thu, Apr 6, 2023 at 8:38 AM Reza Bakhshayeshi <
>>>>> reza.b2008(a)gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have a problem regarding upgrading a Ceph cluster from Pacific to
>>>>>> Quincy with cephadm. I have successfully upgraded the cluster to the
>>>>>> latest Pacific (16.2.11). But when I run the following command to
>>>>>> upgrade the cluster to 17.2.5, after upgrading 3/4 mgrs, the upgrade
>>>>>> process stops with "Unexpected error". (Everything is on a private
>>>>>> network.)
>>>>>>
>>>>>> ceph orch upgrade start my-private-repo/quay-io/ceph/ceph:v17.2.5
>>>>>>
>>>>>> I also tried the 17.2.4 version.
>>>>>>
>>>>>> cephadm fails to check the hosts' status and marks them as offline:
>>>>>>
>>>>>> cephadm 2023-04-06T10:19:59.998510+0000 mgr.host9.arhpnd (mgr.4516356) 5782
>>>>>> : cephadm [DBG] host host4 (x.x.x.x) failed check
>>>>>> cephadm 2023-04-06T10:19:59.998553+0000 mgr.host9.arhpnd (mgr.4516356) 5783
>>>>>> : cephadm [DBG] Host "host4" marked as offline. Skipping daemon refresh
>>>>>> cephadm 2023-04-06T10:19:59.998581+0000 mgr.host9.arhpnd (mgr.4516356) 5784
>>>>>> : cephadm [DBG] Host "host4" marked as offline. Skipping gather facts refresh
>>>>>> cephadm 2023-04-06T10:19:59.998609+0000 mgr.host9.arhpnd (mgr.4516356) 5785
>>>>>> : cephadm [DBG] Host "host4" marked as offline. Skipping network refresh
>>>>>> cephadm 2023-04-06T10:19:59.998633+0000 mgr.host9.arhpnd (mgr.4516356) 5786
>>>>>> : cephadm [DBG] Host "host4" marked as offline. Skipping device refresh
>>>>>> cephadm 2023-04-06T10:19:59.998659+0000 mgr.host9.arhpnd (mgr.4516356) 5787
>>>>>> : cephadm [DBG] Host "host4" marked as offline. Skipping osdspec preview refresh
>>>>>> cephadm 2023-04-06T10:19:59.998682+0000 mgr.host9.arhpnd (mgr.4516356) 5788
>>>>>> : cephadm [DBG] Host "host4" marked as offline. Skipping autotune
>>>>>> cluster 2023-04-06T10:20:00.000151+0000 mon.host8 (mon.0) 158587 : cluster
>>>>>> [ERR] Health detail: HEALTH_ERR 9 hosts fail cephadm check; Upgrade: failed
>>>>>> due to an unexpected exception
>>>>>> cluster 2023-04-06T10:20:00.000191+0000 mon.host8 (mon.0) 158588 : cluster
>>>>>> [ERR] [WRN] CEPHADM_HOST_CHECK_FAILED: 9 hosts fail cephadm check
>>>>>> cluster 2023-04-06T10:20:00.000202+0000 mon.host8 (mon.0) 158589 : cluster
>>>>>> [ERR] host host7 (x.x.x.x) failed check: Unable to reach remote host
>>>>>> host7. Process exited with non-zero exit status 3
>>>>>> cluster 2023-04-06T10:20:00.000213+0000 mon.host8 (mon.0) 158590 : cluster
>>>>>> [ERR] host host2 (x.x.x.x) failed check: Unable to reach remote host
>>>>>> host2. Process exited with non-zero exit status 3
>>>>>> cluster 2023-04-06T10:20:00.000220+0000 mon.host8 (mon.0) 158591 : cluster
>>>>>> [ERR] host host8 (x.x.x.x) failed check: Unable to reach remote host
>>>>>> host8. Process exited with non-zero exit status 3
>>>>>> cluster 2023-04-06T10:20:00.000228+0000 mon.host8 (mon.0) 158592 : cluster
>>>>>> [ERR] host host4 (x.x.x.x) failed check: Unable to reach remote host
>>>>>> host4. Process exited with non-zero exit status 3
>>>>>> cluster 2023-04-06T10:20:00.000240+0000 mon.host8 (mon.0) 158593 : cluster
>>>>>> [ERR] host host3 (x.x.x.x) failed check: Unable to reach remote host
>>>>>> host3. Process exited with non-zero exit status 3
>>>>>>
>>>>>> and here are some outputs of the commands:
>>>>>>
>>>>>> [root@host8 ~]# ceph -s
>>>>>> cluster:
>>>>>> id: xxx
>>>>>> health: HEALTH_ERR
>>>>>> 9 hosts fail cephadm check
>>>>>> Upgrade: failed due to an unexpected exception
>>>>>>
>>>>>> services:
>>>>>> mon: 5 daemons, quorum host8,host1,host7,host2,host9 (age 2w)
>>>>>> mgr: host9.arhpnd(active, since 105m), standbys: host8.jowfih,
>>>>>> host1.warjsr, host2.qyavjj
>>>>>> mds: 1/1 daemons up, 3 standby
>>>>>> osd: 37 osds: 37 up (since 8h), 37 in (since 3w)
>>>>>>
>>>>>> data:
>>>>>>
>>>>>>
>>>>>> io:
>>>>>> client:
>>>>>>
>>>>>> progress:
>>>>>> Upgrade to 17.2.5 (0s)
>>>>>> [............................]
>>>>>>
>>>>>> [root@host8 ~]# ceph orch upgrade status
>>>>>> {
>>>>>> "target_image": "my-private-repo/quay-io/ceph/ceph@sha256
>>>>>> :34c763383e3323c6bb35f3f2229af9f466518d9db926111277f5e27ed543c427",
>>>>>> "in_progress": true,
>>>>>> "which": "Upgrading all daemon types on all hosts",
>>>>>> "services_complete": [],
>>>>>> "progress": "3/59 daemons upgraded",
>>>>>> "message": "Error: UPGRADE_EXCEPTION: Upgrade: failed due to an
>>>>>> unexpected exception",
>>>>>> "is_paused": true
>>>>>> }
>>>>>> [root@host8 ~]# ceph cephadm check-host host7
>>>>>> check-host failed:
>>>>>> Host 'host7' not found. Use 'ceph orch host ls' to see all managed
>>>>>> hosts.
>>>>>> [root@host8 ~]# ceph versions
>>>>>> {
>>>>>> "mon": {
>>>>>> "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 5
>>>>>> },
>>>>>> "mgr": {
>>>>>> "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 1,
>>>>>> "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
>>>>>> },
>>>>>> "osd": {
>>>>>> "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 37
>>>>>> },
>>>>>> "mds": {
>>>>>> "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 4
>>>>>> },
>>>>>> "overall": {
>>>>>> "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 47,
>>>>>> "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> The strange thing is I can roll back the cluster status by failing
>>>>>> over to a not-yet-upgraded mgr like this:
>>>>>>
>>>>>> ceph mgr fail
>>>>>> ceph orch upgrade start my-private-repo/quay-io/ceph/ceph:v16.2.11
>>>>>>
>>>>>> Would you happen to have any idea about this?
>>>>>>
>>>>>> Best regards,
>>>>>> Reza
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>>>>>
>>>>>>