I managed to remove that wrongly created cluster on the node by running:
sudo cephadm rm-cluster --fsid 91a86f20-8083-40b1-8bf1-fe35fac3d677 --force
So I am getting closer, but the osd.2 service on that node simply does not want to start, as
you can see below:
# ceph orch daemon start osd.2
Scheduled to start osd.2 on host 'ceph1f'
# ceph orch ps | grep osd.2
osd.2  ceph1f  unknown  2m ago  -  <unknown>  <unknown>  <unknown>  <unknown>
In the log files I see the following:
5/27/21 2:47:34 PM [ERR] `ceph1f: cephadm unit osd.2 start` failed
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1451, in _daemon_action
    ['--name', name, a])
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1168, in _run_cephadm
    code, '\n'.join(err)))
orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:
Job for ceph-8d47792c-987d-11eb-9bb6-a5302e00e1fa@osd.2.service failed because the control process exited with error code.
See "systemctl status ceph-8d47792c-987d-11eb-9bb6-a5302e00e1fa@osd.2.service" and "journalctl -xe" for details.
Traceback (most recent call last):
  File "<stdin>", line 6159, in <module>
  File "<stdin>", line 1310, in _infer_fsid
  File "<stdin>", line 3655, in command_unit
  File "<stdin>", line 1072, in call_throws
RuntimeError: Failed command: systemctl start ceph-8d47792c-987d-11eb-9bb6-a5302e00e1fa@osd.2

5/27/21 2:47:34 PM [ERR] cephadm exited with an error code: 1, stderr:
Job for ceph-8d47792c-987d-11eb-9bb6-a5302e00e1fa@osd.2.service failed because the control process exited with error code.
See "systemctl status ceph-8d47792c-987d-11eb-9bb6-a5302e00e1fa@osd.2.service" and "journalctl -xe" for details.
Traceback (most recent call last):
  File "<stdin>", line 6159, in <module>
  File "<stdin>", line 1310, in _infer_fsid
  File "<stdin>", line 3655, in command_unit
  File "<stdin>", line 1072, in call_throws
RuntimeError: Failed command: systemctl start ceph-8d47792c-987d-11eb-9bb6-a5302e00e1fa@osd.2
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1021, in _remote_connection
    yield (conn, connr)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1168, in _run_cephadm
    code, '\n'.join(err)))
orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:
Job for ceph-8d47792c-987d-11eb-9bb6-a5302e00e1fa@osd.2.service failed because the control process exited with error code.
See "systemctl status ceph-8d47792c-987d-11eb-9bb6-a5302e00e1fa@osd.2.service" and "journalctl -xe" for details.
Traceback (most recent call last):
  File "<stdin>", line 6159, in <module>
  File "<stdin>", line 1310, in _infer_fsid
  File "<stdin>", line 3655, in command_unit
  File "<stdin>", line 1072, in call_throws
RuntimeError: Failed command: systemctl start ceph-8d47792c-987d-11eb-9bb6-a5302e00e1fa@osd.2
And finally the systemctl "status" of that osd.2 service on the OSD node:
ubuntu@ceph1f:/var/lib/ceph$ sudo systemctl status ceph-8d47792c-987d-11eb-9bb6-a5302e00e1fa@osd.2.service
● ceph-8d47792c-987d-11eb-9bb6-a5302e00e1fa@osd.2.service - Ceph osd.2 for 8d47792c-987d-11eb-9bb6-a5302e00e1fa
     Loaded: loaded (/etc/systemd/system/ceph-8d47792c-987d-11eb-9bb6-a5302e00e1fa@.service; disabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Thu 2021-05-27 14:48:24 CEST; 20s ago
    Process: 56163 ExecStartPre=/bin/rm -f //run/ceph-8d47792c-987d-11eb-9bb6-a5302e00e1fa@osd.2.service-pid //run/ceph-8d47792c-987d-11eb-9bb6-a5302e00e1fa@osd.2.service-cid (code=exited, status=0/SUCCESS)
    Process: 56164 ExecStart=/bin/bash /var/lib/ceph/8d47792c-987d-11eb-9bb6-a5302e00e1fa/osd.2/unit.run (code=exited, status=127)
    Process: 56165 ExecStopPost=/bin/bash /var/lib/ceph/8d47792c-987d-11eb-9bb6-a5302e00e1fa/osd.2/unit.poststop (code=exited, status=127)
    Process: 56166 ExecStopPost=/bin/rm -f //run/ceph-8d47792c-987d-11eb-9bb6-a5302e00e1fa@osd.2.service-pid //run/ceph-8d47792c-987d-11eb-9bb6-a5302e00e1fa@osd.2.service-cid (code=exited, status=0/SUCCESS)

May 27 14:48:14 ceph1f systemd[1]: Failed to start Ceph osd.2 for 8d47792c-987d-11eb-9bb6-a5302e00e1fa.
May 27 14:48:24 ceph1f systemd[1]: ceph-8d47792c-987d-11eb-9bb6-a5302e00e1fa@osd.2.service: Scheduled restart job, restart counter is at 5.
May 27 14:48:24 ceph1f systemd[1]: Stopped Ceph osd.2 for 8d47792c-987d-11eb-9bb6-a5302e00e1fa.
May 27 14:48:24 ceph1f systemd[1]: ceph-8d47792c-987d-11eb-9bb6-a5302e00e1fa@osd.2.service: Start request repeated too quickly.
May 27 14:48:24 ceph1f systemd[1]: ceph-8d47792c-987d-11eb-9bb6-a5302e00e1fa@osd.2.service: Failed with result 'exit-code'.
May 27 14:48:24 ceph1f systemd[1]: Failed to start Ceph osd.2 for 8d47792c-987d-11eb-9bb6-a5302e00e1fa.
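One detail worth noting in the status above: ExecStart (the unit.run script) exited with status=127, which in bash conventionally means "command not found". Since the OS on this node was just re-installed, that would fit a missing binary called by unit.run (e.g. the container runtime). This is only a diagnostic sketch, not from the thread; the unit.run path is taken from the status output above, and the `podman`/`docker` check is an assumption about which runtime cephadm configured:

```shell
# Exit status 127 from bash conventionally means "command not found".
# Quick local demonstration (no cluster needed); the command name below
# is deliberately bogus:
bash -c 'definitely-not-a-real-command-12345' 2>/dev/null
echo "exit status: $?"   # prints: exit status: 127

# On the OSD node itself, one could inspect what unit.run actually tries
# to execute (path taken from the systemd status output), e.g.:
#   sudo head /var/lib/ceph/8d47792c-987d-11eb-9bb6-a5302e00e1fa/osd.2/unit.run
# and verify the container runtime it calls is installed, e.g.:
#   command -v podman docker
```

If unit.run invokes a runtime that is no longer present after the re-install, installing it again should let the service start.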
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Thursday, May 27, 2021 2:22 PM, mabi <mabi@protonmail.ch> wrote:
I am trying to run "cephadm shell" on that
newly installed OSD node and it seems that I have now unfortunately configured a new
cluster ID as it shows:
ubuntu@ceph1f:~$ sudo cephadm shell
ERROR: Cannot infer an fsid, one must be specified:
['8d47792c-987d-11eb-9bb6-a5302e00e1fa',
'91a86f20-8083-40b1-8bf1-fe35fac3d677']
Maybe this is causing trouble... So is there any way to remove the wrongly created new
cluster ID 91a86f20-8083-40b1-8bf1-fe35fac3d677?
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Thursday, May 27, 2021 12:58 PM, mabi <mabi@protonmail.ch> wrote:
> You are right, I used the FSID of the OSD and not of the cluster in the deploy
command. So now I tried again with the cluster ID as FSID but still it does not work as
you can see below:
> ubuntu@ceph1f:~$ sudo cephadm deploy --name osd.2 --fsid 8d47792c-987d-11eb-9bb6-a5302e00e1fa
> Deploy daemon osd.2 ...
> Traceback (most recent call last):
>   File "/usr/local/sbin/cephadm", line 6223, in <module>
>     r = args.func()
>   File "/usr/local/sbin/cephadm", line 1440, in _default_image
>     return func()
>   File "/usr/local/sbin/cephadm", line 3457, in command_deploy
>     deploy_daemon(args.fsid, daemon_type, daemon_id, c, uid, gid,
>   File "/usr/local/sbin/cephadm", line 2193, in deploy_daemon
>     deploy_daemon_units(fsid, uid, gid, daemon_type, daemon_id, c,
>   File "/usr/local/sbin/cephadm", line 2255, in deploy_daemon_units
>     assert osd_fsid
> AssertionError
> In case that's of any help, here is the output of the "cephadm ceph-volume lvm list" command:
> ====== osd.2 =======
>
>   [block]  /dev/ceph-cca8abe6-cf9b-4c2f-ab81-ae0758585414/osd-block-91a86f20-8083-40b1-8bf1-fe35fac3d677
>
>       block device          /dev/ceph-cca8abe6-cf9b-4c2f-ab81-ae0758585414/osd-block-91a86f20-8083-40b1-8bf1-fe35fac3d677
>       block uuid            W3omTg-vami-RB0V-CkVb-cgpb-88Jy-pIK2Tz
>       cephx lockbox secret
>       cluster fsid          8d47792c-987d-11eb-9bb6-a5302e00e1fa
>       cluster name          ceph
>       crush device class    None
>       encrypted             0
>       osd fsid              91a86f20-8083-40b1-8bf1-fe35fac3d677
>       osd id                2
>       osdspec affinity      all-available-devices
>       type                  block
>       vdo                   0
>       devices               /dev/sda
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Thursday, May 27, 2021 12:32 PM, Eugen Block <eblock@nde.ag> wrote:
>
> > > ubuntu@ceph1f:~$ sudo cephadm deploy --name osd.2 --fsid
> >
> > > 91a86f20-8083-40b1-8bf1-fe35fac3d677
> > > Deploy daemon osd.2 ...
> >
> > Which fsid is it, the cluster's or the OSD's? According to the
> > 'cephadm deploy' help page it should be the cluster fsid.
> > Quoting mabi <mabi@protonmail.ch>:
> >
> > > Hi Eugen,
> > > What a good coincidence ;-)
> > > So I ran "cephadm ceph-volume lvm list" on the OSD node which I
> > > re-installed and it saw my osd.2 OSD. So far so good, but the
> > > following suggested command does not work as you can see below:
> > > ubuntu@ceph1f:~$ sudo cephadm deploy --name osd.2 --fsid
> > > 91a86f20-8083-40b1-8bf1-fe35fac3d677
> > > Deploy daemon osd.2 ...
> > > Traceback (most recent call last):
> > > File "/usr/local/sbin/cephadm", line 6223, in <module>
> > > r = args.func()
> > > File "/usr/local/sbin/cephadm", line 1440, in _default_image
> > > return func()
> > > File "/usr/local/sbin/cephadm", line 3457, in command_deploy
> > > deploy_daemon(args.fsid, daemon_type, daemon_id, c, uid, gid,
> > > File "/usr/local/sbin/cephadm", line 2193, in deploy_daemon
> > > deploy_daemon_units(fsid, uid, gid, daemon_type, daemon_id, c,
> > > File "/usr/local/sbin/cephadm", line 2255, in deploy_daemon_units
> > > assert osd_fsid
> > > AssertionError
> > > Any ideas what is wrong here?
> > > Regards,
> > > Mabi
> > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > On Thursday, May 27, 2021 12:13 PM, Eugen Block <eblock@nde.ag> wrote:
> > >
> > > > Hi,
> > > > I posted a link to the docs [1], [2] just yesterday ;-)
> > > > You should see the respective OSD in the output of 'cephadm
> > > > ceph-volume lvm list' on that node. You should then be able to get it
> > > > back to cephadm with
> > > > cephadm deploy --name osd.x
> > > > But I haven't tried this yet myself, so please report back if that
> > > > works for you.
> > > > Regards,
> > > > Eugen
> > > > [1] https://tracker.ceph.com/issues/49159
> > > > [2] https://tracker.ceph.com/issues/46691
> > > > Quoting mabi <mabi@protonmail.ch>:
> > > >
> > > > > Hello,
> > > > > I have by mistake re-installed the OS of an OSD node of my Octopus
> > > > > cluster (managed by cephadm). Luckily the OSD data is on a separate
> > > > > disk and did not get affected by the re-install.
> > > > > Now I have the following state:
> > > > >
> > > > > health: HEALTH_WARN
> > > > > 1 stray daemon(s) not managed by cephadm
> > > > > 1 osds down
> > > > > 1 host (1 osds) down
> > > > >
> > > > >
> > > > > To fix that I tried to run:
> > > > > ceph orch daemon add osd ceph1f:/dev/sda
> > > > >
=====================================================================
> > > > > Created no osd(s) on host ceph1f; already created?
> > > > > That did not work, so I tried:
> > > > > ceph cephadm osd activate ceph1f
> > > > >
===================================================================================================================
> > > > > no valid command found; 10 closest matches:
> > > > > ...
> > > > > Error EINVAL: invalid command
> > > > > Did not work either. So I wanted to ask how I can "adopt" an
> > > > > OSD disk back into my cluster?
> > > > > Thanks for your help.
> > > > > Regards,
> > > > > Mabi
> > > > > ceph-users mailing list -- ceph-users@ceph.io
> > > > > To unsubscribe send an email to ceph-users-leave@ceph.io
> > > >
> >