I think the orchestrator is trying to bring it up but it's not
starting (see the errors from my previous e-mail) - the container won't
start even when I try to start it manually.

The placement is the default one; Ceph started the mons automatically
on all my hosts because I only have 3 hosts and the default mon count is 5.
root@node01:/home/adrian# ceph orch ls
NAME                       PORTS   RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager                       1/1      16m ago    5h   count:1
crash                              3/3      16m ago    5h   *
grafana                            1/1      16m ago    5h   count:1
mgr                                2/2      16m ago    5h   count:2
mon                                3/5      16m ago    10h  count:5
node-exporter                      3/3      16m ago    5h   *
osd.all-available-devices          12/15    16m ago    4h   *
prometheus                         1/1      16m ago    5h   count:1
rgw.digi1                  ?:8000  3/3      16m ago    3h   node01;node02;node03;count:3
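
(If I end up pinning the mon placement explicitly as you suggested, I assume the
command would be something like the following - the hostnames are my three nodes,
I haven't actually applied it yet:

ceph orch apply mon --placement="node01,node02,node03"

which should make the orchestrator keep exactly these three mons instead of the
default count:5.)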
I've added the hosts using only the hostnames:
root@node01:/home/adrian# ceph orch host ls
HOST    ADDR          LABELS  STATUS
node01  192.168.80.2
node02  node02
node03  node03
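
(For reference, I suppose the cleaner way would have been to add them with explicit
addresses, e.g. - the two IPs below are just examples from my subnet, not verified:

ceph orch host add node02 192.168.80.3
ceph orch host add node03 192.168.80.4

so that the ADDR column holds a real IP instead of the bare hostname.)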
On 5/23/2021 7:52 PM, 胡 玮文 wrote:
> So the orchestrator is aware that the mon is stopped, but has not tried to bring it
> up again. What is the placement of mon shown in “ceph orch ls”? I explicitly set it
> to all host names (e.g. node01;node02;node03), and haven’t experienced this.
>
>> On 5/24/2021, 00:35, Adrian Nicolae <adrian.nicolae@rcs-rds.ro> wrote:
>>
>> Hi,
>>
>> I waited for more than a day after the first mon failure; it didn't resolve
>> automatically.
>>
>> I checked with 'ceph status' and also the ceph.conf on that host, and the failed
>> mon was removed from the monmap. The cluster reported only 2 mons (instead of 3)
>> and the third mon was completely removed from the config; it wasn't reported as a
>> failure in 'ceph status'.
>>
>>
>>> On 5/23/2021 7:30 PM, 胡 玮文 wrote:
>>> Hi Adrian,
>>>
>>> I have not tried, but I think it will resolve itself automatically after some
>>> minutes. How long did you wait before you did the manual redeploy?
>>>
>>> Could you also try “ceph mon dump” to see whether mon.node03 was actually
>>> removed from the monmap when it failed to start?
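>>>
>>> Something along these lines should show whether it is still listed (the grep is
>>> just for convenience):
>>>
>>> ceph mon dump | grep node03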
>>>
>>>>> On 5/23/2021, 16:40, Adrian Nicolae <adrian.nicolae@rcs-rds.ro> wrote:
>>>> Hi guys,
>>>>
>>>> I'm testing Ceph Pacific 16.2.4 in my lab before deciding whether to put it
>>>> in production on a 1PB+ storage cluster with rgw-only access.
>>>>
>>>> I noticed a weird issue with my mons:
>>>>
>>>> - if I reboot a mon host, the ceph-mon container does not start after the
>>>> reboot
>>>>
>>>> - 'ceph orch ps' shows the following output:
>>>>
>>>> mon.node01  node01  running (20h)   4m ago   20h   16.2.4     8d91d370c2b8  0a2e86af94b2
>>>> mon.node02  node02  running (115m)  12s ago  115m  16.2.4     8d91d370c2b8  51f4885a1b06
>>>> mon.node03  node03  stopped         4m ago   19h   <unknown>  <unknown>     <unknown>
>>>>
>>>> (where node03 is the host that was rebooted).
>>>>
>>>> - I tried to start the mon container manually on node03 with
>>>> '/bin/bash /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run'
>>>> and got the following output:
>>>>
>>>> debug 2021-05-23T08:24:25.192+0000 7f9a9e358700 0 mon.node03@-1(???).osd e408 crush map has features 3314933069573799936, adjusting msgr requires
>>>> debug 2021-05-23T08:24:25.192+0000 7f9a9e358700 0 mon.node03@-1(???).osd e408 crush map has features 432629308056666112, adjusting msgr requires
>>>> debug 2021-05-23T08:24:25.192+0000 7f9a9e358700 0 mon.node03@-1(???).osd e408 crush map has features 432629308056666112, adjusting msgr requires
>>>> debug 2021-05-23T08:24:25.192+0000 7f9a9e358700 0 mon.node03@-1(???).osd e408 crush map has features 432629308056666112, adjusting msgr requires
>>>> cluster 2021-05-23T08:07:12.189243+0000 mgr.node01.ksitls (mgr.14164) 36380 : cluster [DBG] pgmap v36392: 417 pgs: 417 active+clean; 33 KiB data, 605 MiB used, 651 GiB / 652 GiB avail; 9.6 KiB/s rd, 0 B/s wr, 15 op/s
>>>> debug 2021-05-23T08:24:25.196+0000 7f9a9e358700 1 mon.node03@-1(???).paxosservice(auth 1..51) refresh upgraded, format 0 -> 3
>>>> debug 2021-05-23T08:24:25.208+0000 7f9a88176700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7f9a88176700' had timed out after 0.000000000s
>>>> debug 2021-05-23T08:24:25.208+0000 7f9a9e358700 0 mon.node03@-1(probing) e5 my rank is now 1 (was -1)
>>>> debug 2021-05-23T08:24:25.212+0000 7f9a87975700 0 mon.node03@1(probing) e6 removed from monmap, suicide.
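>>>>
>>>> (I assume the same messages can also be retrieved afterwards from the container
>>>> engine, e.g. with docker - the container name below is the one from the ExecStop
>>>> line further down:
>>>>
>>>> docker logs ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3-mon.node03
>>>>
>>>> The 'removed from monmap, suicide.' line at the end looks like the reason the
>>>> daemon exits cleanly.)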
>>>>
>>>> root@node03:/home/adrian# systemctl status ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service
>>>> ● ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service - Ceph mon.node03 for c2d41ac4-baf5-11eb-865d-2dc838a337a3
>>>>      Loaded: loaded (/etc/systemd/system/ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@.service; enabled; vendor preset: enabled)
>>>>      Active: inactive (dead) since Sun 2021-05-23 08:10:00 UTC; 16min ago
>>>>     Process: 1176 ExecStart=/bin/bash /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run (code=exited, status=0/SUCCESS)
>>>>     Process: 1855 ExecStop=/usr/bin/docker stop ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3-mon.node03 (code=exited, status=1/FAILURE)
>>>>     Process: 1861 ExecStopPost=/bin/bash /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.poststop (code=exited, status=0/SUCCESS)
>>>>    Main PID: 1176 (code=exited, status=0/SUCCESS)
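>>>>
>>>> (The systemd journal for that unit can presumably be checked the usual way,
>>>> which should show the same messages as above:
>>>>
>>>> journalctl -u ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service)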
>>>>
>>>> The only fix I could find was to redeploy the mon with:
>>>>
>>>> ceph orch daemon rm mon.node03 --force
>>>> ceph orch daemon add mon node03
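>>>>
>>>> (After the redeploy I suppose the result can be sanity-checked with:
>>>>
>>>> ceph orch ps
>>>> ceph mon stat
>>>>
>>>> to confirm the mon is running again and back in quorum.)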
>>>>
>>>> However, even if it works after the redeploy, an issue like that doesn't give
>>>> me much confidence to use it in a production environment. I could reproduce it
>>>> with 2 different mons, so it's not just a one-off.
>>>>
>>>> My setup is based on Ubuntu 20.04 and docker instead of podman:
>>>>
>>>> root@node01:~# docker -v
>>>> Docker version 20.10.6, build 370c289
>>>>
>>>> Do you know a workaround for this issue, or is this a known bug? I noticed some
>>>> other reports of the same behaviour in Octopus as well, and the solution at that
>>>> time was to delete the /var/lib/ceph/mon folder.
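>>>>
>>>> (If I understand that older workaround correctly, on a cephadm cluster it would
>>>> translate to something like the following - untested and destructive, the data
>>>> directory path is my guess based on the unit.run path above:
>>>>
>>>> systemctl stop ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service
>>>> mv /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03 /root/mon.node03.bak
>>>> ceph orch daemon add mon node03
>>>>
>>>> but I'd rather not rely on that in production.)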
>>>>
>>>>
>>>> Thanks.
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>> To unsubscribe send an email to ceph-users-leave@ceph.io