Hi all,
there was another election after about 2 hours. I'm trying the stop+reboot procedure on
another mon now. Just for the record, I observe that when I stop one mon, another goes down
as a consequence:
[root@ceph-02 ~]# docker stop ceph-mon
ceph-mon
[root@ceph-02 ~]# ceph status
  cluster:
    id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
    health: HEALTH_WARN
            2/5 mons down, quorum ceph-01,ceph-25,ceph-26

  services:
    mon: 5 daemons, quorum (age 17M), out of quorum: ceph-01, ceph-02, ceph-03, ceph-25, ceph-26
    mgr: ceph-03(active, since 3d), standbys: ceph-26, ceph-02, ceph-25, ceph-01
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1260 osds: 1260 up (since 3d), 1260 in (since 3M)

  data:
    pools:   14 pools, 25065 pgs
    objects: 1.88G objects, 3.3 PiB
    usage:   4.1 PiB used, 9.0 PiB / 13 PiB avail
    pgs:     25038 active+clean
             25    active+clean+scrubbing+deep
             2     active+clean+scrubbing

  io:
    client: 562 MiB/s rd, 542 MiB/s wr, 3.77k op/s rd, 3.09k op/s wr

This looks like it should not happen either.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: Thursday, May 4, 2023 2:30 PM
To: Gregory Farnum; Dan van der Ster
Cc: ceph-users(a)ceph.io
Subject: [ceph-users] Re: Frequent calling monitor election
Hi all,
I think I can reduce the defcon level a bit. Since I couldn't see anything in the mon
log, I started to test whether a specific mon is causing the trouble by shutting them down
one by one for a while. I got lucky on the first try. Shutting down the leader stopped the
voting from happening.
I left it down for a while and rebooted the server. Then I started the mon again and there
has still not been a new election. It looks like the reboot finally cleared out the
problem.
This indicates that it might be a problem with the hardware, although the coincidence with
the MDS restart is striking and I doubt that it's just coincidence. Unfortunately, I
can't find anything in the logs or health monitoring. An fsck on the mon store also
found nothing.
Since this is a recurring issue, it would be great if someone could take a look at the
paste https://pastebin.com/hGPvVkuR to see if there is a clue.
Thanks a lot for your help!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: Thursday, May 4, 2023 1:01 PM
To: Gregory Farnum; Dan van der Ster
Cc: ceph-users(a)ceph.io
Subject: [ceph-users] Re: Frequent calling monitor election
Hi all,
I have to get back to this case. On Monday I had to restart an MDS to get rid of a stuck
client caps recall. Right after that fail-over, the MONs went into a voting frenzy again.
I already restarted all of them like last time, but this time it doesn't help. This
might be a different case.
In an effort to collect debug info, I set debug_mon on the leader to 10/10 and it's
producing voluminous output. Unfortunately, while debug_mon=10/10, the voting frenzy is
not happening. It seems that I'm in the situation described by "Tip: When
debug output slows down your system, the latency can hide race conditions." at
https://docs.ceph.com/en/octopus/rados/troubleshooting/log-and-debug/.
The election frequency is significantly lower when debug_mon=10/10. I managed to catch one
though and pasted the 20s before the election happened here:
https://pastebin.com/hGPvVkuR. I hope there is a clue; I can't see anything that sticks out.
Is there anything else I can look for?
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: Thursday, February 9, 2023 5:29 PM
To: Gregory Farnum; Dan van der Ster
Cc: ceph-users(a)ceph.io
Subject: [ceph-users] Re: Frequent calling monitor election
Hi Dan and Gregory,
thanks! These are good pointers. Will look into that tomorrow.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Gregory Farnum <gfarnum(a)redhat.com>
Sent: 09 February 2023 17:12:23
To: Dan van der Ster
Cc: Frank Schilder; ceph-users(a)ceph.io
Subject: Re: [ceph-users] Re: Frequent calling monitor election
Also, that the current leader (ceph-01) is one of the monitors
proposing an election each time suggests the problem is with getting
commit acks back from one of its followers.
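To make that observation easy to re-check on a longer log, here is a minimal Python sketch that tallies which mons call an election in each round. It assumes the dashboard line format seen in the excerpt quoted below, and it treats all calls sharing one timestamp as a single round, which is a simplification. The helper names (callers_per_round, always_calling) are made up for illustration and are not part of any Ceph tool.

```python
import re
from collections import defaultdict

# Sample lines copied from the cluster log excerpt quoted in this thread.
LOG = """\
2/9/23 4:42:30 PM[INF]overall HEALTH_OK
2/9/23 4:42:26 PM[INF]mon.ceph-01 calling monitor election
2/9/23 4:42:26 PM[INF]mon.ceph-26 calling monitor election
2/9/23 4:42:26 PM[INF]mon.ceph-25 calling monitor election
2/9/23 4:42:26 PM[INF]mon.ceph-02 calling monitor election
2/9/23 4:40:00 PM[INF]overall HEALTH_OK
2/9/23 4:24:29 PM[INF]mon.ceph-01 calling monitor election
2/9/23 4:24:29 PM[INF]mon.ceph-02 calling monitor election
2/9/23 4:24:29 PM[INF]mon.ceph-03 calling monitor election
2/9/23 4:24:29 PM[INF]mon.ceph-01 calling monitor election
2/9/23 4:24:29 PM[INF]mon.ceph-26 calling monitor election
2/9/23 4:24:29 PM[INF]mon.ceph-25 calling monitor election
2/9/23 4:24:29 PM[INF]mon.ceph-02 calling monitor election
2/9/23 4:23:59 PM[INF]mon.ceph-01 calling monitor election
2/9/23 4:23:59 PM[INF]mon.ceph-02 calling monitor election
2/9/23 3:43:08 PM[INF]mon.ceph-01 calling monitor election
2/9/23 3:43:08 PM[INF]mon.ceph-26 calling monitor election
2/9/23 3:43:08 PM[INF]mon.ceph-25 calling monitor election
"""

# Matches e.g. "2/9/23 4:42:26 PM[INF]mon.ceph-01 calling monitor election".
CALL_RE = re.compile(
    r"^(\S+ \S+ [AP]M)\[INF\]mon\.(\S+) calling monitor election$"
)


def callers_per_round(log_text):
    """Group election callers by timestamp (assumption: one timestamp = one round)."""
    rounds = defaultdict(set)
    for line in log_text.splitlines():
        match = CALL_RE.match(line.strip())
        if match:
            stamp, mon = match.groups()
            rounds[stamp].add(mon)
    return dict(rounds)


def always_calling(rounds):
    """Return the mons that called an election in every round."""
    sets = list(rounds.values())
    return set.intersection(*sets) if sets else set()


rounds = callers_per_round(LOG)
for stamp, mons in rounds.items():
    print(stamp, sorted(mons))
print("calling in every round:", sorted(always_calling(rounds)))
```

On the sample lines above, ceph-01 is the only mon that shows up in all four rounds. Wrapped "is new leader" lines and HEALTH_OK lines are ignored by the regular expression.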
On Thu, Feb 9, 2023 at 8:09 AM Dan van der Ster <dvanders(a)gmail.com> wrote:
Hi Frank,
Check the mon logs with some increased debug levels to find out what
the leader is busy with.
We have a similar issue (though, daily) and it turned out to be
related to the mon leader timing out doing a SMART check.
See https://tracker.ceph.com/issues/54313 for how I debugged that.
Cheers, Dan
On Thu, Feb 9, 2023 at 7:56 AM Frank Schilder <frans(a)dtu.dk> wrote:
Hi all,
our monitors have enjoyed democracy since the beginning. However, I don't share a
sudden excitement about voting:
2/9/23 4:42:30 PM[INF]overall HEALTH_OK
2/9/23 4:42:30 PM[INF]mon.ceph-01 is new leader, mons ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 in quorum (ranks 0,1,2,3,4)
2/9/23 4:42:26 PM[INF]mon.ceph-01 calling monitor election
2/9/23 4:42:26 PM[INF]mon.ceph-26 calling monitor election
2/9/23 4:42:26 PM[INF]mon.ceph-25 calling monitor election
2/9/23 4:42:26 PM[INF]mon.ceph-02 calling monitor election
2/9/23 4:40:00 PM[INF]overall HEALTH_OK
2/9/23 4:30:00 PM[INF]overall HEALTH_OK
2/9/23 4:24:34 PM[INF]overall HEALTH_OK
2/9/23 4:24:34 PM[INF]mon.ceph-01 is new leader, mons ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 in quorum (ranks 0,1,2,3,4)
2/9/23 4:24:29 PM[INF]mon.ceph-01 calling monitor election
2/9/23 4:24:29 PM[INF]mon.ceph-02 calling monitor election
2/9/23 4:24:29 PM[INF]mon.ceph-03 calling monitor election
2/9/23 4:24:29 PM[INF]mon.ceph-01 calling monitor election
2/9/23 4:24:29 PM[INF]mon.ceph-26 calling monitor election
2/9/23 4:24:29 PM[INF]mon.ceph-25 calling monitor election
2/9/23 4:24:29 PM[INF]mon.ceph-02 calling monitor election
2/9/23 4:24:04 PM[INF]overall HEALTH_OK
2/9/23 4:24:03 PM[INF]mon.ceph-01 is new leader, mons ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 in quorum (ranks 0,1,2,3,4)
2/9/23 4:23:59 PM[INF]mon.ceph-01 calling monitor election
2/9/23 4:23:59 PM[INF]mon.ceph-02 calling monitor election
2/9/23 4:20:00 PM[INF]overall HEALTH_OK
2/9/23 4:10:00 PM[INF]overall HEALTH_OK
2/9/23 4:00:00 PM[INF]overall HEALTH_OK
2/9/23 3:50:00 PM[INF]overall HEALTH_OK
2/9/23 3:43:13 PM[INF]overall HEALTH_OK
2/9/23 3:43:13 PM[INF]mon.ceph-01 is new leader, mons ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 in quorum (ranks 0,1,2,3,4)
2/9/23 3:43:08 PM[INF]mon.ceph-01 calling monitor election
2/9/23 3:43:08 PM[INF]mon.ceph-26 calling monitor election
2/9/23 3:43:08 PM[INF]mon.ceph-25 calling monitor election
We moved a switch from one rack to another and after the switch came back up, the
monitors frequently bitch about who is the alpha. How do I get them to focus more on their
daily duties again?
Thanks for any help!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io