For assessing the criticality of the MGR beacon loop of doom outage, during my somewhat
desperate attempts to get this under control, I saw this here:
-------------------------------------
[root@gnosis ~]# ceph status
  cluster:
    id:     ---
    health: HEALTH_WARN
            no active mgr
            Reduced data availability: 2545 pgs inactive
            too few PGs per OSD (9 < min 30)

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: no daemons active
    mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 288 osds: 268 up, 268 in

  data:
    pools:   10 pools, 2545 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             2545 unknown
-------------------------------------
Is an active MGR actually required for I/O, or is this output just a consequence of not
having an MGR to deliver the numbers? Well, it says HEALTH_WARN, so I really hope this is
just missing stats and not a complete service outage.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: 11 May 2020 19:43:24
To: Lenz Grimmer; ceph-users(a)ceph.io
Subject: [ceph-users] Re: Yet another meltdown starting
For everyone who does not want to read the details below: I now run with (dramatically?)
increased beacon grace periods for OSD (3600s) and MGR (90s) beacons, and I'm wondering
what the downside of this is and whether there are better tuning parameters for my issues.
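For reference, this is roughly how I applied the increased grace periods via the
centralized config. A sketch only: `mon_mgr_beacon_grace` and `mon_osd_report_timeout`
are the option names I believe govern these two beacon timeouts (defaults 30s and 900s,
respectively); the values are simply the ones from this post, not recommendations.

```shell
# Sketch: raise the beacon grace periods described above.
# mon_mgr_beacon_grace (default 30s) is how long the MONs wait for an
# MGR beacon before failing over to a standby; mon_osd_report_timeout
# (default 900s) plays the analogous role for OSD beacons.
ceph config set mon mon_mgr_beacon_grace 90
ceph config set mon mon_osd_report_timeout 3600

# Verify the values took effect:
ceph config get mon mon_mgr_beacon_grace
ceph config get mon mon_osd_report_timeout
```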
---
Hi Lenz,
I'm wondering about this as well. I was following this and other MGR threads about
dashboard crashes with great interest.
It is not exactly the same issue; we are still on mimic, and under normal circumstances
all queues are empty. However, I also have the feeling we are hitting a very specific
piece of code that has a comparably large execution time for little input. I was actually
surprised to read that Python code is involved in processing high-frequency events in a
non-distributed way.
It is said that the MGR is not a single point of failure, but this does not seem to be
true in the full sense. If some workload is not distributed but processed by only one
instance (in an active-passive way), then
- it does not scale,
- it becomes an effective single point of failure, since every instance suffers from the
same restriction, and
- fail-over will not help, as we see a healthy instance fail due to load.
This seems to be exactly what I'm observing. What started was a loop of doom:
- the active MGR fails to send its beacon in time,
- the MON marks the MGR out and elects the next available standby,
- the new MGR takes over, is hit by the same problem and does not send its beacon in time
either,
- the previous MGR reconnects in the meantime, but not before the next MGR is marked out.
After a while, the MONs were cyclically kicking out the active MGR. Each MGR stayed
active only for the beacon grace period and was then thrown out. Note that none of the
MGR processes crashed or died; everything was up and running. I observed a client-load
induced evict-reconnect cycle.
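The cycle above can be illustrated with a toy model (illustrative only, not Ceph code):
if every MGR instance suffers the same per-beacon processing stall, failing over to a
standby cannot break the loop once that stall exceeds the MON's beacon grace period.

```python
# Toy model of the failover loop: every mgr has the same processing
# stall, so promotion to the next standby never fixes anything while
# the stall exceeds the beacon grace period.

def failover_cycle(stall, beacon_grace, standbys=3, ticks=10):
    """Return the sequence of active mgr indices over `ticks` grace periods."""
    active = 0
    history = []
    for _ in range(ticks):
        history.append(active)
        if stall > beacon_grace:
            # Beacon arrives late -> MON marks the mgr out and
            # promotes the next available standby.
            active = (active + 1) % standbys
    return history

# With a 30s grace and a 45s stall, the active role cycles forever:
print(failover_cycle(45, 30))   # [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
# Raising the grace above the stall time stops the cycle:
print(failover_cycle(45, 90))   # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

This matches what increasing the grace period did for us: nothing about the load changed,
but the eviction loop stopped.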
What is suspicious is that I cannot see any significant increase of load or network
traffic on the MGR node during the critical time before the incident (there was, of
course, a huge increase of client traffic, up to the limit of the hardware). It looks
like something completely under the radar: a tiny bit of some very specific processing
has a huge impact under high client load, like the difference between compiled and
interpreted code in the issue you mentioned.
There also seem to be cumulative issues like memory leaks. I observe regular crashes of
the dashboard, and the dashboard also creates quite a large load for what little it does.
I will add this case to the list of references for a future thread, "Cluster outage due
to client IO", that I'm preparing. All the major issues I have observed lately relate
specifically to beacons sent to the MONs. I did not see any heartbeat failures. The
cluster was physically healthy (everything up and running) but not logically (the MONs
did not get the required information in time), and increasing the beacon grace periods
immediately restored logical cluster health. It looks like beacons are processed in a
different way than heartbeats and that there is a critical bottleneck somewhere.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Lenz Grimmer <lgrimmer(a)suse.com>
Sent: 11 May 2020 17:50:34
To: ceph-users(a)ceph.io
Subject: [ceph-users] Re: Yet another meltdown starting
Hi Frank,
On 5/11/20 3:03 PM, Frank Schilder wrote:
> OK, the command finally executed and it looks like the cluster is
> running stable for now. However, I'm afraid that 90s might not be
> sustainable.
> Questions: Can I leave the beacon_grace at 90s? Is there a better
> parameter to set? Why is the MGR getting overloaded on a rather small
> cluster with 160 OSDs? How does this scale?
I wonder if
https://tracker.ceph.com/issues/45439 might be related to
what you're observing here?
In this issue, Andras suggests: "Increasing mgr_stats_period to 15
seconds reduces the load and brings ceph-mgr back to responsive again."
Maybe that helps?
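That suggestion translates to something like the following (a sketch; as far as I know,
`mgr_stats_period` defaults to 5 seconds):

```shell
# Sketch of the suggested change from the tracker issue: raise the mgr
# stats period from its default (5 seconds) to 15 seconds to reduce
# the stats-processing load on the active mgr.
ceph config set mgr mgr_stats_period 15
```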
Lenz
--
SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg
GF: Felix Imendörffer, HRB 36809 (AG Nürnberg)
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io