On Tue, May 4, 2021 at 4:21 PM Janne Johansson <icepic.dz(a)gmail.com> wrote:
On Tue, May 4, 2021 at 4:10 PM Rainer Krienke <krienke(a)uni-koblenz.de> wrote:
Hello,
I am playing around with a test ceph 14.2.20 cluster. The cluster
consists of 4 VMs, each VM has 2 OSDs. The first three VMs vceph1,
vceph2 and vceph3 are monitors. vceph1 is also mgr.
What I did was quite simple. The cluster is in the state HEALTHY:
vceph2: systemctl stop ceph-osd@2
# let ceph repair until ceph -s reports cluster is healthy again
vceph2: systemctl start ceph-osd@2 # @ 15:39:15, for the logs
# ceph -s reports that all 8 OSDs are up and in, then
# rebalancing of osd.2 starts
vceph2: ceph -s # hangs forever, also when executed on vceph3 or vceph4
# mon on vceph1 permanently eats 100% CPU, the other mons ~0% CPU
vceph1: systemctl stop ceph-mon@vceph1 # wait ~30 sec to terminate
vceph1: systemctl start ceph-mon@vceph1 # Everything is OK again
I posted the mon-log to:
https://cloud.uni-koblenz.de/s/t8tWjWFAobZb5Hy
Strangely enough, if I set "debug mon 20" before starting the experiment,
this bug does not show up. I also tried the very same procedure on the
same cluster updated to 15.2.11, but I was unable to reproduce the bug
in that ceph version.
I might have run into the same issue recently, except not on a test
cluster but on a live system,
also running 14.2.20 like you. We have (for other reasons) some flapping OSDs,
and repairs/backfills take a lot of time. While the mons might have had slightly
less memory than they should have, they didn't OOM or anything;
we found ourselves in the situation where one mon would eat 100% CPU,
not log anything of value at all, and the other two would be all but idle.
Restarting the 100%-CPU mon finally allowed us to get on with the rest
of the recovery.
Same question as above -- does your mgr log negative progress at level 4?
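For reference, here's a rough sketch of how one might raise the mgr debug level and scan its log for progress-module messages. The log path is an assumption (the default location on many packaged installs); adjust for your deployment:

```shell
# Raise mgr debug logging to level 4 via the config database
ceph config set mgr debug_mgr 4/4

# Scan the active mgr's log for progress-module output
# (log path is an assumption -- adjust to your setup)
grep -i progress /var/log/ceph/ceph-mgr.*.log | tail -n 20

# Revert to the default debug level when done
ceph config rm mgr debug_mgr
```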
BTW, if you find that this is indeed what's blocking your mons, you
can work around it by setting `ceph progress off` until the fixes are
released.
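Concretely, the workaround and its reversal might look like this (assuming you are on a release that includes the progress on/off switch):

```shell
# Disable the mgr progress module's event reporting as a workaround
ceph progress off

# Once you've upgraded to a release containing the fixes, re-enable it
ceph progress on
```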
-- Dan