On Fri, Apr 9, 2021 at 7:24 PM Robert LeBlanc <robert(a)leblancnet.us> wrote:
> On Fri, Apr 9, 2021 at 11:05 AM Dan van der Ster <dan(a)vanderster.com> wrote:
> > Hi Robert,
> > Have you checked a log with debug_mon=20 yet to try to see what it's doing?
>
> I've posted the logs with debug_mon=20 for a period during high CPU here:
> https://owncloud.leblancnet.us/owncloud/index.php/s/OtHsBAYN9r5eSbU
> You can look near the end of the log for the verbose logging. I'm not
> sure what to look for in there; nothing sticks out to me. I did
> disable cephx in the config file to see if that would help, but we
> still have the 100% CPU.

Thanks. Nothing in there looked obviously wrong to me.
But I did notice the nearfull warnings so I wonder if this cluster is
churning through osdmaps? Did you see a large increase in inbound or
outbound network traffic on this mon following the upgrade?
Totally speculating here, but maybe you have some old clients that
can't decode an incremental osdmap from a Nautilus mon, so the single
mon is busy serving up these maps to the clients.
Does the mon load decrease if you stop the osdmap churn, e.g. by
setting norebalance, if that is indeed ongoing?
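
One rough way to check the churn theory is to sample the osdmap epoch twice and see how fast it advances. The sketch below is only an illustration: it assumes the `ceph` CLI and an admin keyring are available on the host, and the function names are made up for this example.

```python
import json
import subprocess
import time


def osdmap_epoch() -> int:
    """Return the current osdmap epoch as reported by `ceph osd dump`."""
    out = subprocess.run(
        ["ceph", "osd", "dump", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return int(json.loads(out)["epoch"])


def epochs_per_minute(first: int, last: int, seconds: float) -> float:
    """Convert an epoch delta observed over `seconds` into epochs per minute."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (last - first) * 60.0 / seconds


def sample_churn(interval: float = 60.0) -> float:
    """Sample the epoch twice, `interval` seconds apart (hypothetical helper)."""
    start = osdmap_epoch()
    t0 = time.monotonic()
    time.sleep(interval)
    end = osdmap_epoch()
    return epochs_per_minute(start, end, time.monotonic() - t0)
```

Calling `sample_churn()` once before and once after `ceph osd set norebalance` would show whether the flag actually slows the rate of new osdmaps. A rate near zero in both cases would suggest the churn theory is wrong.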
Could you also share a log with debug_ms=1 covering a minute or so
while the mon's CPU is busy?
-- dan
> Thank you,
> Robert LeBlanc