That's how we noticed it too. Our graphs went silent after the upgrade completed. Is
your large cluster over 350 OSDs?
On Dec 18, 2019, at 2:59 PM, Paul Mezzanini
Just wanted to say that we are seeing the same thing on our large cluster. It manifested
mainly in the form of Prometheus stats being totally broken (they take so long to return,
if they return at all, that the requesting program just gives up).
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o: (585) 475-3245 | firstname.lastname@example.org
Sent from my phone. Please excuse any brevity or typos.
From: Bryan Stillwell <firstname.lastname@example.org>
Sent: Wednesday, December 18, 2019 4:44:45 PM
To: Sage Weil <firstname.lastname@example.org>
Cc: ceph-users <firstname.lastname@example.org>
Subject: [ceph-users] Re: High CPU usage by ceph-mgr in 14.2.5
On Dec 18, 2019, at 11:58 AM, Sage Weil
On Wed, 18 Dec 2019, Bryan Stillwell wrote:
After upgrading one of our clusters from Nautilus 14.2.2 to Nautilus 14.2.5 I'm seeing
100% CPU usage by a single ceph-mgr thread (found using 'top -H'). Attaching to
the thread with strace shows a lot of mmap and munmap calls. Here's the distribution
after watching it for a few minutes:
48.73% - mmap
49.48% - munmap
1.75% - futex
0.05% - madvise
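For reference, a syscall-time distribution like the one above can be collected with strace's summary mode rather than eyeballing the raw trace. This is a sketch; <TID> is a placeholder for the busy thread ID found with 'top -H', and it requires root:

```shell
# Attach to the hot ceph-mgr thread for ~60 seconds; on termination,
# strace -c prints a table of syscall counts and time percentages.
sudo timeout 60 strace -c -p <TID>
```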
I've upgraded 3 other clusters so far (120 OSDs, 30 OSDs, 200 OSDs), but this is the
only one which has seen the problem (355 OSDs). Perhaps it has something to do with its
size. I was suspecting it might have to do with one of the modules misbehaving, so I
disabled all of them:
# ceph mgr module ls | jq -r '.enabled_modules'
But that didn't help (I restarted the mgrs after disabling the modules too).
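Assuming the standard ceph CLI, disabling every enabled module can be scripted along these lines (a sketch; module names vary per cluster, and "always on" modules can't be disabled):

```shell
# Disable each module listed in the enabled_modules array.
for m in $(ceph mgr module ls | jq -r '.enabled_modules[]'); do
    ceph mgr module disable "$m"
done
# Then restart the managers on each mgr host, e.g.:
# systemctl restart ceph-mgr.target
```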
I also tried setting debug_mgr and debug_mgrc to 20, but nothing popped out at me as being
the cause of the problem.
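For reference, those debug levels can be raised at runtime through the centralized config (a sketch, assuming Nautilus's `ceph config` interface):

```shell
# Raise mgr debug logging cluster-wide; remember to lower it afterwards.
ceph config set mgr debug_mgr 20
ceph config set mgr debug_mgrc 20
```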
It only seems to affect the active mgr. If I stop the active mgr the problem moves to one
of the other mgrs.
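One way to reproduce that failover deliberately, rather than stopping the daemon, is to fail the active mgr so a standby takes over (a sketch using the active_name field from `ceph mgr dump`):

```shell
# Force a manager failover; the problem should follow the new active mgr.
ceph mgr fail $(ceph mgr dump | jq -r '.active_name')
```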
Any guesses or tips on what next steps I should take to figure out what's going on?
What are the balancer modes on the affected and unaffected cluster(s)?
Affected cluster has a balancer mode of "none".
The other three are "upmap", "none", and "upmap".
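For anyone comparing their own clusters, the mode can be checked with the balancer status command (a sketch; output is JSON including the current mode, e.g. "none" or "upmap"):

```shell
ceph balancer status
```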
I don't know if you saw in ceph-users, but this bug report seems to point at the