On Fri, Apr 9, 2021 at 9:37 PM Dan van der Ster <dan(a)vanderster.com> wrote:
On Fri, Apr 9, 2021 at 8:39 PM Robert LeBlanc <robert(a)leblancnet.us> wrote:
On Fri, Apr 9, 2021 at 11:49 AM Dan van der Ster <dan(a)vanderster.com> wrote:
Thanks. I didn't see anything ultra obvious to me.
But I did notice the nearfull warnings so I wonder if this cluster is
churning through osdmaps? Did you see a large increase in inbound or
outbound network traffic on this mon following the upgrade?
Totally speculating here, but maybe there is an issue where you have
some old clients, which can't decode an incremental osdmap from a
nautilus mon, so the single mon is busy serving up these maps to the
clients.
Does the mon load decrease if you stop the osdmap churn? E.g. by
setting norebalance, if rebalancing is indeed ongoing.
Could you also share a log with debug_ms = 1 for a minute while the mon is busy on CPU?
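For the record, the two suggestions above map onto roughly these commands (a sketch, assuming a Nautilus-era ceph CLI; "$MON_ID" and the sleep duration are placeholders, not from the thread):

```shell
# Pause rebalancing so the cluster stops generating new osdmaps
# while diagnosing (reversible):
ceph osd set norebalance

# Raise messenger debugging on the busy mon for about a minute,
# then turn it back down ($MON_ID is whichever mon is busy):
ceph daemon mon.$MON_ID config set debug_ms 1
sleep 60
ceph daemon mon.$MON_ID config set debug_ms 0

# Re-enable rebalancing once done:
ceph osd unset norebalance
```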
Here are the new logs with debug_ms=1 enabled for a bit.
https://owncloud.leblancnet.us/owncloud/index.php/s/1hvtJo3s2oLPpWn
Something strange in there: one hammer client is asking
for nearly a million incremental osdmaps, seemingly every 30s:
client.131831153 at 172.16.212.55 is asking for incrementals from
1170448..1987355 (see [1])
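For the record, the size of that requested range works out to:

```shell
# Incremental osdmaps requested by client.131831153, per the log above:
echo $((1987355 - 1170448))   # prints 816907 -- "nearly a million"
```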
Can you try to evict/kill/block that client and see if your mon load drops?
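One way to go about that (a sketch only; "$MON_ID" is a placeholder, the exact addr:port/nonce must come from the sessions output, and note Nautilus still spells the blocklist command "blacklist"):

```shell
# Find the suspect client's session on the busy mon:
ceph daemon mon.$MON_ID sessions | grep 172.16.212.55

# Blacklisting the client's address stops it talking to the OSDs; to drop
# the mon session itself, stop or unmount the client on that host. The
# entity address below is illustrative -- take it from the sessions output:
ceph osd blacklist add 172.16.212.55:0/0
```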
Before you respond, just noting here ftr that I think there's a
possible issue with OSDMonitor::get_removed_snaps_range and clients
like this.
When building the incrementals, the mon will search its rocksdb for
removed snaps across those ~million missing maps.
That code path appears to have been removed from Octopus onward.