On Fri, Apr 9, 2021 at 9:37 PM Dan van der Ster <dan(a)vanderster.com> wrote:
On Fri, Apr 9, 2021 at 8:39 PM Robert LeBlanc <robert(a)leblancnet.us> wrote:
On Fri, Apr 9, 2021 at 11:49 AM Dan van der Ster <dan(a)vanderster.com> wrote:
Thanks. I didn't see anything ultra obvious to me.
But I did notice the nearfull warnings so I wonder if this cluster is
churning through osdmaps? Did you see a large increase in inbound or
outbound network traffic on this mon following the upgrade?
Totally speculating here, but maybe there is an issue where you have
some old clients, which can't decode an incremental osdmap from a
nautilus mon, so the single mon is busy serving up these maps to the
clients.
Does the mon load decrease if you stop the osdmap churn? E.g. by
setting norebalance, if rebalancing is indeed ongoing.
Could you also share a log with debug_ms = 1 for a minute while the mon is busy on CPU?
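For the record, the two suggestions above map onto roughly these commands (a sketch, assuming a Nautilus-era ceph CLI; "$MON_ID" and the sleep duration are placeholders, not from the thread):

```shell
# Pause rebalancing so the cluster stops generating new osdmaps
# while diagnosing (reversible):
ceph osd set norebalance

# Raise messenger debugging on the busy mon for about a minute,
# then turn it back down ($MON_ID is whichever mon is busy):
ceph daemon mon.$MON_ID config set debug_ms 1
sleep 60
ceph daemon mon.$MON_ID config set debug_ms 0

# Re-enable rebalancing once done:
ceph osd unset norebalance
```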
Here are the new logs with debug_ms=1 enabled for a bit.
https://owncloud.leblancnet.us/owncloud/index.php/s/1hvtJo3s2oLPpWn
Something strange in there: one hammer client is asking
for nearly a million incremental osdmaps, seemingly every 30s:
client.131831153 at 172.16.212.55 is asking for incrementals from
1170448..1987355 (see [1])
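For the record, the size of that requested range works out to:

```shell
# Incremental osdmaps requested by client.131831153, per the log above:
echo $((1987355 - 1170448))   # prints 816907 -- "nearly a million"
```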
Can you try to evict/kill/block that client and see if your mon load drops?
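One way to go about that (a sketch only; "$MON_ID" is a placeholder, the exact addr:port/nonce must come from the sessions output, and note Nautilus still spells the blocklist command "blacklist"):

```shell
# Find the suspect client's session on the busy mon:
ceph daemon mon.$MON_ID sessions | grep 172.16.212.55

# Blacklisting the client's address stops it talking to the OSDs; to drop
# the mon session itself, stop or unmount the client on that host. The
# entity address below is illustrative -- take it from the sessions output:
ceph osd blacklist add 172.16.212.55:0/0
```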
Before you respond, just noting here ftr that I think there's a
possible issue with OSDMonitor::get_removed_snaps_range and clients
like this.
When building the incrementals, the mon will search its rocksdb for
removed snaps across those ~million missing maps.
That code path appears to have been removed from Octopus onward.