ceph-mgr high CPU utilization - ceph-users

1 May 2020

I'm wondering if anyone still sees issues with ceph-mgr using CPU and 
being unresponsive even in recent Nautilus releases.  We upgraded our 
largest cluster from Mimic to Nautilus (14.2.8) recently - it has about 
3500 OSDs.  Now ceph-mgr is constantly at 100-200% CPU (1-2 cores), and 
becomes unresponsive after a few minutes.  The finisher-Mgr queue length 
grows (I've seen it at over 100k) - similar symptoms as seen with 
earlier Nautilus releases by many.  This is what it looks like after an 
hour of running:

     "finisher-Mgr": {
         "queue_len": 66078,
         "complete_latency": {
             "avgcount": 21,
             "sum": 2098.408767721,
             "avgtime": 99.924227034
         }
     },

We have a pretty vanilla manager config, only the balancer is enabled in 
upmap mode.  Here are the enabled modules:

     "always_on_modules": [
         "balancer",
         "crash",
         "devicehealth",
         "orchestrator_cli",
         "progress",
         "rbd_support",
         "status",
         "volumes"
     ],
     "enabled_modules": [
         "restful"
     ],

Any ideas or outstanding issues in this area?

Andras