If I "curl http://localhost:9283/metrics" and wait long enough, I get the
response below, which says "No MON connection". But the mons are healthy and
the cluster is functioning fine.
That said, the mons' RocksDB stores are fairly large because there's a lot of
rebalancing going on. The Prometheus endpoint hangs regardless of the mon
store size, though.
mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
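(In case it helps anyone else seeing these size warnings: once rebalancing settles, each mon can be asked to compact its store. `ceph tell mon.<id> compact` is a standard command; the mon names below are just the ones from the warnings above.)

```shell
# Compact each mon's RocksDB store once rebalancing has settled.
# Mon names taken from the mon_data_size_warn messages above.
for m in woodenbox0 woodenbox1 woodenbox2 woodenbox3 woodenbox4; do
  ceph tell mon."$m" compact
done
```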
# fg
curl -H "Connection: close" http://localhost:9283/metrics
<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=utf-8"></meta>
<title>503 Service Unavailable</title>
<style type="text/css">
#powered_by {
margin-top: 20px;
border-top: 2px solid black;
font-style: italic;
}
#traceback {
color: red;
}
</style>
</head>
<body>
<h2>503 Service Unavailable</h2>
<p>No MON connection</p>
<pre id="traceback">Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670,
in respond
response.body = self.handler()
File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line
217, in __call__
self.body = self.oldhandler(*args, **kwargs)
File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61,
in __call__
return self.callable(*self.args, **self.kwargs)
File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
return self._metrics(instance)
File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
raise cherrypy.HTTPError(503, 'No MON connection')
HTTPError: (503, 'No MON connection')
</pre>
<div id="powered_by">
<span>
Powered by <a href="http://www.cherrypy.org">CherryPy
3.5.0</a>
</span>
</div>
</body>
</html>
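(Side note: to avoid the open-ended wait, curl can be given a hard timeout. `--max-time` caps the whole request and `-w '%{http_code}'` prints just the status code; the 10-second value here is arbitrary.)

```shell
# Fail fast instead of hanging on a stuck mgr endpoint.
curl --max-time 10 -sS -o /dev/null -w '%{http_code}\n' \
  http://localhost:9283/metrics || echo "timed out or failed"
```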
On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi(a)nuro.ai> wrote:
Hello,
We are running Mimic 13.2.8 on our cluster, and since upgrading to 13.2.8 the
Prometheus plugin seems to hang a lot. It used to respond in under 10s, but
now it often hangs. Restarting the mgr processes helps temporarily, but
within minutes it gets stuck again.
The active mgr doesn't exit when doing `systemctl stop ceph-mgr.target` and
needs to be kill -9'ed.
Is there anything I can do to address this issue, or at least get better
visibility into the issue?
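(For visibility, this is what I've been trying: raising `debug_mgr` and querying the admin socket are standard Ceph mechanisms, though `mgr.woodenbox2` is just this cluster's active mgr; substitute your own daemon name.)

```shell
# Temporarily raise mgr debug logging via the admin socket (20 is very verbose).
ceph daemon mgr.woodenbox2 config set debug_mgr 20

# Dump the mgr's performance counters to see whether it is making progress.
ceph daemon mgr.woodenbox2 perf dump
```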
We only have a few plugins enabled:
$ ceph mgr module ls
{
"enabled_modules": [
"balancer",
"prometheus",
"zabbix"
],
We run 3 mgr processes, but it's a pretty large cluster (nearly 4,000 OSDs)
and a busy one with lots of rebalancing. (I don't know whether a busy cluster
would seriously affect the mgr's performance, but I'm throwing it out there.)
services:
mon: 5 daemons, quorum
woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
rgw: 4 daemons active
Thanks in advance for your help,
-Paul Choi