enough, I get this - says "No MON connection". But the mons are healthy
and the cluster is functioning fine.
That said, the mons' rocksdb sizes are fairly big because there's lots of
rebalancing going on. The Prometheus endpoint hanging seems to happen
regardless of the mon size anyhow.
mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
# fg
curl -H "Connection: close" http://localhost:9283/metrics
<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
<title>503 Service Unavailable</title>
<style type="text/css">
#powered_by {
margin-top: 20px;
border-top: 2px solid black;
font-style: italic;
}
#traceback {
color: red;
}
</style>
</head>
<body>
<h2>503 Service Unavailable</h2>
<p>No MON connection</p>
<pre id="traceback">Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
    return self._metrics(instance)
  File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
    raise cherrypy.HTTPError(503, 'No MON connection')
HTTPError: (503, 'No MON connection')
</pre>
<div id="powered_by">
<span>
Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a>
</span>
</div>
</body>
</html>
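
For what it's worth, a quick probe loop like the sketch below (plain shell;
the 30-second curl timeout and 10-second interval are arbitrary picks) is
enough to tell a hard hang apart from an immediate 503 from the mgr:

# rough probe: print HTTP status and elapsed seconds every 10s;
# status 000 after ~30s means curl timed out (a hang), while a fast 503
# means the mgr answered but reports no mon session
while true; do
  start=$(date +%s)
  code=$(curl -s -o /dev/null -m 30 -w '%{http_code}' http://localhost:9283/metrics)
  echo "$(date '+%F %T') status=${code} elapsed=$(( $(date +%s) - start ))s"
  sleep 10
done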
On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi@nuro.ai> wrote:
> Hello,
>
> We are running Mimic 13.2.8 with our cluster, and since upgrading to
> 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond
> under 10s but now it often hangs. Restarting the mgr processes helps
> temporarily, but within minutes it gets stuck again.
>
> The active mgr doesn't exit when doing `systemctl stop ceph-mgr.target`
> and needs to be kill -9'ed.
>
> Is there anything I can do to address this issue, or at least get
> better visibility into the issue?
>
> We only have a few plugins enabled:
> $ ceph mgr module ls
> {
> "enabled_modules": [
> "balancer",
> "prometheus",
> "zabbix"
> ],
>
> 3 mgr processes, but it's a pretty large cluster (nearly 4,000 OSDs)
> and it's a busy one with lots of rebalancing. (I don't know if a busy
> cluster would seriously affect the mgr's performance, but just
> throwing it out there.)
>
>   services:
>     mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
>     mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
>     mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
>     osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
>     rgw: 4 daemons active
>
> Thanks in advance for your help,
>
> -Paul Choi
>
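As for getting more visibility into the mgr side: one rough option
(a sketch only, assuming Mimic's centralized config store is in use;
the debug levels and log path are just examples) is to raise the mgr
and mon-client debug levels and watch the active mgr's log for
MonClient reconnect attempts:

$ ceph config set mgr debug_mgr 10
$ ceph config set mgr debug_monc 10
# on the active mgr host (woodenbox2 per the status above):
$ tail -f /var/log/ceph/ceph-mgr.woodenbox2.log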
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io