From janek.bevendorff@uni-weimar.de Fri Mar 20 22:23:02 2020
From: Janek Bevendorff <janek.bevendorff@uni-weimar.de>
To: ceph-users@ceph.io
Subject: [ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic
Date: Fri, 20 Mar 2020 23:22:57 +0100
In-Reply-To: <CALwDB-eMeuZogh+z9xu1jzPyaJ=L6zR9Z5nYqwOk6ARybBhcnQ@mail.gmail.com>

I think this is related to my previous post to this list about MGRs
failing regularly and being overall quite slow to respond. The problem
existed before, but the new version has made it much worse. My MGRs
keep dying every few hours and need to be restarted. The Prometheus
plugin works, but it is pretty slow, and so is the dashboard.
Unfortunately, nobody seems to have a solution for this, and I wonder
why more people aren't complaining about this problem.
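For anyone stuck in the same loop, this is roughly the restart dance I
go through when the active MGR wedges (a sketch only; "woodenbox2" is a
placeholder for whatever `ceph -s` reports as your active mgr):

    # hand the active role over to a standby mgr (placeholder name)
    ceph mgr fail woodenbox2
    # then restart the wedged daemon on its host; as Paul notes below,
    # it may ignore a clean stop and need a SIGKILL first
    systemctl restart ceph-mgr.target
    # confirm a standby has taken over as active
    ceph -s | grep mgr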
On 20/03/2020 19:30, Paul Choi wrote:
> If I "curl http://localhost:9283/metrics" and wait sufficiently long,
> I get this - it says "No MON connection". But the mons are healthy and
> the cluster is functioning fine.
> That said, the mons' RocksDB sizes are fairly big because there's lots
> of rebalancing going on. The Prometheus endpoint hanging seems to happen
> regardless of the mon store size anyhow.
>
> mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
> mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
> mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
> mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
> mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
>
> # fg
> curl -H "Connection: close" http://localhost:9283/metrics
>
> 503 Service Unavailable
>
> No MON connection
>
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
>     response.body = self.handler()
>   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
>     self.body = self.oldhandler(*args, **kwargs)
>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
>     return self.callable(*self.args, **self.kwargs)
>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
>     return self._metrics(instance)
>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
>     raise cherrypy.HTTPError(503, 'No MON connection')
> HTTPError: (503, 'No MON connection')
>
> Powered by CherryPy 3.5.0
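(For what it's worth, an easy way to tell a hung endpoint from a merely
slow one is to put a hard cap on the scrape and print the timing; a
sketch, assuming the prometheus module is on its default port 9283:

    # give up after 30s, report HTTP status and total time
    curl --max-time 30 -sS -o /dev/null \
         -w 'status=%{http_code} time=%{time_total}s\n' \
         http://localhost:9283/metrics

A 503 that comes back quickly points at the "No MON connection" state
above, while a timeout suggests the thread serving /metrics is stuck.)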
>
> On Fri, Mar 20, 2020 at 6:33 AM Paul Choi wrote:
>
>> Hello,
>>
>> We are running Mimic 13.2.8 with our cluster, and since upgrading to
>> 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond
>> under 10s, but now it often hangs. Restarting the mgr processes helps
>> temporarily, but within minutes it gets stuck again.
>>
>> The active mgr doesn't exit when doing `systemctl stop ceph-mgr.target`
>> and needs to be kill -9'ed.
>>
>> Is there anything I can do to address this issue, or at least get
>> better visibility into it?
>>
>> We only have a few plugins enabled:
>> $ ceph mgr module ls
>> {
>>     "enabled_modules": [
>>         "balancer",
>>         "prometheus",
>>         "zabbix"
>>     ],
>>
>> 3 mgr processes, but it's a pretty large cluster (nearly 4000 OSDs),
>> and a busy one with lots of rebalancing. (I don't know whether a busy
>> cluster would seriously affect the mgr's performance, but I'm just
>> throwing it out there.)
>>
>>   services:
>>     mon: 5 daemons, quorum
>> woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
>>     mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
>>     mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
>>     osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
>>     rgw: 4 daemons active
>>
>> Thanks in advance for your help,
>>
>> -Paul Choi
>>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io