From pchoi@nuro.ai Thu Mar 26 16:32:20 2020
From: Paul Choi
To: ceph-users@ceph.io
Subject: [ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic
Date: Thu, 26 Mar 2020 09:32:09 -0700
Message-ID:
In-Reply-To: 8fea91b8-12a4-8a8b-38a7-b014657306da@uni-weimar.de

I can't quite explain what happened, but the Prometheus endpoint became
stable after the free disk space for the largest pool dropped substantially
below 1 PB. I wonder if there's some metric that exceeds the maximum value
of some int, double, etc.? (A couple of illustrative sketches are appended
at the end of this message.)

-Paul

On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff
<janek.bevendorff(a)uni-weimar.de> wrote:

> I haven't seen any MGR hangs so far since I disabled the prometheus
> module. It seems the module is not only slow but kills the whole MGR
> when the cluster is sufficiently large, so these two issues are most
> likely connected. The issue has become much, much worse with 14.2.8.
>
>
> On 23/03/2020 09:00, Janek Bevendorff wrote:
> > I am running the very latest version of Nautilus. I will try setting up
> > an external exporter today and see if that fixes anything. Our cluster
> > is somewhat large-ish with 1248 OSDs, so I expect stat collection to
> > take "some" time, but it definitely shouldn't crash the MGRs all the
> > time.
> >
> > On 21/03/2020 02:33, Paul Choi wrote:
> >> Hi Janek,
> >>
> >> What version of Ceph are you using?
> >> We also have a much smaller cluster running Nautilus, with no MDS. No
> >> Prometheus issues there.
> >> I won't speculate further than this, but perhaps Nautilus doesn't have
> >> the same issue as Mimic?
> >>
> >> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
> >> <janek.bevendorff(a)uni-weimar.de> wrote:
> >>
> >>     I think this is related to my previous post to this list about MGRs
> >>     failing regularly and being overall quite slow to respond. The
> >>     problem has existed before, but the new version has made it way
> >>     worse. My MGRs keep dying every few hours and need to be restarted.
> >>     The Prometheus plugin works, but it's pretty slow, and so is the
> >>     dashboard. Unfortunately, nobody seems to have a solution for this,
> >>     and I wonder why more people aren't complaining about this problem.
> >>
> >>
> >>     On 20/03/2020 19:30, Paul Choi wrote:
> >>     > If I "curl http://localhost:9283/metrics" and wait sufficiently
> >>     > long enough, I get this - it says "No MON connection". But the
> >>     > mons are healthy and the cluster is functioning fine.
> >>     > That said, the mons' rocksdb sizes are fairly big because there's
> >>     > lots of rebalancing going on. The Prometheus endpoint hanging
> >>     > seems to happen regardless of the mon size anyhow.
> >>     >
> >>     > mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
> >>     > mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
> >>     > mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
> >>     > mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
> >>     > mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
> >>     >
> >>     > # fg
> >>     > curl -H "Connection: close" http://localhost:9283/metrics
> >>     >
> >>     > 503 Service Unavailable
> >>     >
> >>     > No MON connection
> >>     >
> >>     > Traceback (most recent call last):
> >>     >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
> >>     >     response.body = self.handler()
> >>     >   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
> >>     >     self.body = self.oldhandler(*args, **kwargs)
> >>     >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
> >>     >     return self.callable(*self.args, **self.kwargs)
> >>     >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
> >>     >     return self._metrics(instance)
> >>     >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
> >>     >     raise cherrypy.HTTPError(503, 'No MON connection')
> >>     > HTTPError: (503, 'No MON connection')
> >>     >
> >>     > Powered by CherryPy 3.5.0
> >>     >
> >>     > On Fri, Mar 20, 2020 at 6:33 AM Paul Choi wrote:
> >>     >
> >>     >> Hello,
> >>     >>
> >>     >> We are running Mimic 13.2.8 with our cluster, and since upgrading
> >>     >> to 13.2.8 the Prometheus plugin seems to hang a lot. It used to
> >>     >> respond under 10s, but now it often hangs. Restarting the mgr
> >>     >> processes helps temporarily, but within minutes it gets stuck
> >>     >> again.
> >>     >>
> >>     >> The active mgr doesn't exit when doing `systemctl stop
> >>     >> ceph-mgr.target` and needs to be kill -9'ed.
> >>     >>
> >>     >> Is there anything I can do to address this issue, or at least get
> >>     >> better visibility into the issue?
> >>     >>
> >>     >> We only have a few plugins enabled:
> >>     >> $ ceph mgr module ls
> >>     >> {
> >>     >>     "enabled_modules": [
> >>     >>         "balancer",
> >>     >>         "prometheus",
> >>     >>         "zabbix"
> >>     >>     ],
> >>     >>
> >>     >> 3 mgr processes, but it's a pretty large cluster (nearly 4000
> >>     >> OSDs), and it's a busy one with lots of rebalancing. (I don't know
> >>     >> if a busy cluster would seriously affect the mgr's performance,
> >>     >> but I'm just throwing it out there.)
> >>     >>
> >>     >>   services:
> >>     >>     mon: 5 daemons, quorum
> >>     >>          woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
> >>     >>     mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
> >>     >>     mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
> >>     >>     osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
> >>     >>     rgw: 4 daemons active
> >>     >>
> >>     >> Thanks in advance for your help,
> >>     >>
> >>     >> -Paul Choi
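
To illustrate the speculation at the top of this message about a metric
exceeding the range of an int or double: Prometheus exposition values end up
as 64-bit floats, which represent integers exactly only up to 2^53. The
sketch below is purely illustrative and is not code from the ceph-mgr
prometheus module; the byte counts are made up, and it only shows where that
precision limit sits relative to a ~1 PB figure.

#!/usr/bin/env python3
# Illustrative only, not ceph-mgr code: shows where float64 (the type a
# Prometheus scrape ultimately sees) stops representing integers exactly.

PETABYTE = 10 ** 15  # 1 PB in bytes (decimal); a made-up reference value


def seen_by_scraper(raw_bytes):
    """Model a metric value after conversion to a 64-bit float."""
    return float(raw_bytes)


for raw in (
    PETABYTE,     # ~1 PB: still exactly representable as float64
    2 ** 53,      # largest integer float64 always represents exactly
    2 ** 53 + 1,  # first integer that silently loses precision
):
    seen = seen_by_scraper(raw)
    status = "exact" if int(seen) == raw else "LOSSY"
    print(f"{raw} -> {int(seen)} ({status})")

As the output shows, ~1 PB in bytes is still well below the 2^53 cutoff, so
this only illustrates where the limit would start to bite, not a diagnosis of
the behavior described above.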
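On the earlier question of getting better visibility into the hangs: a
minimal, hypothetical probe like the one below can put numbers on how long
/metrics takes to answer, or whether it times out, independently of
Prometheus itself. It assumes the mgr prometheus module is listening on its
default port 9283 on localhost, as in the curl example quoted above; the
60-second timeout is an arbitrary choice, so adjust the URL and timeout for
your environment.

#!/usr/bin/env python3
# Hypothetical standalone probe (not part of Ceph): time a single scrape of
# the mgr prometheus endpoint so hangs and 503s show up as hard numbers.
import time
import urllib.error
import urllib.request

URL = "http://localhost:9283/metrics"  # default mgr prometheus module port
TIMEOUT = 60  # seconds to wait before declaring the scrape hung

start = time.monotonic()
try:
    with urllib.request.urlopen(URL, timeout=TIMEOUT) as resp:
        body = resp.read()
        print(f"HTTP {resp.status}: {len(body)} bytes in {time.monotonic() - start:.1f}s")
except urllib.error.HTTPError as exc:  # e.g. the 503 'No MON connection' case
    print(f"HTTP {exc.code} after {time.monotonic() - start:.1f}s")
except Exception as exc:  # timeout, connection refused, ...
    print(f"scrape failed after {time.monotonic() - start:.1f}s: {exc}")

Running it in a loop on the active mgr host would show whether the slow
responses line up with rebalancing activity or with mgr failovers.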