I am running the very latest version of Nautilus. I will try setting up
an external exporter today and see if that fixes anything. Our cluster
is fairly large with 1248 OSDs, so I expect stat collection to take
"some" time, but it definitely shouldn't crash the MGRs all the time.
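To tell whether stat collection is merely slow or actually wedged, something like the following can time a scrape with a hard timeout so a hung mgr fails the probe instead of blocking forever. This is a minimal sketch: the URL is the module's default port, but the timeout value and helper names are my own, not anything shipped with Ceph.

```python
import time
import urllib.request


def scrape_metrics(url="http://localhost:9283/metrics", timeout=10):
    """Fetch the mgr Prometheus endpoint; raise on timeout instead of
    hanging along with the mgr. Returns (elapsed_seconds, body)."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    return time.monotonic() - start, body


def count_samples(exposition_text):
    """Count sample lines in Prometheus text exposition format,
    skipping blank lines and # HELP / # TYPE comments."""
    return sum(
        1
        for line in exposition_text.splitlines()
        if line.strip() and not line.startswith("#")
    )
```

Running `scrape_metrics()` from cron and logging the elapsed time would at least show whether latency grows steadily before the mgr dies or cuts out all at once.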
On 21/03/2020 02:33, Paul Choi wrote:
Hi Janek,
What version of Ceph are you using?
We also have a much smaller cluster running Nautilus, with no MDS. No
Prometheus issues there.
I won't speculate further than this, but perhaps Nautilus doesn't have
the same issue as Mimic?
On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
<janek.bevendorff@uni-weimar.de> wrote:
I think this is related to my previous post to this list about MGRs
failing regularly and being overall quite slow to respond. The problem
existed before, but the new version has made it way worse. My MGRs
keep dying every few hours and need to be restarted. The Prometheus
plugin works, but it's pretty slow and so is the dashboard.
Unfortunately, nobody seems to have a solution for this, and I wonder
why more people aren't complaining about this problem.
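In the meantime, capping the scrape on the Prometheus side at least keeps a hung mgr from tying up the scraper. A sketch of the relevant scrape_config follows; the job name and target hostnames are placeholders for your environment, and the intervals are only a guess at sane values for a large cluster:

```yaml
scrape_configs:
  - job_name: ceph-mgr        # placeholder name
    scrape_interval: 60s      # scrape less often on large clusters
    scrape_timeout: 30s       # fail the scrape instead of hanging with the mgr
    static_configs:
      - targets:              # list every mgr host; placeholders here
          - mgr-host-1:9283
          - mgr-host-2:9283
```

Note that Prometheus requires scrape_timeout to be no larger than scrape_interval.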
On 20/03/2020 19:30, Paul Choi wrote:
If I `curl http://localhost:9283/metrics` and wait sufficiently long, I
get this - it says "No MON connection", but the mons are healthy and the
cluster is functioning fine.

That said, the mons' rocksdb sizes are fairly big because there's lots
of rebalancing going on. The Prometheus endpoint hanging seems to
happen regardless of the mon size anyhow.
mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
# fg
curl -H "Connection: close" http://localhost:9283/metrics
<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=utf-8"></meta>
<title>503 Service Unavailable</title>
<style type="text/css">
#powered_by {
margin-top: 20px;
border-top: 2px solid black;
font-style: italic;
}
#traceback {
color: red;
}
</style>
</head>
<body>
<h2>503 Service Unavailable</h2>
<p>No MON connection</p>
<pre id="traceback">Traceback (most recent call last):
File
"/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py",
line 670,
in respond
response.body = self.handler()
File
"/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py",
line
217, in __call__
self.body = self.oldhandler(*args, **kwargs)
File
"/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py",
line 61,
in __call__
return self.callable(*self.args, **self.kwargs)
File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in
metrics
return self._metrics(instance)
File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in
_metrics
raise cherrypy.HTTPError(503, 'No MON
connection')
HTTPError: (503, 'No MON connection')
</pre>
<div id="powered_by">
<span>
Powered by <a href="http://www.cherrypy.org">CherryPy
3.5.0</a>
</span>
</div>
</body>
</html>
On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi@nuro.ai> wrote:
> Hello,
>
> We are running Mimic 13.2.8 with our cluster, and since upgrading to
> 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond
> under 10s, but now it often hangs. Restarting the mgr processes helps
> temporarily, but within minutes it gets stuck again.
>
> The active mgr doesn't exit when doing `systemctl stop ceph-mgr.target`
> and needs to be kill -9'ed.
>
> Is there anything I can do to address this issue, or at least get
> better visibility into the issue?
>
> We only have a few plugins enabled:
> $ ceph mgr module ls
> {
> "enabled_modules": [
> "balancer",
> "prometheus",
> "zabbix"
> ],
>
> 3 mgr processes, but it's a pretty large cluster (near 4000 OSDs), and
> it's a busy one with lots of rebalancing. (I don't know if a busy
> cluster would seriously affect the mgr's performance, but just
> throwing it out there.)
>
>   services:
>     mon: 5 daemons, quorum
> woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
>     mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
>     mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
>     osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
>     rgw: 4 daemons active
>
> Thanks in advance for your help,
> -Paul Choi
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io