Sorry, I meant MGR of course. MDS are fine for me. But the MGRs were
failing constantly due to the prometheus module doing something funny.
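
For anyone who wants to do the same: disabling it is just the standard
mgr module command (and "enable" brings it back), e.g.

$ ceph mgr module disable prometheus
$ ceph mgr module ls    # prometheus should no longer show up under enabled_modules
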
On 26/03/2020 18:10, Paul Choi wrote:
I won't speculate further about the MDS's stability, but I do wonder
about the same thing.
There is one file served by the MDS that would cause the ceph-fuse
client to hang. It was a file that many people in the company relied
on for data updates, so very noticeable. The only fix was to fail over
the MDS.
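
(For completeness, the failover itself was nothing special; rank 0 here
is just an example:

$ ceph mds fail 0

and the standby took over.)
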
Since the free disk space dropped, I haven't heard anyone complain...
<shrug>
On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff
<janek.bevendorff@uni-weimar.de> wrote:
If there is actually a connection, then it's no wonder our MDS
kept crashing. Our Ceph has 9.2PiB of available space at the moment.
On 26/03/2020 17:32, Paul Choi wrote:
> I can't quite explain what happened, but the Prometheus endpoint
> became stable after the free disk space for the largest pool went
> substantially lower than 1PB.
> I wonder if there's some metric that exceeds the maximum size for
> some int, double, etc?
>
> -Paul
>
> On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff
> <janek.bevendorff@uni-weimar.de> wrote:
>
> I haven't seen any MGR hangs so far since I disabled the prometheus
> module. It seems like the module is not only slow, but kills the whole
> MGR when the cluster is sufficiently large, so these two issues are
> most likely connected. The issue has become much, much worse with
> 14.2.8.
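>
> If you want to keep the module but make it cheaper, its cache/scrape
> interval can apparently be raised (I haven't verified whether that
> alone avoids the hangs):
>
> $ ceph config set mgr mgr/prometheus/scrape_interval 60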
>
>
> On 23/03/2020 09:00, Janek Bevendorff wrote:
> > I am running the very latest version of Nautilus. I will try setting
> > up an external exporter today and see if that fixes anything. Our
> > cluster is somewhat large-ish with 1248 OSDs, so I expect stat
> > collection to take "some" time, but it definitely shouldn't crash the
> > MGRs all the time.
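> >
> > To see how long a single scrape actually takes, timing a manual pull
> > against the module should be enough (9283 is the default port):
> >
> > $ time curl -sS -m 300 -o /dev/null http://localhost:9283/metrics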
> >
> > On 21/03/2020 02:33, Paul Choi wrote:
> >> Hi Janek,
> >>
> >> What version of Ceph are you using?
> >> We also have a much smaller cluster running Nautilus, with no MDS.
> >> No Prometheus issues there.
> >> I won't speculate further than this, but perhaps Nautilus doesn't
> >> have the same issue as Mimic?
> >>
> >> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
> >> <janek.bevendorff@uni-weimar.de> wrote:
> >>
> >> I think this is related to my previous post to this list about MGRs
> >> failing regularly and being overall quite slow to respond. The
> >> problem has existed before, but the new version has made it way
> >> worse. My MGRs keep dying every few hours and need to be restarted.
> >> The Prometheus plugin works, but it's pretty slow and so is the
> >> dashboard. Unfortunately, nobody seems to have a solution for this
> >> and I wonder why more people aren't complaining about this problem.
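> >>
> >> When it dies, failing over to a standby is at least quicker than a
> >> full restart, something along the lines of (with the active mgr's
> >> name taken from "ceph -s"):
> >>
> >> $ ceph mgr fail <active-mgr-name>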
> >>
> >>
> >> On 20/03/2020 19:30, Paul Choi wrote:
> >> > If I "curl http://localhost:9283/metrics" and wait long enough, I
> >> > get this - says "No MON connection". But the mons are healthy and
> >> > the cluster is functioning fine.
> >> > That said, the mons' rocksdb sizes are fairly big because there's
> >> > lots of rebalancing going on. The Prometheus endpoint hanging seems
> >> > to happen regardless of the mon size anyhow.
> >> >
> >> > mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
> >> > mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
> >> > mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
> >> > mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
> >> > mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
> >> >
> >> > # fg
> >> > curl -H "Connection: close" http://localhost:9283/metrics
> >> > <!DOCTYPE html PUBLIC
> >> >     "-//W3C//DTD XHTML 1.0 Transitional//EN"
> >> >     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> >> > <html>
> >> > <head>
> >> >     <meta http-equiv="Content-Type" content="text/html;
> >> > charset=utf-8"></meta>
> >> >     <title>503 Service Unavailable</title>
> >> >     <style type="text/css">
> >> >     #powered_by {
> >> >         margin-top: 20px;
> >> >         border-top: 2px solid black;
> >> >         font-style: italic;
> >> >     }
> >> >
> >> >     #traceback {
> >> >         color: red;
> >> >     }
> >> >     </style>
> >> > </head>
> >> > <body>
> >> >     <h2>503 Service Unavailable</h2>
> >> >     <p>No MON connection</p>
> >> >     <pre id="traceback">Traceback (most recent call last):
> >> >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py",
> >> > line 670, in respond
> >> >     response.body = self.handler()
> >> >   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py",
> >> > line 217, in __call__
> >> >     self.body = self.oldhandler(*args, **kwargs)
> >> >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py",
> >> > line 61, in __call__
> >> >     return self.callable(*self.args, **self.kwargs)
> >> >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
> >> >     return self._metrics(instance)
> >> >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
> >> >     raise cherrypy.HTTPError(503, 'No MON connection')
> >> > HTTPError: (503, 'No MON connection')
> >> > </pre>
> >> >     <div id="powered_by">
> >> >       <span>
> >> >         Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a>
> >> >       </span>
> >> >     </div>
> >> > </body>
> >> > </html>
> >> >
> >> > On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi@nuro.ai> wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >> We are running Mimic 13.2.8 with our cluster, and since upgrading
> >> >> to 13.2.8 the Prometheus plugin seems to hang a lot. It used to
> >> >> respond under 10s but now it often hangs. Restarting the mgr
> >> >> processes helps temporarily but within minutes it gets stuck again.
> >> >>
> >> >> The active mgr doesn't exit when doing `systemctl stop
> >> >> ceph-mgr.target` and needs to be kill -9'ed.
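> >> >>
> >> >> Roughly the sequence that I end up running, for what it's worth:
> >> >>
> >> >> $ systemctl stop ceph-mgr.target
> >> >> $ pkill -9 -f ceph-mgr
> >> >> $ systemctl start ceph-mgr.target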
> >> >>
> >> >> Is there anything I can do to address this issue, or at least get
> >> >> better visibility into the issue?
> >> >>
> >> >> We only have a few plugins enabled:
> >> >> $ ceph mgr module ls
> >> >> {
> >> >>     "enabled_modules": [
> >> >>         "balancer",
> >> >>         "prometheus",
> >> >>         "zabbix"
> >> >>     ],
> >> >>
> >> >> 3 mgr processes, but it's a pretty large cluster (near 4000 OSDs)
> >> >> and it's a busy one with lots of rebalancing. (I don't know if a
> >> >> busy cluster would seriously affect the mgr's performance, but just
> >> >> throwing it out there)
> >> >>
> >> >>   services:
> >> >>     mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
> >> >>     mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
> >> >>     mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
> >> >>     osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
> >> >>     rgw: 4 daemons active
> >> >>
> >> >> Thanks in advance for your help,
> >> >>
> >> >> -Paul Choi
> >> >>