[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

26 Mar 2020

I won't speculate further about the MDS's stability, but I do wonder about the
same thing.
There was one file served by the MDS that would cause the ceph-fuse client
to hang. It was a file that many people in the company relied on for data
updates, so it was very noticeable. The only fix was to fail over the MDS.

Since the free disk space dropped, I haven't heard anyone complain...
<shrug>

On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff <janek.bevendorff(a)uni-weimar.de> wrote:

...
 If there is actually a connection, then it's no wonder our MDS kept
 crashing. Our Ceph has 9.2PiB of available space at the moment.

 On 26/03/2020 17:32, Paul Choi wrote:

 I can't quite explain what happened, but the Prometheus endpoint became
 stable after the free disk space for the largest pool went substantially
 lower than 1PB.
 I wonder if there's some metric that exceeds the maximum size for some
 int, double, etc?
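
As a rough check on the arithmetic behind that question, the sketch below compares the two sizes mentioned in this thread, counted in bytes, against a few common numeric limits. It assumes the values are plain byte counts; nothing here shows which types the prometheus module or the MGR actually use, and `exceeded_limits` is a hypothetical helper written only for this illustration.

```python
PIB = 2 ** 50  # bytes in one pebibyte (binary)
PB = 10 ** 15  # bytes in one petabyte (decimal)

# Largest values of some common numeric types, in ascending order.
LIMITS = {
    "int32 max": 2 ** 31 - 1,
    "uint32 max": 2 ** 32 - 1,
    "double exact-integer range (2**53)": 2 ** 53,
    "int64 max": 2 ** 63 - 1,
}


def exceeded_limits(n_bytes):
    """Return the names of the limits that a byte count is larger than."""
    return [name for name, limit in LIMITS.items() if n_bytes > limit]


# The two sizes mentioned in this thread, expressed in bytes.
for label, n_bytes in [("9.2 PiB", int(9.2 * PIB)), ("1 PB", PB)]:
    print(label, n_bytes, exceeded_limits(n_bytes))

# Beyond 2**53, a double can no longer hold every integer exactly.
n = 2 ** 53 + 1
print(int(float(n)) == n)
```

Counted in bytes, 2**53 is exactly 8 PiB, so 9.2 PiB lies above the range in which a double holds every integer exactly, while 1 PB lies below it. Both are far below the int64 maximum. Whether any metric is actually stored or converted that way is not established here.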

 -Paul

 On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff <janek.bevendorff(a)uni-weimar.de> wrote:

 I haven't seen any MGR hangs so far since I disabled the prometheus
 module. It seems like the module is not only slow, but kills the whole
 MGR when the cluster is sufficiently large, so these two issues are most
 likely connected. The issue has become much, much worse with 14.2.8.

 On 23/03/2020 09:00, Janek Bevendorff wrote:
 I am running the very latest version of Nautilus. I will try setting up
 an external exporter today and see if that fixes anything. Our cluster
 is somewhat large-ish with 1248 OSDs, so I expect stat collection to
 take "some" time, but it definitely shouldn't crash the MGRs all the time.
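
One way to put a number on how long a scrape takes, rather than waiting on curl, is to time a single request with an explicit timeout. This is a minimal sketch using only the Python standard library; `time_scrape` and its `opener` parameter are hypothetical names introduced for illustration, and the URL in the usage note is the one from the curl command quoted later in this thread.

```python
import time
import urllib.error
import urllib.request


def time_scrape(url, timeout=30.0, opener=urllib.request.urlopen):
    """Fetch url once and return (outcome, seconds, body_bytes).

    outcome is the HTTP status code as a string, or the exception class
    name if no HTTP response was received (refused, timed out, ...).
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            body = resp.read()
            outcome = str(resp.status)
    except urllib.error.HTTPError as exc:
        # An HTTP error response, e.g. a 503 page from the endpoint.
        body = exc.read()
        outcome = str(exc.code)
    except OSError as exc:
        # URLError and socket timeouts are both OSError subclasses.
        body = b""
        outcome = type(exc).__name__
    return outcome, time.monotonic() - start, len(body)
```

Calling `time_scrape("http://localhost:9283/metrics", timeout=60.0)` on the active MGR host at intervals would show whether responses are slowing down or failing outright. The `opener` argument exists only so the function can be exercised without a live endpoint.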

 On 21/03/2020 02:33, Paul Choi wrote:
> Hi Janek,
>
> What version of Ceph are you using?
> We also have a much smaller cluster running Nautilus, with no MDS. No
> Prometheus issues there.
> I won't speculate further than this but perhaps Nautilus doesn't have
> the same issue as Mimic?
>
> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
> <janek.bevendorff(a)uni-weimar.de> wrote:
>
>     I think this is related to my previous post to this list about MGRs
>     failing regularly and being overall quite slow to respond. The problem
>     has existed before, but the new version has made it way worse. My MGRs
>     keep dying every few hours and need to be restarted. The Prometheus
>     plugin works, but it's pretty slow and so is the dashboard.
>     Unfortunately, nobody seems to have a solution for this and I
>     wonder why more people aren't complaining about this problem.
>
>
>     On 20/03/2020 19:30, Paul Choi wrote:
>     > If I "curl http://localhost:9283/metrics" and wait long enough, I get
>     > this - says "No MON connection". But the mons are healthy and
>     > the cluster is functioning fine.
>     > That said, the mons' rocksdb sizes are fairly big because there's lots of
>     > rebalancing going on. The Prometheus endpoint hanging seems to happen
>     > regardless of the mon size anyhow.
>     >
>     >     mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
>     >     mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
>     >     mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
>     >     mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
>     >     mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
>     >
>     > # fg
>     > curl -H "Connection: close" http://localhost:9283/metrics
>     > <!DOCTYPE html PUBLIC
>     > "-//W3C//DTD XHTML 1.0 Transitional//EN"
>     > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>     > <html>
>     > <head>
>     >     <meta http-equiv="Content-Type" content="text/html;
>     > charset=utf-8"></meta>
>     >     <title>503 Service Unavailable</title>
>     >     <style type="text/css">
>     >     #powered_by {
>     >         margin-top: 20px;
>     >         border-top: 2px solid black;
>     >         font-style: italic;
>     >     }
>     >
>     >     #traceback {
>     >         color: red;
>     >     }
>     >     </style>
>     > </head>
>     >     <body>
>     >         <h2>503 Service Unavailable</h2>
>     >         <p>No MON connection</p>
>     >         <pre id="traceback">Traceback (most recent call last):
>     >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
>     >     response.body = self.handler()
>     >   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
>     >     self.body = self.oldhandler(*args, **kwargs)
>     >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
>     >     return self.callable(*self.args, **self.kwargs)
>     >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
>     >     return self._metrics(instance)
>     >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
>     >     raise cherrypy.HTTPError(503, 'No MON connection')
>     > HTTPError: (503, 'No MON connection')
>     > </pre>
>     >     <div id="powered_by">
>     >       <span>
>     >         Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a>
>     >       </span>
>     >     </div>
>     >     </body>
>     > </html>
>     >
>     > On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi(a)nuro.ai> wrote:
>     >
>     >> Hello,
>     >>
>     >> We are running Mimic 13.2.8 with our cluster, and since upgrading to
>     >> 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond under
>     >> 10s but now it often hangs. Restarting the mgr processes helps temporarily
>     >> but within minutes it gets stuck again.
>     >>
>     >> The active mgr doesn't exit when doing `systemctl stop ceph-mgr.target`
>     >> and needs to be kill -9'ed.
>     >>
>     >> Is there anything I can do to address this issue, or at least get better
>     >> visibility into the issue?
>     >>
>     >> We only have a few plugins enabled:
>     >> $ ceph mgr module ls
>     >> {
>     >>     "enabled_modules": [
>     >>         "balancer",
>     >>         "prometheus",
>     >>         "zabbix"
>     >>     ],
>     >>
>     >> 3 mgr processes, but it's a pretty large cluster (near 4000 OSDs) and it's
>     >> a busy one with lots of rebalancing. (I don't know if a busy cluster would
>     >> seriously affect the mgr's performance, but just throwing it out there)
>     >>
>     >>   services:
>     >>     mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
>     >>     mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
>     >>     mds: cephfs-1/1/1 up  {0=woodenbox6=up:active}, 1 up:standby-replay
>     >>     osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
>     >>     rgw: 4 daemons active
>     >>
>     >> Thanks in advance for your help,
>     >>
>     >> -Paul Choi
>     >>
>     > _______________________________________________
>     > ceph-users mailing list -- ceph-users(a)ceph.io
>     > To unsubscribe send an email to ceph-users-leave(a)ceph.io
