I've reported stability problems with ceph-mgr with the prometheus plugin
enabled on every version we ran in production, which included several
versions of Luminous and Mimic. Our solution was to disable the
prometheus exporter; I am using Zabbix instead. Our cluster is 1404
OSDs in size, with about 9 PB raw at around 35% utilization.
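For reference, turning off the exporter and switching to the Zabbix
sender boils down to roughly the following (the zabbix_host value is
obviously site-specific, and zabbix_sender needs to be installed on the
mgr hosts):

$ ceph mgr module disable prometheus                       # stop the mgr-side exporter
$ ceph mgr module enable zabbix                            # send metrics via zabbix_sender instead
$ ceph zabbix config-set zabbix_host zabbix.example.com    # your Zabbix server or proxy
$ ceph zabbix config-set identifier our-ceph-cluster       # host name as configured in Zabbix
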
On Fri, Mar 27, 2020 at 4:26 AM Janek Bevendorff
<janek.bevendorff(a)uni-weimar.de> wrote:
>
> Sorry, I meant MGR of course. MDS are fine for me. But the MGRs were
> failing constantly due to the prometheus module doing something funny.
>
>
> On 26/03/2020 18:10, Paul Choi wrote:
> > I won't speculate further on the MDS's stability, but I do wonder about
> > the same thing.
> > There is one file served by the MDS that would cause the ceph-fuse
> > client to hang. It was a file that many people in the company relied
> > on for data updates, so very noticeable. The only fix was to fail over
> > the MDS.
> >
> > Since the free disk space dropped, I haven't heard anyone complain...
> > <shrug>
> >
> > On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff
> > <janek.bevendorff(a)uni-weimar.de> wrote:
> >
> > If there is actually a connection, then it's no wonder our MDS
> > kept crashing. Our Ceph has 9.2PiB of available space at the moment.
> >
> >
> > On 26/03/2020 17:32, Paul Choi wrote:
> >> I can't quite explain what happened, but the Prometheus endpoint
> >> became stable after the free disk space for the largest pool dropped
> >> substantially below 1 PB.
> >> I wonder if there's some metric that overflows the maximum value of
> >> some int, double, etc.?
> >>
> >> -Paul
> >>
> >> On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff
> >> <janek.bevendorff(a)uni-weimar.de> wrote:
> >>
> >> I haven't seen any MGR hangs so far since I disabled the prometheus
> >> module. It seems like the module is not only slow, but kills the whole
> >> MGR when the cluster is sufficiently large, so these two issues are
> >> most likely connected. The issue has become much, much worse with 14.2.8.
> >>
> >>
> >> On 23/03/2020 09:00, Janek Bevendorff wrote:
> >> > I am running the very latest version of Nautilus. I will try setting up
> >> > an external exporter today and see if that fixes anything. Our cluster
> >> > is somewhat large-ish with 1248 OSDs, so I expect stat collection to
> >> > take "some" time, but it definitely shouldn't crash the MGRs all the time.
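> >> > (By "external exporter" I mean a standalone exporter such as the
> >> > DigitalOcean ceph_exporter rather than the mgr module; assuming its
> >> > defaults, running it would look roughly like this, exposing metrics
> >> > on port 9128:
> >> > $ docker run -d --name ceph_exporter \
> >> >       -v /etc/ceph:/etc/ceph:ro -p 9128:9128 digitalocean/ceph_exporter
> >> > )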
> >> >
> >> > On 21/03/2020 02:33, Paul Choi wrote:
> >> >> Hi Janek,
> >> >>
> >> >> What version of Ceph are you using?
> >> >> We also have a much smaller cluster running Nautilus, with no MDS.
> >> >> No Prometheus issues there.
> >> >> I won't speculate further than this, but perhaps Nautilus doesn't
> >> >> have the same issue as Mimic?
> >> >>
> >> >> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
> >> >> <janek.bevendorff(a)uni-weimar.de> wrote:
> >> >>
> >> >> I think this is related to my previous post to this list about MGRs
> >> >> failing regularly and being overall quite slow to respond. The problem
> >> >> has existed before, but the new version has made it way worse. My MGRs
> >> >> keep dying every few hours and need to be restarted. The Prometheus
> >> >> plugin works, but it's pretty slow, and so is the dashboard.
> >> >> Unfortunately, nobody seems to have a solution for this, and I wonder
> >> >> why not more people are complaining about this problem.
> >> >>
> >> >>
> >> >> On 20/03/2020 19:30, Paul Choi wrote:
> >> >> > If I "curl http://localhost:9283/metrics" and wait sufficiently long
> >> >> > enough, I get this - says "No MON connection". But the mons are
> >> >> > healthy and the cluster is functioning fine.
> >> >> > That said, the mons' rocksdb sizes are fairly big because there's
> >> >> > lots of rebalancing going on. The Prometheus endpoint hanging seems
> >> >> > to happen regardless of the mon size anyhow.
> >> >> >
> >> >> > mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
> >> >> > mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
> >> >> > mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
> >> >> > mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
> >> >> > mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
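> >> >> > (If the store size itself becomes a concern, a manual compaction of
> >> >> > a mon's rocksdb store should shrink it again once rebalancing
> >> >> > settles, e.g. for one of our mons:
> >> >> > $ ceph tell mon.woodenbox0 compact    # compact this mon's store
> >> >> > but that is unrelated to the endpoint hanging.)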
> >> >> >
> >> >> > # fg
> >> >> > curl -H "Connection: close" http://localhost:9283/metrics
> >> >> > <!DOCTYPE html PUBLIC
> >> >> > "-//W3C//DTD XHTML 1.0 Transitional//EN"
> >> >> > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> >> >> > <html>
> >> >> > <head>
> >> >> >     <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
> >> >> >     <title>503 Service Unavailable</title>
> >> >> >     <style type="text/css">
> >> >> >     #powered_by {
> >> >> >         margin-top: 20px;
> >> >> >         border-top: 2px solid black;
> >> >> >         font-style: italic;
> >> >> >     }
> >> >> >
> >> >> >     #traceback {
> >> >> >         color: red;
> >> >> >     }
> >> >> >     </style>
> >> >> > </head>
> >> >> > <body>
> >> >> >     <h2>503 Service Unavailable</h2>
> >> >> >     <p>No MON connection</p>
> >> >> >     <pre id="traceback">Traceback (most recent call last):
> >> >> >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
> >> >> >     response.body = self.handler()
> >> >> >   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
> >> >> >     self.body = self.oldhandler(*args, **kwargs)
> >> >> >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
> >> >> >     return self.callable(*self.args, **self.kwargs)
> >> >> >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
> >> >> >     return self._metrics(instance)
> >> >> >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
> >> >> >     raise cherrypy.HTTPError(503, 'No MON connection')
> >> >> > HTTPError: (503, 'No MON connection')
> >> >> > </pre>
> >> >> >     <div id="powered_by">
> >> >> >     <span>
> >> >> >     Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a>
> >> >> >     </span>
> >> >> >     </div>
> >> >> > </body>
> >> >> > </html>
> >> >> >
> >> >> > On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi(a)nuro.ai> wrote:
> >> >> >
> >> >> >> Hello,
> >> >> >>
> >> >> >> We are running Mimic 13.2.8 with our cluster, and since upgrading to
> >> >> >> 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond
> >> >> >> under 10s, but now it often hangs. Restarting the mgr processes helps
> >> >> >> temporarily, but within minutes it gets stuck again.
> >> >> >>
> >> >> >> The active mgr doesn't exit when doing `systemctl stop
> >> >> >> ceph-mgr.target` and needs to be kill -9'ed.
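> >> >> >> (Instead of kill -9, forcing a failover to a standby should also
> >> >> >> work; I believe something along these lines does it:
> >> >> >> $ ceph mgr fail woodenbox2    # woodenbox2 is our current active mgr
> >> >> >> )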
> >> >> >>
> >> >> >> Is there anything I can do to address this issue, or at least get
> >> >> >> better visibility into the issue?
> >> >> >>
> >> >> >> We only have a few plugins enabled:
> >> >> >> $ ceph mgr module ls
> >> >> >> {
> >> >> >> "enabled_modules": [
> >> >> >> "balancer",
> >> >> >> "prometheus",
> >> >> >> "zabbix"
> >> >> >> ],
> >> >> >>
> >> >> >> 3 mgr processes, but it's a pretty large cluster (near 4000 OSDs),
> >> >> >> and it's a busy one with lots of rebalancing. (I don't know if a busy
> >> >> >> cluster would seriously affect the mgr's performance, but just
> >> >> >> throwing it out there.)
> >> >> >>
> >> >> >> services:
> >> >> >>   mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
> >> >> >>   mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
> >> >> >>   mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
> >> >> >>   osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
> >> >> >> rgw: 4 daemons active
> >> >> >>
> >> >> >> Thanks in advance for your help,
> >> >> >>
> >> >> >> -Paul Choi
> >> >> >>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io