Sorry, I meant MGR of course. MDS are fine for me. But the MGRs were
failing constantly due to the prometheus module doing something funny.
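
For anyone who wants to do the same: disabling it is just the standard
mgr module command (and "enable" brings it back), e.g.

$ ceph mgr module disable prometheus
$ ceph mgr module ls    # prometheus should no longer show up under enabled_modules
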
On 26/03/2020 18:10, Paul Choi wrote:
I won't speculate further about the MDS's stability, but I do wonder
about the same thing.
There is one file served by the MDS that would cause the ceph-fuse
client to hang. It was a file that many people in the company relied
on for data updates, so very noticeable. The only fix was to fail over
the MDS.
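
(For completeness, the failover itself was nothing special; rank 0 here
is just an example:

$ ceph mds fail 0

and the standby took over.)
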
Since the free disk space dropped, I haven't heard anyone complain...
<shrug>
On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff
<janek.bevendorff@uni-weimar.de> wrote:
If there is actually a connection, then it's no wonder our MDS
kept crashing. Our Ceph has 9.2PiB of available space at the moment.
On 26/03/2020 17:32, Paul Choi wrote:
> I can't quite explain what happened, but the Prometheus endpoint
> became stable after the free disk space for the largest pool went
> substantially lower than 1PB.
> I wonder if there's some metric that exceeds the maximum size for
> some int, double, etc?
>
> -Paul
>
> On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff
> <janek.bevendorff@uni-weimar.de> wrote:
>
> I haven't seen any MGR hangs so far since I disabled the prometheus
> module. It seems like the module is not only slow, but kills the whole
> MGR when the cluster is sufficiently large, so these two issues are
> most likely connected. The issue has become much, much worse with
> 14.2.8.
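>
> If you want to keep the module but make it cheaper, its cache/scrape
> interval can apparently be raised (I haven't verified whether that
> alone avoids the hangs):
>
> $ ceph config set mgr mgr/prometheus/scrape_interval 60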
>
>
> On 23/03/2020 09:00, Janek Bevendorff wrote:
> > I am running the very latest version of Nautilus. I will try setting
> > up an external exporter today and see if that fixes anything. Our
> > cluster is somewhat large-ish with 1248 OSDs, so I expect stat
> > collection to take "some" time, but it definitely shouldn't crash the
> > MGRs all the time.
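> >
> > To see how long a single scrape actually takes, timing a manual pull
> > against the module should be enough (9283 is the default port):
> >
> > $ time curl -sS -m 300 -o /dev/null http://localhost:9283/metrics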
> >
> > On 21/03/2020 02:33, Paul Choi wrote:
> >> Hi Janek,
> >>
> >> What version of Ceph are you using?
> >> We also have a much smaller cluster running Nautilus, with no MDS.
> >> No Prometheus issues there.
> >> I won't speculate further than this, but perhaps Nautilus doesn't
> >> have the same issue as Mimic?
> >>
> >> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
> >> <janek.bevendorff@uni-weimar.de> wrote:
> >>
> >> I think this is related to my previous post to this list about MGRs
> >> failing regularly and being overall quite slow to respond. The
> >> problem has existed before, but the new version has made it way
> >> worse. My MGRs keep dying every few hours and need to be restarted.
> >> The Prometheus plugin works, but it's pretty slow and so is the
> >> dashboard. Unfortunately, nobody seems to have a solution for this
> >> and I wonder why more people aren't complaining about this problem.
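> >>
> >> When it dies, failing over to a standby is at least quicker than a
> >> full restart, something along the lines of (with the active mgr's
> >> name taken from "ceph -s"):
> >>
> >> $ ceph mgr fail <active-mgr-name>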
> >>
> >>
> >> On 20/03/2020 19:30, Paul Choi wrote:
> >> > If I "curl http://localhost:9283/metrics" and wait long enough, I
> >> > get this - says "No MON connection". But the mons are healthy and
> >> > the cluster is functioning fine.
> >> > That said, the mons' rocksdb sizes are fairly big because there's
> >> > lots of rebalancing going on. The Prometheus endpoint hanging seems
> >> > to happen regardless of the mon size anyhow.
> >> >
> >> > mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
> >> > mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
> >> > mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
> >> > mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
> >> > mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
> >> >
> >> > # fg
> >> > curl -H "Connection: close" http://localhost:9283/metrics
> >> > <!DOCTYPE html PUBLIC
> >> >     "-//W3C//DTD XHTML 1.0 Transitional//EN"
> >> >     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> >> > <html>
> >> > <head>
> >> >     <meta http-equiv="Content-Type" content="text/html;
> >> > charset=utf-8"></meta>
> >> >     <title>503 Service Unavailable</title>
> >> >     <style type="text/css">
> >> >     #powered_by {
> >> >         margin-top: 20px;
> >> >         border-top: 2px solid black;
> >> >         font-style: italic;
> >> >     }
> >> >
> >> >     #traceback {
> >> >         color: red;
> >> >     }
> >> >     </style>
> >> > </head>
> >> > <body>
> >> >     <h2>503 Service Unavailable</h2>
> >> >     <p>No MON connection</p>
> >> >     <pre id="traceback">Traceback (most recent call last):
> >> >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py",
> >> > line 670, in respond
> >> >     response.body = self.handler()
> >> >   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py",
> >> > line 217, in __call__
> >> >     self.body = self.oldhandler(*args, **kwargs)
> >> >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py",
> >> > line 61, in __call__
> >> >     return self.callable(*self.args, **self.kwargs)
> >> >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
> >> >     return self._metrics(instance)
> >> >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
> >> >     raise cherrypy.HTTPError(503, 'No MON connection')
> >> > HTTPError: (503, 'No MON connection')
> >> > </pre>
> >> >     <div id="powered_by">
> >> >       <span>
> >> >         Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a>
> >> >       </span>
> >> >     </div>
> >> > </body>
> >> > </html>
> >> >
> >> > On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi@nuro.ai> wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >> We are running Mimic 13.2.8 with our cluster, and since upgrading
> >> >> to 13.2.8 the Prometheus plugin seems to hang a lot. It used to
> >> >> respond under 10s but now it often hangs. Restarting the mgr
> >> >> processes helps temporarily but within minutes it gets stuck again.
> >> >>
> >> >> The active mgr doesn't exit when doing `systemctl stop
> >> >> ceph-mgr.target` and needs to be kill -9'ed.
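> >> >>
> >> >> Roughly the sequence that I end up running, for what it's worth:
> >> >>
> >> >> $ systemctl stop ceph-mgr.target
> >> >> $ pkill -9 -f ceph-mgr
> >> >> $ systemctl start ceph-mgr.target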
> >> >>
> >> >> Is there anything I can do to address this issue, or at least get
> >> >> better visibility into the issue?
> >> >>
> >> >> We only have a few plugins enabled:
> >> >> $ ceph mgr module ls
> >> >> {
> >> >>     "enabled_modules": [
> >> >>         "balancer",
> >> >>         "prometheus",
> >> >>         "zabbix"
> >> >>     ],
> >> >>
> >> >> 3 mgr processes, but it's a pretty large cluster (near 4000 OSDs)
> >> >> and it's a busy one with lots of rebalancing. (I don't know if a
> >> >> busy cluster would seriously affect the mgr's performance, but just
> >> >> throwing it out there)
> >> >>
> >> >>   services:
> >> >>     mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
> >> >>     mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
> >> >>     mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
> >> >>     osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
> >> >>     rgw: 4 daemons active
> >> >>
> >> >> Thanks in advance for your help,
> >> >>
> >> >> -Paul Choi
> >> >>