I dug up this issue report, where the problem has been reported before:
https://tracker.ceph.com/issues/39264
Unfortunately, the issue hasn't received much (or any) attention yet, so
let's get this fixed; the prometheus module is unusable in its current
state.
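In the meantime, here is a rough stopgap sketch (untested on your setup;
it assumes the module's default port 9283 and admin access to the
cluster) for checking how slow the exporter endpoint actually is, and for
taking the module out of the picture until the bug is fixed:

  # measure how long the exporter endpoint takes to answer
  time curl -sS -o /dev/null http://localhost:9283/metrics

  # stopgap: disable the module so it can no longer stall the active mgr
  ceph mgr module disable prometheus
  # re-enable it once a fix is in place
  ceph mgr module enable prometheus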
On 23/03/2020 17:50, Janek Bevendorff wrote:
> I haven't seen any MGR hangs so far since I disabled the prometheus
> module. It seems like the module is not only slow, but also kills the whole
> MGR when the cluster is sufficiently large, so these two issues are most
> likely connected. The issue has become much, much worse with 14.2.8.
>
>
> On 23/03/2020 09:00, Janek Bevendorff wrote:
>> I am running the very latest version of Nautilus. I will try setting up
>> an external exporter today and see if that fixes anything. Our cluster
>> is somewhat large-ish with 1248 OSDs, so I expect stat collection to
>> take "some" time, but it definitely shouldn't crush the MGRs all
the time.
>>
>> On 21/03/2020 02:33, Paul Choi wrote:
>>> Hi Janek,
>>>
>>> What version of Ceph are you using?
>>> We also have a much smaller cluster running Nautilus, with no MDS. No
>>> Prometheus issues there.
>>> I won't speculate further than this, but perhaps Nautilus doesn't have
>>> the same issue as Mimic?
>>>
>>> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
>>> <janek.bevendorff(a)uni-weimar.de> wrote:
>>>
>>> I think this is related to my previous post to this list about MGRs
>>> failing regularly and being overall quite slow to respond. The problem
>>> has existed before, but the new version has made it way worse. My MGRs
>>> keep dying every few hours and need to be restarted. The Prometheus
>>> plugin works, but it's pretty slow, and so is the dashboard.
>>> Unfortunately, nobody seems to have a solution for this, and I wonder
>>> why more people aren't complaining about this problem.
>>>
>>>
>>> On 20/03/2020 19:30, Paul Choi wrote:
>>> > If I "curl
http://localhost:9283/metrics" and wait
sufficiently long
>>> > enough, I get this - says "No MON connection". But the
mons are
>>> health and
>>> > the cluster is functioning fine.
>>> > That said, the mons' rocksdb sizes are fairly big because there's
>>> > lots of rebalancing going on. The Prometheus endpoint hanging seems
>>> > to happen regardless of the mon size anyhow.
>>> >
>>> > mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
>>> > mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
>>> > mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
>>> > mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
>>> > mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
>>> >
>>> > # fg
>>> > curl -H "Connection: close" http://localhost:9283/metrics
>>> > <!DOCTYPE html PUBLIC
>>> > "-//W3C//DTD XHTML 1.0 Transitional//EN"
>>> > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>>> > <html>
>>> > <head>
>>> > <meta http-equiv="Content-Type" content="text/html;
>>> > charset=utf-8"></meta>
>>> > <title>503 Service Unavailable</title>
>>> > <style type="text/css">
>>> > #powered_by {
>>> > margin-top: 20px;
>>> > border-top: 2px solid black;
>>> > font-style: italic;
>>> > }
>>> >
>>> > #traceback {
>>> > color: red;
>>> > }
>>> > </style>
>>> > </head>
>>> > <body>
>>> > <h2>503 Service Unavailable</h2>
>>> > <p>No MON connection</p>
>>> > <pre id="traceback">Traceback (most recent call last):
>>> >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
>>> >     response.body = self.handler()
>>> >   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
>>> >     self.body = self.oldhandler(*args, **kwargs)
>>> >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
>>> >     return self.callable(*self.args, **self.kwargs)
>>> >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
>>> >     return self._metrics(instance)
>>> >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
>>> >     raise cherrypy.HTTPError(503, 'No MON connection')
>>> > HTTPError: (503, 'No MON connection')
>>> > </pre>
>>> > <div id="powered_by">
>>> > <span>
>>> > Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a>
>>> > </span>
>>> > </div>
>>> > </body>
>>> > </html>
>>> >
>>> > On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi(a)nuro.ai> wrote:
>>> >
>>> >> Hello,
>>> >>
>>> >> We are running Mimic 13.2.8 with our cluster, and since upgrading to
>>> >> 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond
>>> >> in under 10s, but now it often hangs. Restarting the mgr processes
>>> >> helps temporarily, but within minutes it gets stuck again.
>>> >>
>>> >> The active mgr doesn't exit when doing `systemctl stop ceph-mgr.target`
>>> >> and needs to be kill -9'ed.
>>> >>
>>> >> Is there anything I can do to address this issue, or at least get
>>> >> better visibility into it?
>>> >>
>>> >> We only have a few plugins enabled:
>>> >> $ ceph mgr module ls
>>> >> {
>>> >> "enabled_modules": [
>>> >> "balancer",
>>> >> "prometheus",
>>> >> "zabbix"
>>> >> ],
>>> >>
>>> >> 3 mgr processes, but it's a pretty large cluster (near 4000 OSDs)
>>> >> and it's a busy one with lots of rebalancing. (I don't know if a
>>> >> busy cluster would seriously affect the mgr's performance, but just
>>> >> throwing it out there.)
>>> >>
>>> >> services:
>>> >> mon: 5 daemons, quorum
>>> >> woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
>>> >> mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
>>> >> mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
>>> >> osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
>>> >> rgw: 4 daemons active
>>> >>
>>> >> Thanks in advance for your help,
>>> >>
>>> >> -Paul Choi
>>> >>
>>>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io