Sorry, I meant MGR of course. MDS are fine for me. But the MGRs were
failing constantly due to the prometheus module doing something funny.
On 26/03/2020 18:10, Paul Choi wrote:
I won't speculate further about the MDS's stability, but I do wonder about
the same thing.
There is one file served by the MDS that would cause the ceph-fuse
client to hang. It was a file that many people in the company relied
on for data updates, so very noticeable. The only fix was to fail over
the MDS.
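
For reference, the failover itself was just something along these lines (rank 0
assumed here; `ceph fs status` shows the actual rank):

    ceph fs status     # identify the stuck active MDS and its rank
    ceph mds fail 0    # fail that rank so the standby takes over
    ceph -s            # confirm the standby went active; the hung client recovers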
Since the free disk space dropped, I haven't heard anyone complain...
<shrug>
On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff
<janek.bevendorff@uni-weimar.de> wrote:
If there is actually a connection, then it's no wonder our MDS
kept crashing. Our Ceph has 9.2PiB of available space at the moment.
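
If the overflow theory below holds, the numbers at least line up: assuming the
metric in question is a raw byte count, 9.2 PiB sits just above the largest
integer a double can represent exactly:

    echo $(( 2 ** 53 ))            # 9007199254740992 bytes = exactly 8 PiB
    echo $(( 92 * 2 ** 50 / 10 ))  # ~9.2 PiB in bytes: 10358279142952140

so a float/double-typed counter would start losing precision right around our
pool size, whereas "well under 1 PB" stays comfortably below that limit.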
On 26/03/2020 17:32, Paul Choi wrote:
> I can't quite explain what happened, but the Prometheus endpoint
> became stable after the free disk space for the largest pool went
> substantially lower than 1PB.
> I wonder if there's some metric that exceeds the maximum value of some
> int, double, etc.?
>
> -Paul
>
> On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff
> <janek.bevendorff@uni-weimar.de> wrote:
>
> I haven't seen any MGR hangs so far since I disabled the prometheus
> module. It seems like the module is not only slow, but kills the whole
> MGR when the cluster is sufficiently large, so these two issues are
> most likely connected. The issue has become much, much worse with 14.2.8.
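>
> For anyone who wants to try the same workaround, the module can be toggled
> at runtime:
>
>     ceph mgr module disable prometheus   # stops serving /metrics on :9283
>     ceph mgr module ls                   # "prometheus" should leave enabled_modules
>     # ceph mgr module enable prometheus  # to re-enable once a fix is out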
>
>
> On 23/03/2020 09:00, Janek Bevendorff wrote:
>> I am running the very latest version of Nautilus. I will try setting up
>> an external exporter today and see if that fixes anything. Our cluster
>> is somewhat large-ish with 1248 OSDs, so I expect stat collection to
>> take "some" time, but it definitely shouldn't crash the MGRs all the time.
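>>
>> (To get a feel for how long a single scrape actually takes against the
>> built-in exporter, a quick timing check from the active mgr host works:
>>
>>     curl -sS -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' http://localhost:9283/metrics
>>
>> anything approaching the Prometheus scrape_timeout spells trouble.)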
>>
>> On 21/03/2020 02:33, Paul Choi wrote:
>>> Hi Janek,
>>>
>>> What version of Ceph are you using?
>>> We also have a much smaller cluster running Nautilus, with no MDS. No
>>> Prometheus issues there.
>>> I won't speculate further than this but perhaps Nautilus doesn't have
>>> the same issue as Mimic?
>>>
>>> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
>>> <janek.bevendorff@uni-weimar.de> wrote:
>>>
>>> I think this is related to my previous post to this list about MGRs
>>> failing regularly and being overall quite slow to respond. The problem
>>> has existed before, but the new version has made it way worse. My MGRs
>>> keep dying every few hours and need to be restarted. The Prometheus
>>> plugin works, but it's pretty slow and so is the dashboard.
>>> Unfortunately, nobody seems to have a solution for this, and I wonder
>>> why not more people are complaining about this problem.
>>>
>>>
>>> On 20/03/2020 19:30, Paul Choi wrote:
>>>> If I "curl http://localhost:9283/metrics" and wait long enough, I get
>>>> this - it says "No MON connection". But the mons are healthy and the
>>>> cluster is functioning fine.
>>>> That said, the mons' rocksdb sizes are fairly big because there's lots
>>>> of rebalancing going on. The Prometheus endpoint hanging seems to
>>>> happen regardless of the mon size anyhow.
>>>>
>>>> mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
>>>> mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
>>>> mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
>>>> mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
>>>> mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
>>>>
>>>> # fg
>>>> curl -H "Connection: close" http://localhost:9283/metrics
>>>> <!DOCTYPE html PUBLIC
>>>> "-//W3C//DTD XHTML 1.0 Transitional//EN"
>>>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>>>> <html>
>>>> <head>
>>>> <meta http-equiv="Content-Type" content="text/html;
>>>> charset=utf-8"></meta>
>>>> <title>503 Service Unavailable</title>
>>>> <style type="text/css">
>>>> #powered_by {
>>>> margin-top: 20px;
>>>> border-top: 2px solid black;
>>>> font-style: italic;
>>>> }
>>>>
>>>> #traceback {
>>>> color: red;
>>>> }
>>>> </style>
>>>> </head>
>>>> <body>
>>>> <h2>503 Service Unavailable</h2>
>>>> <p>No MON connection</p>
>>>> <pre id="traceback">Traceback (most recent call last):
>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
>>>>     response.body = self.handler()
>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
>>>>     self.body = self.oldhandler(*args, **kwargs)
>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
>>>>     return self.callable(*self.args, **self.kwargs)
>>>>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
>>>>     return self._metrics(instance)
>>>>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
>>>>     raise cherrypy.HTTPError(503, 'No MON connection')
>>>> HTTPError: (503, 'No MON connection')
>>>> </pre>
>>>> <div id="powered_by">
>>>> <span>
>>>> Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a>
>>>> </span>
>>>> </div>
>>>> </body>
>>>> </html>
>>>>
>>>> On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi@nuro.ai> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We are running Mimic 13.2.8 on our cluster, and since upgrading to
>>>>> 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond
>>>>> in under 10s, but now it often hangs. Restarting the mgr processes
>>>>> helps temporarily, but within minutes it gets stuck again.
>>>>>
>>>>> The active mgr doesn't exit when doing `systemctl stop ceph-mgr.target`
>>>>> and needs to be kill -9'ed.
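>>>>> Concretely, on the active mgr's host that ends up being something like:
>>>>>
>>>>>     systemctl stop ceph-mgr.target   # hangs on the stuck daemon
>>>>>     pkill -9 -f ceph-mgr             # force it down; a standby mgr takes over
>>>>>     systemctl start ceph-mgr.target  # bring the local mgr back as a standby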
>>>>>
>>>>> Is there anything I can do to address this, or at least get better
>>>>> visibility into the issue?
>>>>>
>>>>> We only have a few plugins enabled:
>>>>> $ ceph mgr module ls
>>>>> {
>>>>> "enabled_modules": [
>>>>> "balancer",
>>>>> "prometheus",
>>>>> "zabbix"
>>>>> ],
>>>>>
>>>>> 3 mgr processes, but it's a pretty large cluster (nearly 4000 OSDs) and
>>>>> it's a busy one with lots of rebalancing. (I don't know if a busy cluster
>>>>> would seriously affect the mgr's performance, but just throwing it out
>>>>> there.)
>>>>>
>>>>> services:
>>>>> mon: 5 daemons, quorum
>>>>> woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
>>>>> mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
>>>>> mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
>>>>> osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
>>>>> rgw: 4 daemons active
>>>>>
>>>>> Thanks in advance for your help,
>>>>>
>>>>> -Paul Choi
>>>>>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io