I’m actually very curious how well this is performing for you, as I’ve
definitely not seen a deployment this large. How do you use it?
What exactly do you mean? Our cluster has 11PiB capacity, of which about
15% is used at the moment (web-scale corpora and such). We have
deployed 5 MONs and 5 MGRs (both on the same hosts) and it works totally
fine overall. We have some MDS performance issues here and there, but
that's not too bad anymore after a few upstream patches and then we have
this annoying Prometheus MGR problem, which kills our MGRs reliably
after a few hours.
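
In case it helps anyone else: until there is a proper fix, our workaround
boils down to failing the MGR over to a standby when it dies and, if that
gets too frequent, disabling the module outright (the MGR name below is a
placeholder):

$ ceph mgr fail <active-mgr-name>     # hand off to a standby MGR
$ ceph mgr module disable prometheus  # stop the exporter from taking the MGR down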
>
>> On Mar 27, 2020, at 11:47 AM, shubjero <shubjero(a)gmail.com> wrote:
>>
>> I've reported stability problems with ceph-mgr with the prometheus plugin
>> enabled on all versions we ran in production, which were several
>> releases of Luminous and Mimic. Our solution was to disable the
>> prometheus exporter; I am using Zabbix instead. Our cluster is 1404
>> OSDs in size with about 9PB raw and around 35% utilization.
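>>
>> For anyone wanting to do the same, the swap is only a few commands. The
>> Zabbix server name below is a placeholder, and zabbix_sender needs to be
>> installed on the MGR hosts:
>>
>> $ ceph mgr module disable prometheus
>> $ ceph mgr module enable zabbix
>> $ ceph zabbix config-set zabbix_host zabbix.example.com
>> $ ceph zabbix send    # push stats once to verify the setup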
>>
>> On Fri, Mar 27, 2020 at 4:26 AM Janek Bevendorff
>> <janek.bevendorff(a)uni-weimar.de> wrote:
>>> Sorry, I meant MGR of course. MDS are fine for me. But the MGRs were
>>> failing constantly due to the prometheus module doing something funny.
>>>
>>>
>>> On 26/03/2020 18:10, Paul Choi wrote:
>>>> I won't speculate further about the MDS's stability, but I do wonder
>>>> about the same thing.
>>>> There is one file served by the MDS that would cause the ceph-fuse
>>>> client to hang. It was a file that many people in the company relied
>>>> on for data updates, so very noticeable. The only fix was to fail over
>>>> the MDS.
>>>>
>>>> Since the free disk space dropped, I haven't heard anyone
>>>> complain... <shrug>
>>>>
>>>> On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff
>>>> <janek.bevendorff(a)uni-weimar.de> wrote:
>>>>
>>>> If there is actually a connection, then it's no wonder our MDS kept
>>>> crashing. Our Ceph has 9.2PiB of available space at the moment.
>>>>
>>>>
>>>> On 26/03/2020 17:32, Paul Choi wrote:
>>>>> I can't quite explain what happened, but the Prometheus endpoint
>>>>> became stable after the free disk space for the largest pool went
>>>>> substantially lower than 1PB.
>>>>> I wonder if there's some metric that exceeds the maximum size for
>>>>> some int, double, etc.?
>>>>>
>>>>> -Paul
>>>>>
>>>>> On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff
>>>>> <janek.bevendorff(a)uni-weimar.de> wrote:
>>>>>
>>>>> I haven't seen any MGR hangs so far since I disabled the prometheus
>>>>> module. It seems like the module is not only slow, but kills the whole
>>>>> MGR when the cluster is sufficiently large, so these two issues are
>>>>> most likely connected. The issue has become much, much worse with 14.2.8.
>>>>>
>>>>>
>>>>> On 23/03/2020 09:00, Janek Bevendorff wrote:
>>>>>> I am running the very latest version of Nautilus. I will try setting
>>>>>> up an external exporter today and see if that fixes anything. Our
>>>>>> cluster is somewhat large-ish with 1248 OSDs, so I expect stat
>>>>>> collection to take "some" time, but it definitely shouldn't crash the
>>>>>> MGRs all the time.
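>>>>>>
>>>>>> In the meantime, a more generous scrape window at least keeps
>>>>>> Prometheus from piling concurrent requests onto an already slow MGR.
>>>>>> A minimal scrape config sketch, assuming the default exporter port
>>>>>> 9283 and a placeholder MGR hostname:
>>>>>>
>>>>>> scrape_configs:
>>>>>>   - job_name: 'ceph'
>>>>>>     scrape_interval: 60s
>>>>>>     scrape_timeout: 55s
>>>>>>     static_configs:
>>>>>>       - targets: ['mgr-host:9283']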
>>>>>> On 21/03/2020 02:33, Paul Choi wrote:
>>>>>>> Hi Janek,
>>>>>>>
>>>>>>> What version of Ceph are you using?
>>>>>>> We also have a much smaller cluster running Nautilus, with no MDS.
>>>>>>> No Prometheus issues there.
>>>>>>> I won't speculate further than this but perhaps Nautilus
>>>>> doesn't have
>>>>>>> the same issue as Mimic?
>>>>>>>
>>>>>>> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
>>>>>>> <janek.bevendorff(a)uni-weimar.de> wrote:
>>>>>>> I think this is related to my previous post to this list about MGRs
>>>>>>> failing regularly and being overall quite slow to respond. The
>>>>>>> problem has existed before, but the new version has made it way
>>>>>>> worse. My MGRs keep dying every few hours and need to be restarted.
>>>>>>> The Prometheus plugin works, but it's pretty slow and so is the
>>>>>>> dashboard. Unfortunately, nobody seems to have a solution for this,
>>>>>>> and I wonder why not more people are complaining about this problem.
>>>>>>>
>>>>>>>
>>>>>>> On 20/03/2020 19:30, Paul Choi wrote:
>>>>>>>> If I "curl
http://localhost:9283/metrics" and
wait
>>>>> sufficiently long
>>>>>>>> enough, I get this - says "No MON connection".
But
>>>>> the mons are
>>>>>>> health and
>>>>>>>> the cluster is functioning fine.
>>>>>>>> That said, the mons' rocksdb sizes are fairly big
>>>>> because
>>>>>>> there's lots of
>>>>>>>> rebalancing going on. The Prometheus endpoint
>>>>> hanging seems to
>>>>>>> happen
>>>>>>>> regardless of the mon size anyhow.
>>>>>>>>
>>>>>>>> mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>> mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>> mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>> mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>> mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>> # fg
>>>>>>>> curl -H "Connection: close" http://localhost:9283/metrics
>>>>>>>> <!DOCTYPE html PUBLIC
>>>>>>>> "-//W3C//DTD XHTML 1.0 Transitional//EN"
>>>>>>>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>>>>>>>> <html>
>>>>>>>> <head>
>>>>>>>>     <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
>>>>>>>>     <title>503 Service Unavailable</title>
>>>>>>>>     <style type="text/css">
>>>>>>>>     #powered_by {
>>>>>>>>         margin-top: 20px;
>>>>>>>>         border-top: 2px solid black;
>>>>>>>>         font-style: italic;
>>>>>>>>     }
>>>>>>>>     #traceback {
>>>>>>>>         color: red;
>>>>>>>>     }
>>>>>>>>     </style>
>>>>>>>> </head>
>>>>>>>> <body>
>>>>>>>> <h2>503 Service Unavailable</h2>
>>>>>>>> <p>No MON connection</p>
>>>>>>>> <pre id="traceback">Traceback (most recent call last):
>>>>>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
>>>>>>>>     response.body = self.handler()
>>>>>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
>>>>>>>>     self.body = self.oldhandler(*args, **kwargs)
>>>>>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
>>>>>>>>     return self.callable(*self.args, **self.kwargs)
>>>>>>>>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
>>>>>>>>     return self._metrics(instance)
>>>>>>>>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
>>>>>>>>     raise cherrypy.HTTPError(503, 'No MON connection')
>>>>>>>> HTTPError: (503, 'No MON connection')
>>>>>>>> </pre>
>>>>>>>> <div id="powered_by">
>>>>>>>>     <span>Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a></span>
>>>>>>>> </div>
>>>>>>>> </body>
>>>>>>>> </html>
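>>>>>>>>
>>>>>>>> As an aside: if the mon store sizes above ever become a real
>>>>>>>> concern, the stores can be compacted once the rebalancing settles,
>>>>>>>> e.g.:
>>>>>>>>
>>>>>>>> $ ceph tell mon.woodenbox0 compact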
>>>>>>>>
>>>>>>>> On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi(a)nuro.ai> wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> We are running Mimic 13.2.8 with our cluster, and since upgrading
>>>>>>>>> to 13.2.8 the Prometheus plugin seems to hang a lot. It used to
>>>>>>>>> respond under 10s, but now it often hangs. Restarting the mgr
>>>>>>>>> processes helps temporarily, but within minutes it gets stuck again.
>>>>>>>>>
>>>>>>>>> The active mgr doesn't exit when doing `systemctl stop
>>>>>>>>> ceph-mgr.target` and needs to be kill -9'ed.
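>>>>>>>>>
>>>>>>>>> For the record, this is roughly the sequence each time (the unit
>>>>>>>>> name assumes the stock systemd layout on the MGR host):
>>>>>>>>>
>>>>>>>>> $ systemctl stop ceph-mgr.target                     # hangs on the active mgr
>>>>>>>>> $ systemctl kill -s SIGKILL ceph-mgr@$(hostname -s)  # force it down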
>>>>>>>>>
>>>>>>>>> Is there anything I can do to address this issue, or at least get
>>>>>>>>> better visibility into it?
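>>>>>>>>>
>>>>>>>>> One option might be cranking up mgr debug logging via the admin
>>>>>>>>> socket while it's wedged (very verbose, so revert afterwards), but
>>>>>>>>> I'd welcome better ideas:
>>>>>>>>>
>>>>>>>>> $ ceph daemon mgr.woodenbox2 config set debug_mgr 20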
>>>>>>>>>
>>>>>>>>> We only have a few plugins enabled:
>>>>>>>>> $ ceph mgr module ls
>>>>>>>>> {
>>>>>>>>> "enabled_modules": [
>>>>>>>>> "balancer",
>>>>>>>>> "prometheus",
>>>>>>>>> "zabbix"
>>>>>>>>> ],
>>>>>>>>>
>>>>>>>>> 3 mgr processes, but it's a pretty large cluster (near 4000 OSDs)
>>>>>>>>> and it's a busy one with lots of rebalancing. (I don't know if a
>>>>>>>>> busy cluster would seriously affect the mgr's performance, but just
>>>>>>>>> throwing it out there.)
>>>>>>>>> services:
>>>>>>>>>   mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
>>>>>>>>>   mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
>>>>>>>>>   mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
>>>>>>>>>   osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
>>>>>>>>>   rgw: 4 daemons active
>>>>>>>>>
>>>>>>>>> Thanks in advance for your help,
>>>>>>>>>
>>>>>>>>> -Paul Choi
>>>>>>>>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io
>> To unsubscribe send an email to ceph-users-leave(a)ceph.io