From pchoi@nuro.ai Fri Mar 20 16:33:48 2020
From: Paul Choi
To: ceph-users@ceph.io
Subject: [ceph-users] No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic
Date: Fri, 20 Mar 2020 06:33:39 -1000

Hello,

We are running Mimic 13.2.8 on our cluster, and since upgrading to 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond in under 10s, but now it often hangs. Restarting the mgr processes helps temporarily, but within minutes it gets stuck again.

The active mgr doesn't exit when doing `systemctl stop ceph-mgr.target` and needs to be kill -9'ed.

Is there anything I can do to address this issue, or at least get better visibility into it?

We only have a few plugins enabled:

$ ceph mgr module ls
{
    "enabled_modules": [
        "balancer",
        "prometheus",
        "zabbix"
    ],

There are 3 mgr processes, but it's a pretty large cluster (nearly 4000 OSDs) and a busy one with lots of rebalancing. (I don't know whether a busy cluster would seriously affect the mgr's performance, but I'm throwing it out there.)

  services:
    mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
    mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
    mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
    osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
    rgw: 4 daemons active

Thanks in advance for your help,

-Paul Choi

From pchoi@nuro.ai Fri Mar 20 18:30:28 2020
From: Paul Choi
To: ceph-users@ceph.io
Subject: [ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic
Date: Fri, 20 Mar 2020 08:30:18 -1000

If I "curl http://localhost:9283/metrics" and wait sufficiently long, I get the response below - it says "No MON connection". But the mons are healthy and the cluster is functioning fine.

That said, the mons' rocksdb sizes are fairly big because there's lots of rebalancing going on. The Prometheus endpoint hanging seems to happen regardless of the mon size anyhow.

    mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
    mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
    mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
    mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
    mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)

# fg
curl -H "Connection: close" http://localhost:9283/metrics

503 Service Unavailable

No MON connection

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
    return self._metrics(instance)
  File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
    raise cherrypy.HTTPError(503, 'No MON connection')
HTTPError: (503, 'No MON connection')

Powered by CherryPy 3.5.0
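
For anyone poking at the same thing, the sketch below is roughly how one can probe the endpoint with a hard timeout and force a failover instead of kill -9. The mgr name is just ours as an example, and I haven't verified that `ceph mgr fail` reliably unsticks the module on 13.2.8:

# probe the exporter, but give up after 30s instead of hanging forever
$ curl -sS --max-time 30 -o /dev/null -w "%{http_code} after %{time_total}s\n" http://localhost:9283/metrics

# see which mgr is active and which standbys are available
$ ceph mgr dump | grep -E '"active_name"|"name"'

# ask the active mgr to step down so a standby takes over
$ ceph mgr fail woodenbox2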

On Fri, Mar 20, 2020 at 6:33 AM Paul Choi wrote:
> Hello,
> We are running Mimic 13.2.8 on our cluster, and since upgrading to 13.2.8 the Prometheus plugin seems to hang a lot. [...]

From janek.bevendorff@uni-weimar.de Fri Mar 20 22:23:02 2020
From: Janek Bevendorff
To: ceph-users@ceph.io
Subject: [ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic
Date: Fri, 20 Mar 2020 23:22:57 +0100

I think this is related to my previous post to this list about MGRs failing regularly and being overall quite slow to respond. The problem has existed before, but the new version has made it way worse. My MGRs keep dying every few hours and need to be restarted. The Prometheus plugin works, but it's pretty slow, and so is the dashboard. Unfortunately, nobody seems to have a solution for this, and I wonder why more people aren't complaining about the problem.
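
One thing that might give a bit more visibility into what the mgr is doing when it wedges is to raise its log level for a while and follow the journal. A rough sketch, assuming the centralized config store available since Mimic (otherwise set debug_mgr in ceph.conf) and substituting your own mgr id:

# raise mgr verbosity while reproducing the hang
$ ceph config set mgr debug_mgr 4/5

# follow the active mgr's log while curling the metrics endpoint
$ journalctl -u ceph-mgr@woodenbox2 -f

# revert to the default afterwards
$ ceph config rm mgr debug_mgr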

On 20/03/2020 19:30, Paul Choi wrote:
> If I "curl http://localhost:9283/metrics" and wait sufficiently long, I get a 503 saying "No MON connection". But the mons are healthy and the cluster is functioning fine. [...]

From pchoi@nuro.ai Sat Mar 21 01:34:00 2020
From: Paul Choi
To: ceph-users@ceph.io
Subject: [ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic
Date: Fri, 20 Mar 2020 15:33:46 -1000

Hi Janek,

What version of Ceph are you using? We also have a much smaller cluster running Nautilus, with no MDS, and there are no Prometheus issues there. I won't speculate further than this, but perhaps Nautilus doesn't have the same issue as Mimic?

On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff <janek.bevendorff(a)uni-weimar.de> wrote:
> I think this is related to my previous post to this list about MGRs failing regularly and being overall quite slow to respond. [...]

From janek.bevendorff@uni-weimar.de Mon Mar 23 08:00:39 2020
From: Janek Bevendorff
To: ceph-users@ceph.io
Subject: [ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic
Date: Mon, 23 Mar 2020 09:00:33 +0100

I am running the very latest version of Nautilus. I will try setting up an external exporter today and see if that fixes anything. Our cluster is somewhat large-ish with 1248 OSDs, so I expect stat collection to take "some" time, but it definitely shouldn't crash the MGRs all the time.
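
Independent of the external exporter idea, it may be worth making sure Prometheus itself isn't cutting scrapes short while a slow mgr is still answering. Something along these lines - hostnames and timings are only an example, and 9283 is the module's default port:

# prometheus.yml (excerpt)
scrape_configs:
  - job_name: 'ceph-mgr'
    scrape_interval: 60s
    scrape_timeout: 55s
    static_configs:
      - targets: ['woodenbox0:9283', 'woodenbox1:9283', 'woodenbox2:9283']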

On 21/03/2020 02:33, Paul Choi wrote:
> Hi Janek,
> What version of Ceph are you using? [...]

From janek.bevendorff@uni-weimar.de Mon Mar 23 16:50:36 2020
From: Janek Bevendorff
To: ceph-users@ceph.io
Subject: [ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic
Date: Mon, 23 Mar 2020 17:50:31 +0100

I haven't seen any MGR hangs so far since I disabled the prometheus module. It seems like the module is not only slow, but kills the whole MGR when the cluster is sufficiently large, so these two issues are most likely connected. The issue has become much, much worse with 14.2.8.
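
For anyone who wants to repeat the experiment, toggling the module is a one-liner each way (enabled modules live in the MgrMap, so this applies to all mgr daemons at once):

# turn the module off and watch mgr stability for a while...
$ ceph mgr module disable prometheus

# ...then turn it back on
$ ceph mgr module enable prometheus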

On 23/03/2020 09:00, Janek Bevendorff wrote:
> I am running the very latest version of Nautilus. I will try setting up an external exporter today and see if that fixes anything. [...]

From janek.bevendorff@uni-weimar.de Mon Mar 23 17:06:30 2020
From: Janek Bevendorff
To: ceph-users@ceph.io
Subject: [ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic
Date: Mon, 23 Mar 2020 18:06:25 +0100

I dug up this issue report, where the problem has been reported before:

https://tracker.ceph.com/issues/39264

Unfortunately, the issue hasn't received much (or any) attention yet. So let's get this fixed; the prometheus module is unusable in its current state.

On 23/03/2020 17:50, Janek Bevendorff wrote:
> I haven't seen any MGR hangs so far since I disabled the prometheus module. It seems like the module is not only slow, but kills the whole MGR when the cluster is sufficiently large, so these two issues are most likely connected. [...]

From pchoi@nuro.ai Thu Mar 26 16:32:20 2020
From: Paul Choi
To: ceph-users@ceph.io
Subject: [ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic
Date: Thu, 26 Mar 2020 09:32:09 -0700

I can't quite explain what happened, but the Prometheus endpoint became stable after the free disk space for the largest pool went substantially lower than 1PB. I wonder if there's some metric that exceeds the maximum size for some int, double, etc.?

-Paul
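
PS: A quick back-of-the-envelope check on the int/double theory (plain arithmetic only, not a claim about what the module actually does): a double holds integers exactly only up to 2**53, and 2**53 bytes happens to be exactly 8 PiB, so byte counters in the multi-PiB range sit right at the edge of exact float representation.

$ python3 -c 'print(2**53, 2**53 // 2**50)'
9007199254740992 8
$ python3 -c 'print(float(2**53 + 1) == 2**53 + 1)'
False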

On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff <janek.bevendorff(a)uni-weimar.de> wrote:
> I haven't seen any MGR hangs so far since I disabled the prometheus module. [...]

From janek.bevendorff@uni-weimar.de Thu Mar 26 16:43:14 2020
From: Janek Bevendorff
To: ceph-users@ceph.io
Subject: [ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic
Date: Thu, 26 Mar 2020 17:43:10 +0100

If there is actually a connection, then it's no wonder our MDS kept crashing. Our Ceph has 9.2PiB of available space at the moment.

On 26/03/2020 17:32, Paul Choi wrote:
> I can't quite explain what happened, but the Prometheus endpoint became stable after the free disk space for the largest pool went substantially lower than 1PB. [...]
(I don't = know if a busy > >>=C2=A0 =C2=A0 =C2=A0cluster would > >>=C2=A0 =C2=A0 =C2=A0>> seriously affect the mgr's performance, but ju= st throwing it > >>=C2=A0 =C2=A0 =C2=A0out there) > >>=C2=A0 =C2=A0 =C2=A0>> > >>=C2=A0 =C2=A0 =C2=A0>>=C2=A0 =C2=A0services: > >>=C2=A0 =C2=A0 =C2=A0>>=C2=A0 =C2=A0 =C2=A0mon: 5 daemons, quorum > >>=C2=A0 =C2=A0 =C2=A0>> woodenbox0,woodenbox2,woodenbox4,woodenbox3,wo= odenbox1 > >>=C2=A0 =C2=A0 =C2=A0>>=C2=A0 =C2=A0 =C2=A0mgr: woodenbox2(active), st= andbys: woodenbox0, > woodenbox1 > >>=C2=A0 =C2=A0 =C2=A0>>=C2=A0 =C2=A0 =C2=A0mds: cephfs-1/1/1 up=C2=A0 = {0=3Dwoodenbox6=3Dup:active}, 1 > >>=C2=A0 =C2=A0 =C2=A0up:standby-replay > >>=C2=A0 =C2=A0 =C2=A0>>=C2=A0 =C2=A0 =C2=A0osd: 3964 osds: 3928 up, 39= 28 in; 831 remapped pgs > >>=C2=A0 =C2=A0 =C2=A0>>=C2=A0 =C2=A0 =C2=A0rgw: 4 daemons active > >>=C2=A0 =C2=A0 =C2=A0>> > >>=C2=A0 =C2=A0 =C2=A0>> Thanks in advance for your help, > >>=C2=A0 =C2=A0 =C2=A0>> > >>=C2=A0 =C2=A0 =C2=A0>> -Paul Choi > >>=C2=A0 =C2=A0 =C2=A0>> > >>=C2=A0 =C2=A0 =C2=A0> _______________________________________________ > >>=C2=A0 =C2=A0 =C2=A0> ceph-users mailing list -- ceph-users(a)ceph.io > > >>=C2=A0 =C2=A0 =C2=A0> > >>=C2=A0 =C2=A0 =C2=A0> To unsubscribe send an email to ceph-users-leav= e(a)ceph.io > > >>=C2=A0 =C2=A0 =C2=A0 > > >> > > _______________________________________________ > > ceph-users mailing list -- ceph-users(a)ceph.io > > > To unsubscribe send an email to ceph-users-leave(a)ceph.io > > --===============6237504570762936148==-- From pchoi@nuro.ai Thu Mar 26 17:11:05 2020 From: Paul Choi To: ceph-users@ceph.io Subject: [ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic Date: Thu, 26 Mar 2020 10:10:57 -0700 Message-ID: In-Reply-To: 2e385b2d-7a75-ffd7-3599-c12d66b20b9c@uni-weimar.de MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="===============8313045499209756955==" --===============8313045499209756955== Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit I won't speculate more into the MDS's stability, but I do wonder about the same thing. There is one file served by the MDS that would cause the ceph-fuse client to hang. It was a file that many people in the company relied on for data updates, so very noticeable. The only fix was to fail over the MDS. Since the free disk space dropped, I haven't heard anyone complain... On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff < janek.bevendorff(a)uni-weimar.de> wrote: > If there is actually a connection, then it's no wonder our MDS kept > crashing. Our Ceph has 9.2PiB of available space at the moment. > > > On 26/03/2020 17:32, Paul Choi wrote: > > I can't quite explain what happened, but the Prometheus endpoint became > stable after the free disk space for the largest pool went substantially > lower than 1PB. > I wonder if there's some metric that exceeds the maximum size for some > int, double, etc? > > -Paul > > On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff < > janek.bevendorff(a)uni-weimar.de> wrote: > >> I haven't seen any MGR hangs so far since I disabled the prometheus >> module. It seems like the module is not only slow, but kills the whole >> MGR when the cluster is sufficiently large, so these two issues are most >> likely connected. The issue has become much, much worse with 14.2.8. >> >> >> On 23/03/2020 09:00, Janek Bevendorff wrote: >> > I am running the very latest version of Nautilus. I will try setting up >> > an external exporter today and see if that fixes anything. 
Our cluster >> > is somewhat large-ish with 1248 OSDs, so I expect stat collection to >> > take "some" time, but it definitely shouldn't crush the MGRs all the >> time. >> > >> > On 21/03/2020 02:33, Paul Choi wrote: >> >> Hi Janek, >> >> >> >> What version of Ceph are you using? >> >> We also have a much smaller cluster running Nautilus, with no MDS. No >> >> Prometheus issues there. >> >> I won't speculate further than this but perhaps Nautilus doesn't have >> >> the same issue as Mimic? >> >> >> >> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff >> >> > >> > wrote: >> >> >> >> I think this is related to my previous post to this list about MGRs >> >> failing regularly and being overall quite slow to respond. The >> problem >> >> has existed before, but the new version has made it way worse. My >> MGRs >> >> keep dyring every few hours and need to be restarted. the Promtheus >> >> plugin works, but it's pretty slow and so is the dashboard. >> >> Unfortunately, nobody seems to have a solution for this and I >> >> wonder why >> >> not more people are complaining about this problem. >> >> >> >> >> >> On 20/03/2020 19:30, Paul Choi wrote: >> >> > If I "curl http://localhost:9283/metrics" and wait sufficiently >> long >> >> > enough, I get this - says "No MON connection". But the mons are >> >> health and >> >> > the cluster is functioning fine. >> >> > That said, the mons' rocksdb sizes are fairly big because >> >> there's lots of >> >> > rebalancing going on. The Prometheus endpoint hanging seems to >> >> happen >> >> > regardless of the mon size anyhow. >> >> > >> >> > mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB) >> >> > mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB) >> >> > mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB) >> >> > mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB) >> >> > mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB) >> >> > >> >> > # fg >> >> > curl -H "Connection: close" http://localhost:9283/metrics >> >> > > >> > "-//W3C//DTD XHTML 1.0 Transitional//EN" >> >> > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> >> >> > >> >> > >> >> > >> >> > 503 Service Unavailable >> >> > >> >> > >> >> > >> >> >

503 Service Unavailable

>> >> >

No MON connection

>> >> >
Traceback (most recent call last):
>> >>     >   File
>> >>     "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line
>> 670,
>> >>     > in respond
>> >>     >     response.body = self.handler()
>> >>     >   File
>> >>     "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line
>> >>     > 217, in __call__
>> >>     >     self.body = self.oldhandler(*args, **kwargs)
>> >>     >   File
>> >>     "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line
>> 61,
>> >>     > in __call__
>> >>     >     return self.callable(*self.args, **self.kwargs)
>> >>     >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in
>> >>     metrics
>> >>     >     return self._metrics(instance)
>> >>     >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in
>> >>     _metrics
>> >>     >     raise cherrypy.HTTPError(503, 'No MON connection')
>> >>     > HTTPError: (503, 'No MON connection')
>> >>     > 
>> >> >
>> >> > >> >> > Powered by CherryPy >> >> 3.5.0 >> >> > >> >> >
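
For reference, the failover itself is a one-liner; listing the client sessions first is just a way to see who is attached before kicking the active MDS (the mds name and rank below match our cluster, so adjust for yours):

# list client sessions on the active MDS (run on the MDS host)
$ ceph daemon mds.woodenbox6 session ls

# fail rank 0 so the standby-replay daemon takes over
$ ceph mds fail 0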
>> >> > >> >> > >> >> > >> >> > On Fri, Mar 20, 2020 at 6:33 AM Paul Choi > >> > wrote: >> >> > >> >> >> Hello, >> >> >> >> >> >> We are running Mimic 13.2.8 with our cluster, and since >> >> upgrading to >> >> >> 13.2.8 the Prometheus plugin seems to hang a lot. It used to >> >> respond under >> >> >> 10s but now it often hangs. Restarting the mgr processes helps >> >> temporarily >> >> >> but within minutes it gets stuck again. >> >> >> >> >> >> The active mgr doesn't exit when doing `systemctl stop >> >> ceph-mgr.target" >> >> >> and needs to >> >> >> be kill -9'ed. >> >> >> >> >> >> Is there anything I can do to address this issue, or at least >> >> get better >> >> >> visibility into the issue? >> >> >> >> >> >> We only have a few plugins enabled: >> >> >> $ ceph mgr module ls >> >> >> { >> >> >> "enabled_modules": [ >> >> >> "balancer", >> >> >> "prometheus", >> >> >> "zabbix" >> >> >> ], >> >> >> >> >> >> 3 mgr processes, but it's a pretty large cluster (near 4000 >> >> OSDs) and it's >> >> >> a busy one with lots of rebalancing. (I don't know if a busy >> >> cluster would >> >> >> seriously affect the mgr's performance, but just throwing it >> >> out there) >> >> >> >> >> >> services: >> >> >> mon: 5 daemons, quorum >> >> >> woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1 >> >> >> mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1 >> >> >> mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 >> >> up:standby-replay >> >> >> osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs >> >> >> rgw: 4 daemons active >> >> >> >> >> >> Thanks in advance for your help, >> >> >> >> >> >> -Paul Choi >> >> >> >> >> > _______________________________________________ >> >> > ceph-users mailing list -- ceph-users(a)ceph.io >> >> >> >> > To unsubscribe send an email to ceph-users-leave(a)ceph.io >> >> >> >> >> > _______________________________________________ >> > ceph-users mailing list -- ceph-users(a)ceph.io >> > To unsubscribe send an email to ceph-users-leave(a)ceph.io >> > --===============8313045499209756955==-- From janek.bevendorff@uni-weimar.de Fri Mar 27 08:25:21 2020 From: Janek Bevendorff To: ceph-users@ceph.io Subject: [ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic Date: Fri, 27 Mar 2020 09:25:17 +0100 Message-ID: In-Reply-To: CALwDB-e=WXb5_T4NXW3zzgactU6vSgD9oKERCk=vnWNHU7u--w@mail.gmail.com MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="===============3547379919013120755==" --===============3547379919013120755== Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Sorry, I meant MGR of course. MDS are fine for me. But the MGRs were failing constantly due to the prometheus module doing something funny. On 26/03/2020 18:10, Paul Choi wrote: > I won't speculate more into the MDS's stability, but I do wonder about > the same thing. > There is one file served by the MDS that would cause the ceph-fuse > client to hang. It was a file that many people in the company relied > on for data updates, so very noticeable. The only fix was to fail over > the MDS. > > Since the free disk space dropped, I haven't heard anyone complain... > > > On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff > > wrote: > > If there is actually a connection, then it's no wonder our MDS > kept crashing. Our Ceph has 9.2PiB of available space at the moment. 
> > > On 26/03/2020 17:32, Paul Choi wrote: >> I can't quite explain what happened, but the Prometheus endpoint >> became stable after the free disk space for the largest pool went >> substantially lower than 1PB. >> I wonder if there's some metric that exceeds the maximum size for >> some int, double, etc? >> >> -Paul >> >> On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff >> > > wrote: >> >> I haven't seen any MGR hangs so far since I disabled the >> prometheus >> module. It seems like the module is not only slow, but kills >> the whole >> MGR when the cluster is sufficiently large, so these two >> issues are most >> likely connected. The issue has become much, much worse with >> 14.2.8. >> >> >> On 23/03/2020 09:00, Janek Bevendorff wrote: >> > I am running the very latest version of Nautilus. I will >> try setting up >> > an external exporter today and see if that fixes anything. >> Our cluster >> > is somewhat large-ish with 1248 OSDs, so I expect stat >> collection to >> > take "some" time, but it definitely shouldn't crush the >> MGRs all the time. >> > >> > On 21/03/2020 02:33, Paul Choi wrote: >> >> Hi Janek, >> >> >> >> What version of Ceph are you using? >> >> We also have a much smaller cluster running Nautilus, with >> no MDS. No >> >> Prometheus issues there. >> >> I won't speculate further than this but perhaps Nautilus >> doesn't have >> >> the same issue as Mimic? >> >> >> >> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff >> >> > >> >> > >> wrote: >> >> >> >>=C2=A0 =C2=A0 =C2=A0I think this is related to my previous post = to this >> list about MGRs >> >>=C2=A0 =C2=A0 =C2=A0failing regularly and being overall quite sl= ow to >> respond. The problem >> >>=C2=A0 =C2=A0 =C2=A0has existed before, but the new version has = made it >> way worse. My MGRs >> >>=C2=A0 =C2=A0 =C2=A0keep dyring every few hours and need to be r= estarted. >> the Promtheus >> >>=C2=A0 =C2=A0 =C2=A0plugin works, but it's pretty slow and so is= the >> dashboard. >> >>=C2=A0 =C2=A0 =C2=A0Unfortunately, nobody seems to have a soluti= on for >> this and I >> >>=C2=A0 =C2=A0 =C2=A0wonder why >> >>=C2=A0 =C2=A0 =C2=A0not more people are complaining about this p= roblem. >> >> >> >> >> >>=C2=A0 =C2=A0 =C2=A0On 20/03/2020 19:30, Paul Choi wrote: >> >>=C2=A0 =C2=A0 =C2=A0> If I "curl http://localhost:9283/metrics" = and wait >> sufficiently long >> >>=C2=A0 =C2=A0 =C2=A0> enough, I get this - says "No MON connecti= on". But >> the mons are >> >>=C2=A0 =C2=A0 =C2=A0health and >> >>=C2=A0 =C2=A0 =C2=A0> the cluster is functioning fine. >> >>=C2=A0 =C2=A0 =C2=A0> That said, the mons' rocksdb sizes are fai= rly big >> because >> >>=C2=A0 =C2=A0 =C2=A0there's lots of >> >>=C2=A0 =C2=A0 =C2=A0> rebalancing going on. The Prometheus endpo= int >> hanging seems to >> >>=C2=A0 =C2=A0 =C2=A0happen >> >>=C2=A0 =C2=A0 =C2=A0> regardless of the mon size anyhow. 
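A quick way to tell a wedged exporter from a merely slow one is to scrape
it with a hard timeout and, if it never answers, fail the active mgr over
to a standby instead of kill -9'ing it. A rough sketch (assumes the
default port 9283 and that jq is available; reading "ceph mgr dump" is
just one way to find the active mgr's name):

    # probe the exporter, give up after 30 seconds
    curl -sS --max-time 30 -o /dev/null \
         -w '%{http_code} %{time_total}s\n' http://localhost:9283/metrics

    # if it times out, hand the active role over to a standby
    active=$(ceph mgr dump | jq -r '.active_name')
    ceph mgr fail "$active"

The standby takes over with a freshly started prometheus module, which is
usually less disruptive than killing the daemon outright.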

--===============3547379919013120755==--

From shubjero@gmail.com Fri Mar 27 15:47:19 2020
From: shubjero
To: ceph-users@ceph.io
Subject: [ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic
Date: Fri, 27 Mar 2020 11:47:10 -0400
Message-ID:
In-Reply-To: de8b6053-f268-788e-ef80-ef8c39b7ce54@uni-weimar.de
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============4647195261961956795=="

--===============4647195261961956795==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

I've reported stability problems with ceph-mgr with the prometheus
plugin enabled on all versions we ran in production, which were several
versions of Luminous and Mimic. Our solution was to disable the
prometheus exporter; I am using Zabbix instead. Our cluster is 1404 OSDs
in size, with about 9 PB raw and around 35% utilization.

On Fri, Mar 27, 2020 at 4:26 AM Janek Bevendorff wrote:
>
> Sorry, I meant MGR of course. MDS are fine for me, but the MGRs were
> failing constantly due to the prometheus module doing something funny.
>
> [...]
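For anyone who wants to do the same, both steps are plain mgr commands; a
rough sketch (the Zabbix server name and the identifier are placeholders,
and the zabbix module expects the zabbix_sender binary to be installed on
the mgr hosts):

    # stop exporting to Prometheus
    ceph mgr module disable prometheus

    # report to Zabbix instead
    ceph mgr module enable zabbix
    ceph zabbix config-set zabbix_host zabbix.example.com   # placeholder
    ceph zabbix config-set identifier ceph-prod             # placeholder
    ceph zabbix config-show
    ceph zabbix send         # push one update by hand to verify it works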

--===============4647195261961956795==--

From jarett@reticulum.us Fri Mar 27 15:51:15 2020
From: Jarett DeAngelis
To: ceph-users@ceph.io
Subject: [ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic
Date: Fri, 27 Mar 2020 11:51:16 -0400
Message-ID: <2C201E9D-FFCA-4456-85E0-AC81F9DD01FF@reticulum.us>
In-Reply-To: CADGXWpPPP9F_ss_69W3C-9CtJbFWjBnAuicpqf3cLNU1Q-njCw@mail.gmail.com
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============4949493396856203969=="

--===============4949493396856203969==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

I'm actually very curious how well this is performing for you, as I've
definitely not seen a deployment this large. How do you use it?

> On Mar 27, 2020, at 11:47 AM, shubjero wrote:
>
> I've reported stability problems with ceph-mgr with the prometheus
> plugin enabled on all versions we ran in production, which were
> several versions of Luminous and Mimic. Our solution was to disable
> the prometheus exporter; I am using Zabbix instead. Our cluster is
> 1404 OSDs in size, with about 9 PB raw and around 35% utilization.
>
> [...]

--===============4949493396856203969==--

From janek.bevendorff@uni-weimar.de Wed Apr 1 09:30:31 2020
From: Janek Bevendorff
To: ceph-users@ceph.io
Subject: [ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic
Date: Wed, 01 Apr 2020 11:30:35 +0200
Message-ID:
In-Reply-To: 2C201E9D-FFCA-4456-85E0-AC81F9DD01FF@reticulum.us
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============7585005908105252432=="

--===============7585005908105252432==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

> I'm actually very curious how well this is performing for you, as I've
> definitely not seen a deployment this large. How do you use it?

What exactly do you mean? Our cluster has 11 PiB of capacity, of which
about 15% is used at the moment (web-scale corpora and such). We have
deployed 5 MONs and 5 MGRs (both on the same hosts) and it works totally
fine overall.
We have some MDS performance issues here and there, but that's not too
bad anymore after a few upstream patches. And then we have this annoying
Prometheus MGR problem, which kills our MGRs reliably after a few hours.

> [...]
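When the module takes the whole mgr down like that, the failure usually
shows up both in the cluster health and in the unit log of the active
mgr. A short checklist (a sketch; the mgr id in the journalctl call is a
placeholder for whatever your active daemon is called):

    ceph health detail       # look for MGR_MODULE_ERROR-style warnings
                             # such as "Module 'prometheus' has failed"
    ceph mgr module ls       # confirm which modules are still enabled
    journalctl -u ceph-mgr@mgr-host-1 --since "2 hours ago" \
        | grep -iE 'prometheus|traceback'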

--===============7585005908105252432==--