No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic - ceph-users

List overview All Threads
Download

newer

No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

older

osd can not start at boot after...

Multiple CephFS creation

Paul Choi

20 Mar 2020 20 Mar '20

4:33 p.m.

Hello, We are running Mimic 13.2.8 with our cluster, and since upgrading to 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond under 10s but now it often hangs. Restarting the mgr processes helps temporarily but within minutes it gets stuck again. The active mgr doesn't exit when doing `systemctl stop ceph-mgr.target" and needs to be kill -9'ed. Is there anything I can do to address this issue, or at least get better visibility into the issue? We only have a few plugins enabled: $ ceph mgr module ls { "enabled_modules": [ "balancer", "prometheus", "zabbix" ], 3 mgr processes, but it's a pretty large cluster (near 4000 OSDs) and it's a busy one with lots of rebalancing. (I don't know if a busy cluster would seriously affect the mgr's performance, but just throwing it out there) services: mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1 mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1 mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs rgw: 4 daemons active Thanks in advance for your help, -Paul Choi

Show replies by date

Paul Choi

20 Mar 20 Mar

6:30 p.m.

If I "curl http://localhost:9283/metrics" and wait sufficiently long enough, I get this - says "No MON connection". But the mons are health and the cluster is functioning fine. That said, the mons' rocksdb sizes are fairly big because there's lots of rebalancing going on. The Prometheus endpoint hanging seems to happen regardless of the mon size anyhow. mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB) mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB) mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB) mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB) mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB) # fg curl -H "Connection: close" http://localhost:9283/metrics <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta> <title>503 Service Unavailable</title> <style type="text/css"> #powered_by { margin-top: 20px; border-top: 2px solid black; font-style: italic; } #traceback { color: red; } </style> </head> <body> <h2>503 Service Unavailable</h2> <p>No MON connection</p> <pre id="traceback">Traceback (most recent call last): File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond response.body = self.handler() File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__ self.body = self.oldhandler(*args, **kwargs) File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__ return self.callable(*self.args, **self.kwargs) File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics return self._metrics(instance) File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics raise cherrypy.HTTPError(503, 'No MON connection') HTTPError: (503, 'No MON connection') </pre> <div id="powered_by"> <span> Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a> </span> </div> </body> </html> On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi(a)nuro.ai> wrote:

...

Janek Bevendorff

10:22 p.m.

I think this is related to my previous post to this list about MGRs failing regularly and being overall quite slow to respond. The problem has existed before, but the new version has made it way worse. My MGRs keep dyring every few hours and need to be restarted. the Promtheus plugin works, but it's pretty slow and so is the dashboard. Unfortunately, nobody seems to have a solution for this and I wonder why not more people are complaining about this problem. On 20/03/2020 19:30, Paul Choi wrote: > If I "curl http://localhost:9283/metrics" and wait sufficiently long > enough, I get this - says "No MON connection". But the mons are health and > the cluster is functioning fine. > That said, the mons' rocksdb sizes are fairly big because there's lots of > rebalancing going on. The Prometheus endpoint hanging seems to happen > regardless of the mon size anyhow. > > mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB) > mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB) > mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB) > mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB) > mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB) > > # fg > curl -H "Connection: close" http://localhost:9283/metrics > <!DOCTYPE html PUBLIC > "-//W3C//DTD XHTML 1.0 Transitional//EN" > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> > <html> > <head> > <meta http-equiv="Content-Type" content="text/html; > charset=utf-8"></meta> > <title>503 Service Unavailable</title> > <style type="text/css"> > #powered_by { > margin-top: 20px; > border-top: 2px solid black; > font-style: italic; > } > > #traceback { > color: red; > } > </style> > </head> > <body> > <h2>503 Service Unavailable</h2> > <p>No MON connection</p> > <pre id="traceback">Traceback (most recent call last): > File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, > in respond > response.body = self.handler() > File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line > 217, in __call__ > self.body = self.oldhandler(*args, **kwargs) > File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, > in __call__ > return self.callable(*self.args, **self.kwargs) > File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics > return self._metrics(instance) > File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics > raise cherrypy.HTTPError(503, 'No MON connection') > HTTPError: (503, 'No MON connection') > </pre> > <div id="powered_by"> > <span> > Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a> > </span> > </div> > </body> > </html> > > On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi(a)nuro.ai> wrote: > >> Hello, >> >> We are running Mimic 13.2.8 with our cluster, and since upgrading to >> 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond under >> 10s but now it often hangs. Restarting the mgr processes helps temporarily >> but within minutes it gets stuck again. >> >> The active mgr doesn't exit when doing `systemctl stop ceph-mgr.target" >> and needs to >> be kill -9'ed. >> >> Is there anything I can do to address this issue, or at least get better >> visibility into the issue? >> >> We only have a few plugins enabled: >> $ ceph mgr module ls >> { >> "enabled_modules": [ >> "balancer", >> "prometheus", >> "zabbix" >> ], >> >> 3 mgr processes, but it's a pretty large cluster (near 4000 OSDs) and it's >> a busy one with lots of rebalancing. (I don't know if a busy cluster would >> seriously affect the mgr's performance, but just throwing it out there) >> >> services: >> mon: 5 daemons, quorum >> woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1 >> mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1 >> mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay >> osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs >> rgw: 4 daemons active >> >> Thanks in advance for your help, >> >> -Paul Choi >> > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io

Paul Choi

21 Mar 21 Mar

1:33 a.m.

...

If I "curl http://localhost:9283/metrics" and wait sufficiently long enough, I get this - says "No MON connection". But the mons are health

and

the cluster is functioning fine. That said, the mons' rocksdb sizes are fairly big because there's lots of rebalancing going on. The Prometheus endpoint hanging seems to happen regardless of the mon size anyhow. mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB) mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB) mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB) mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB) mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB) # fg curl -H "Connection: close" http://localhost:9283/metrics <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta> <title>503 Service Unavailable</title> <style type="text/css"> #powered_by { margin-top: 20px; border-top: 2px solid black; font-style: italic; } #traceback { color: red; } </style> </head> <body> <h2>503 Service Unavailable</h2> <p>No MON connection</p> <pre id="traceback">Traceback (most recent call last): File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line

670,

in respond response.body = self.handler() File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__ self.body = self.oldhandler(*args, **kwargs) File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line

61,

in __call__ return self.callable(*self.args, **self.kwargs) File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics return self._metrics(instance) File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics raise cherrypy.HTTPError(503, 'No MON connection') HTTPError: (503, 'No MON connection') </pre> <div id="powered_by"> <span> Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a> </span> </div> </body> </html> On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi(a)nuro.ai> wrote: > Hello, > > We are running Mimic 13.2.8 with our cluster, and since upgrading to > 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond

under

> 10s but now it often hangs. Restarting the mgr processes helps

temporarily

> but within minutes it gets stuck again. > > The active mgr doesn't exit when doing `systemctl stop ceph-mgr.target" > and needs to > be kill -9'ed. > > Is there anything I can do to address this issue, or at least get better > visibility into the issue? > > We only have a few plugins enabled: > $ ceph mgr module ls > { > "enabled_modules": [ > "balancer", > "prometheus", > "zabbix" > ], > > 3 mgr processes, but it's a pretty large cluster (near 4000 OSDs) and

it's

> a busy one with lots of rebalancing. (I don't know if a busy cluster

would

seriously affect the mgr's performance, but just throwing it out there) services: mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1 mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1 mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs rgw: 4 daemons active Thanks in advance for your help, -Paul Choi

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Janek Bevendorff

23 Mar 23 Mar

8 a.m.

I am running the very latest version of Nautilus. I will try setting up an external exporter today and see if that fixes anything. Our cluster is somewhat large-ish with 1248 OSDs, so I expect stat collection to take "some" time, but it definitely shouldn't crush the MGRs all the time. On 21/03/2020 02:33, Paul Choi wrote:

...

Hi Janek, What version of Ceph are you using? We also have a much smaller cluster running Nautilus, with no MDS. No Prometheus issues there. I won't speculate further than this but perhaps Nautilus doesn't have the same issue as Mimic? On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff <janek.bevendorff(a)uni-weimar.de <mailto:janek.bevendorff@uni-weimar.de>> wrote: I think this is related to my previous post to this list about MGRs failing regularly and being overall quite slow to respond. The problem has existed before, but the new version has made it way worse. My MGRs keep dyring every few hours and need to be restarted. the Promtheus plugin works, but it's pretty slow and so is the dashboard. Unfortunately, nobody seems to have a solution for this and I wonder why not more people are complaining about this problem. On 20/03/2020 19:30, Paul Choi wrote:

If I "curl http://localhost:9283/metrics" and wait sufficiently long enough, I get this - says "No MON connection". But the mons are

health and

the cluster is functioning fine. That said, the mons' rocksdb sizes are fairly big because

there's lots of

rebalancing going on. The Prometheus endpoint hanging seems to

happen

regardless of the mon size anyhow. mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB) mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB) mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB) mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB) mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB) # fg curl -H "Connection: close" http://localhost:9283/metrics <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta> <title>503 Service Unavailable</title> <style type="text/css"> #powered_by { margin-top: 20px; border-top: 2px solid black; font-style: italic; } #traceback { color: red; } </style> </head> <body> <h2>503 Service Unavailable</h2> <p>No MON connection</p> <pre id="traceback">Traceback (most recent call last): File

"/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670,

in respond response.body = self.handler() File

"/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line

217, in __call__ self.body = self.oldhandler(*args, **kwargs) File

"/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61,

in __call__ return self.callable(*self.args, **self.kwargs) File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in

metrics

return self._metrics(instance) File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in

_metrics

raise cherrypy.HTTPError(503, 'No MON connection') HTTPError: (503, 'No MON connection') </pre> <div id="powered_by"> <span> Powered by <a href="http://www.cherrypy.org">CherryPy

3.5.0</a>

</span> </div> </body> </html> On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi(a)nuro.ai

<mailto:pchoi@nuro.ai>> wrote:

> Hello, > > We are running Mimic 13.2.8 with our cluster, and since

upgrading to

> 13.2.8 the Prometheus plugin seems to hang a lot. It used to

respond under

> 10s but now it often hangs. Restarting the mgr processes helps

temporarily

> but within minutes it gets stuck again. > > The active mgr doesn't exit when doing `systemctl stop

ceph-mgr.target"

> and needs to > be kill -9'ed. > > Is there anything I can do to address this issue, or at least

get better

> visibility into the issue? > > We only have a few plugins enabled: > $ ceph mgr module ls > { > "enabled_modules": [ > "balancer", > "prometheus", > "zabbix" > ], > > 3 mgr processes, but it's a pretty large cluster (near 4000

OSDs) and it's

> a busy one with lots of rebalancing. (I don't know if a busy

cluster would

> seriously affect the mgr's performance, but just throwing it

out there)

> > services: > mon: 5 daemons, quorum > woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1 > mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1 > mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1

up:standby-replay

osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs rgw: 4 daemons active Thanks in advance for your help, -Paul Choi

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io

<mailto:ceph-users@ceph.io>

To unsubscribe send an email to ceph-users-leave(a)ceph.io

<mailto:ceph-users-leave@ceph.io>

Janek Bevendorff

4:50 p.m.

I haven't seen any MGR hangs so far since I disabled the prometheus module. It seems like the module is not only slow, but kills the whole MGR when the cluster is sufficiently large, so these two issues are most likely connected. The issue has become much, much worse with 14.2.8. On 23/03/2020 09:00, Janek Bevendorff wrote: > I am running the very latest version of Nautilus. I will try setting up > an external exporter today and see if that fixes anything. Our cluster > is somewhat large-ish with 1248 OSDs, so I expect stat collection to > take "some" time, but it definitely shouldn't crush the MGRs all the time. > > On 21/03/2020 02:33, Paul Choi wrote: >> Hi Janek, >> >> What version of Ceph are you using? >> We also have a much smaller cluster running Nautilus, with no MDS. No >> Prometheus issues there. >> I won't speculate further than this but perhaps Nautilus doesn't have >> the same issue as Mimic? >> >> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff >> <janek.bevendorff(a)uni-weimar.de >> <mailto:janek.bevendorff@uni-weimar.de>> wrote: >> >> I think this is related to my previous post to this list about MGRs >> failing regularly and being overall quite slow to respond. The problem >> has existed before, but the new version has made it way worse. My MGRs >> keep dyring every few hours and need to be restarted. the Promtheus >> plugin works, but it's pretty slow and so is the dashboard. >> Unfortunately, nobody seems to have a solution for this and I >> wonder why >> not more people are complaining about this problem. >> >> >> On 20/03/2020 19:30, Paul Choi wrote: >> > If I "curl http://localhost:9283/metrics" and wait sufficiently long >> > enough, I get this - says "No MON connection". But the mons are >> health and >> > the cluster is functioning fine. >> > That said, the mons' rocksdb sizes are fairly big because >> there's lots of >> > rebalancing going on. The Prometheus endpoint hanging seems to >> happen >> > regardless of the mon size anyhow. >> > >> > mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB) >> > mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB) >> > mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB) >> > mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB) >> > mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB) >> > >> > # fg >> > curl -H "Connection: close" http://localhost:9283/metrics >> > <!DOCTYPE html PUBLIC >> > "-//W3C//DTD XHTML 1.0 Transitional//EN" >> > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> >> > <html> >> > <head> >> > <meta http-equiv="Content-Type" content="text/html; >> > charset=utf-8"></meta> >> > <title>503 Service Unavailable</title> >> > <style type="text/css"> >> > #powered_by { >> > margin-top: 20px; >> > border-top: 2px solid black; >> > font-style: italic; >> > } >> > >> > #traceback { >> > color: red; >> > } >> > </style> >> > </head> >> > <body> >> > <h2>503 Service Unavailable</h2> >> > <p>No MON connection</p> >> > <pre id="traceback">Traceback (most recent call last): >> > File >> "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, >> > in respond >> > response.body = self.handler() >> > File >> "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line >> > 217, in __call__ >> > self.body = self.oldhandler(*args, **kwargs) >> > File >> "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, >> > in __call__ >> > return self.callable(*self.args, **self.kwargs) >> > File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in >> metrics >> > return self._metrics(instance) >> > File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in >> _metrics >> > raise cherrypy.HTTPError(503, 'No MON connection') >> > HTTPError: (503, 'No MON connection') >> > </pre> >> > <div id="powered_by"> >> > <span> >> > Powered by <a href="http://www.cherrypy.org">CherryPy >> 3.5.0</a> >> > </span> >> > </div> >> > </body> >> > </html> >> > >> > On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi(a)nuro.ai >> <mailto:pchoi@nuro.ai>> wrote: >> > >> >> Hello, >> >> >> >> We are running Mimic 13.2.8 with our cluster, and since >> upgrading to >> >> 13.2.8 the Prometheus plugin seems to hang a lot. It used to >> respond under >> >> 10s but now it often hangs. Restarting the mgr processes helps >> temporarily >> >> but within minutes it gets stuck again. >> >> >> >> The active mgr doesn't exit when doing `systemctl stop >> ceph-mgr.target" >> >> and needs to >> >> be kill -9'ed. >> >> >> >> Is there anything I can do to address this issue, or at least >> get better >> >> visibility into the issue? >> >> >> >> We only have a few plugins enabled: >> >> $ ceph mgr module ls >> >> { >> >> "enabled_modules": [ >> >> "balancer", >> >> "prometheus", >> >> "zabbix" >> >> ], >> >> >> >> 3 mgr processes, but it's a pretty large cluster (near 4000 >> OSDs) and it's >> >> a busy one with lots of rebalancing. (I don't know if a busy >> cluster would >> >> seriously affect the mgr's performance, but just throwing it >> out there) >> >> >> >> services: >> >> mon: 5 daemons, quorum >> >> woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1 >> >> mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1 >> >> mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 >> up:standby-replay >> >> osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs >> >> rgw: 4 daemons active >> >> >> >> Thanks in advance for your help, >> >> >> >> -Paul Choi >> >> >> > _______________________________________________ >> > ceph-users mailing list -- ceph-users(a)ceph.io >> <mailto:ceph-users@ceph.io> >> > To unsubscribe send an email to ceph-users-leave(a)ceph.io >> <mailto:ceph-users-leave@ceph.io> >> > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io

Janek Bevendorff

5:06 p.m.

I dug up this issue report, where the problem has been reported before: https://tracker.ceph.com/issues/39264 Unfortuantely, the issue hasn't got much (or any) attention yet. So let's get this fixed, the prometheus module is unusable in its current state. On 23/03/2020 17:50, Janek Bevendorff wrote: > I haven't seen any MGR hangs so far since I disabled the prometheus > module. It seems like the module is not only slow, but kills the whole > MGR when the cluster is sufficiently large, so these two issues are most > likely connected. The issue has become much, much worse with 14.2.8. > > > On 23/03/2020 09:00, Janek Bevendorff wrote: >> I am running the very latest version of Nautilus. I will try setting up >> an external exporter today and see if that fixes anything. Our cluster >> is somewhat large-ish with 1248 OSDs, so I expect stat collection to >> take "some" time, but it definitely shouldn't crush the MGRs all the time. >> >> On 21/03/2020 02:33, Paul Choi wrote: >>> Hi Janek, >>> >>> What version of Ceph are you using? >>> We also have a much smaller cluster running Nautilus, with no MDS. No >>> Prometheus issues there. >>> I won't speculate further than this but perhaps Nautilus doesn't have >>> the same issue as Mimic? >>> >>> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff >>> <janek.bevendorff(a)uni-weimar.de >>> <mailto:janek.bevendorff@uni-weimar.de>> wrote: >>> >>> I think this is related to my previous post to this list about MGRs >>> failing regularly and being overall quite slow to respond. The problem >>> has existed before, but the new version has made it way worse. My MGRs >>> keep dyring every few hours and need to be restarted. the Promtheus >>> plugin works, but it's pretty slow and so is the dashboard. >>> Unfortunately, nobody seems to have a solution for this and I >>> wonder why >>> not more people are complaining about this problem. >>> >>> >>> On 20/03/2020 19:30, Paul Choi wrote: >>> > If I "curl http://localhost:9283/metrics" and wait sufficiently long >>> > enough, I get this - says "No MON connection". But the mons are >>> health and >>> > the cluster is functioning fine. >>> > That said, the mons' rocksdb sizes are fairly big because >>> there's lots of >>> > rebalancing going on. The Prometheus endpoint hanging seems to >>> happen >>> > regardless of the mon size anyhow. >>> > >>> > mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB) >>> > mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB) >>> > mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB) >>> > mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB) >>> > mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB) >>> > >>> > # fg >>> > curl -H "Connection: close" http://localhost:9283/metrics >>> > <!DOCTYPE html PUBLIC >>> > "-//W3C//DTD XHTML 1.0 Transitional//EN" >>> > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> >>> > <html> >>> > <head> >>> > <meta http-equiv="Content-Type" content="text/html; >>> > charset=utf-8"></meta> >>> > <title>503 Service Unavailable</title> >>> > <style type="text/css"> >>> > #powered_by { >>> > margin-top: 20px; >>> > border-top: 2px solid black; >>> > font-style: italic; >>> > } >>> > >>> > #traceback { >>> > color: red; >>> > } >>> > </style> >>> > </head> >>> > <body> >>> > <h2>503 Service Unavailable</h2> >>> > <p>No MON connection</p> >>> > <pre id="traceback">Traceback (most recent call last): >>> > File >>> "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, >>> > in respond >>> > response.body = self.handler() >>> > File >>> "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line >>> > 217, in __call__ >>> > self.body = self.oldhandler(*args, **kwargs) >>> > File >>> "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, >>> > in __call__ >>> > return self.callable(*self.args, **self.kwargs) >>> > File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in >>> metrics >>> > return self._metrics(instance) >>> > File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in >>> _metrics >>> > raise cherrypy.HTTPError(503, 'No MON connection') >>> > HTTPError: (503, 'No MON connection') >>> > </pre> >>> > <div id="powered_by"> >>> > <span> >>> > Powered by <a href="http://www.cherrypy.org">CherryPy >>> 3.5.0</a> >>> > </span> >>> > </div> >>> > </body> >>> > </html> >>> > >>> > On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi(a)nuro.ai >>> <mailto:pchoi@nuro.ai>> wrote: >>> > >>> >> Hello, >>> >> >>> >> We are running Mimic 13.2.8 with our cluster, and since >>> upgrading to >>> >> 13.2.8 the Prometheus plugin seems to hang a lot. It used to >>> respond under >>> >> 10s but now it often hangs. Restarting the mgr processes helps >>> temporarily >>> >> but within minutes it gets stuck again. >>> >> >>> >> The active mgr doesn't exit when doing `systemctl stop >>> ceph-mgr.target" >>> >> and needs to >>> >> be kill -9'ed. >>> >> >>> >> Is there anything I can do to address this issue, or at least >>> get better >>> >> visibility into the issue? >>> >> >>> >> We only have a few plugins enabled: >>> >> $ ceph mgr module ls >>> >> { >>> >> "enabled_modules": [ >>> >> "balancer", >>> >> "prometheus", >>> >> "zabbix" >>> >> ], >>> >> >>> >> 3 mgr processes, but it's a pretty large cluster (near 4000 >>> OSDs) and it's >>> >> a busy one with lots of rebalancing. (I don't know if a busy >>> cluster would >>> >> seriously affect the mgr's performance, but just throwing it >>> out there) >>> >> >>> >> services: >>> >> mon: 5 daemons, quorum >>> >> woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1 >>> >> mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1 >>> >> mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 >>> up:standby-replay >>> >> osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs >>> >> rgw: 4 daemons active >>> >> >>> >> Thanks in advance for your help, >>> >> >>> >> -Paul Choi >>> >> >>> > _______________________________________________ >>> > ceph-users mailing list -- ceph-users(a)ceph.io >>> <mailto:ceph-users@ceph.io> >>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io >>> <mailto:ceph-users-leave@ceph.io> >>> >> _______________________________________________ >> ceph-users mailing list -- ceph-users(a)ceph.io >> To unsubscribe send an email to ceph-users-leave(a)ceph.io > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io

Paul Choi

26 Mar 26 Mar

4:32 p.m.

...

time.

On 21/03/2020 02:33, Paul Choi wrote: > Hi Janek, > > What version of Ceph are you using? > We also have a much smaller cluster running Nautilus, with no MDS. No > Prometheus issues there. > I won't speculate further than this but perhaps Nautilus doesn't have > the same issue as Mimic? > > On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff > <janek.bevendorff(a)uni-weimar.de > <mailto:janek.bevendorff@uni-weimar.de>> wrote: > > I think this is related to my previous post to this list about MGRs > failing regularly and being overall quite slow to respond. The

problem

> has existed before, but the new version has made it way worse. My

MGRs

> keep dyring every few hours and need to be restarted. the Promtheus > plugin works, but it's pretty slow and so is the dashboard. > Unfortunately, nobody seems to have a solution for this and I > wonder why > not more people are complaining about this problem. > > > On 20/03/2020 19:30, Paul Choi wrote: > > If I "curl http://localhost:9283/metrics" and wait sufficiently

long

enough, I get this - says "No MON connection". But the mons are

health and

the cluster is functioning fine. That said, the mons' rocksdb sizes are fairly big because

there's lots of

rebalancing going on. The Prometheus endpoint hanging seems to

happen

"/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670,

in respond response.body = self.handler() File

"/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line

217, in __call__ self.body = self.oldhandler(*args, **kwargs) File

"/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61,

in __call__ return self.callable(*self.args, **self.kwargs) File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in

metrics

return self._metrics(instance) File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in

_metrics

raise cherrypy.HTTPError(503, 'No MON connection') HTTPError: (503, 'No MON connection') </pre> <div id="powered_by"> <span> Powered by <a href="http://www.cherrypy.org">CherryPy

3.5.0</a>

</span> </div> </body> </html> On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi(a)nuro.ai

<mailto:pchoi@nuro.ai>> wrote:

> Hello, > > We are running Mimic 13.2.8 with our cluster, and since

upgrading to

> 13.2.8 the Prometheus plugin seems to hang a lot. It used to

respond under

> 10s but now it often hangs. Restarting the mgr processes helps

temporarily

> but within minutes it gets stuck again. > > The active mgr doesn't exit when doing `systemctl stop

ceph-mgr.target"

> and needs to > be kill -9'ed. > > Is there anything I can do to address this issue, or at least

get better

OSDs) and it's

> a busy one with lots of rebalancing. (I don't know if a busy

cluster would

> seriously affect the mgr's performance, but just throwing it

out there)

up:standby-replay

> osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs > rgw: 4 daemons active > > Thanks in advance for your help, > > -Paul Choi > _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io

<mailto:ceph-users@ceph.io>

To unsubscribe send an email to ceph-users-leave(a)ceph.io

<mailto:ceph-users-leave@ceph.io>

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Janek Bevendorff

4:43 p.m.

If there is actually a connection, then it's no wonder our MDS kept crashing. Our Ceph has 9.2PiB of available space at the moment. On 26/03/2020 17:32, Paul Choi wrote:

...

I can't quite explain what happened, but the Prometheus endpoint became stable after the free disk space for the largest pool went substantially lower than 1PB. I wonder if there's some metric that exceeds the maximum size for some int, double, etc? -Paul On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff <janek.bevendorff(a)uni-weimar.de <mailto:janek.bevendorff@uni-weimar.de>> wrote: I haven't seen any MGR hangs so far since I disabled the prometheus module. It seems like the module is not only slow, but kills the whole MGR when the cluster is sufficiently large, so these two issues are most likely connected. The issue has become much, much worse with 14.2.8. On 23/03/2020 09:00, Janek Bevendorff wrote:

I am running the very latest version of Nautilus. I will try

setting up

an external exporter today and see if that fixes anything. Our

cluster

is somewhat large-ish with 1248 OSDs, so I expect stat collection to take "some" time, but it definitely shouldn't crush the MGRs all

the time.

On 21/03/2020 02:33, Paul Choi wrote: > Hi Janek, > > What version of Ceph are you using? > We also have a much smaller cluster running Nautilus, with no

MDS. No

> Prometheus issues there. > I won't speculate further than this but perhaps Nautilus

doesn't have

> the same issue as Mimic? > > On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff > <janek.bevendorff(a)uni-weimar.de

<mailto:janek.bevendorff@uni-weimar.de>

> <mailto:janek.bevendorff@uni-weimar.de

<mailto:janek.bevendorff@uni-weimar.de>>> wrote:

> > I think this is related to my previous post to this list

about MGRs

> failing regularly and being overall quite slow to respond.

The problem

> has existed before, but the new version has made it way

worse. My MGRs

> keep dyring every few hours and need to be restarted. the

Promtheus

> plugin works, but it's pretty slow and so is the dashboard. > Unfortunately, nobody seems to have a solution for this and I > wonder why > not more people are complaining about this problem. > > > On 20/03/2020 19:30, Paul Choi wrote: > > If I "curl http://localhost:9283/metrics" and wait

sufficiently long

> > enough, I get this - says "No MON connection". But the

mons are

> health and > > the cluster is functioning fine. > > That said, the mons' rocksdb sizes are fairly big because > there's lots of > > rebalancing going on. The Prometheus endpoint hanging

seems to

> happen > > regardless of the mon size anyhow. > > > > mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB) > > mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB) > > mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB) > > mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB) > > mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB) > > > > # fg > > curl -H "Connection: close" http://localhost:9283/metrics > > <!DOCTYPE html PUBLIC > > "-//W3C//DTD XHTML 1.0 Transitional//EN" > > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> > > <html> > > <head> > > <meta http-equiv="Content-Type" content="text/html; > > charset=utf-8"></meta> > > <title>503 Service Unavailable</title> > > <style type="text/css"> > > #powered_by { > > margin-top: 20px; > > border-top: 2px solid black; > > font-style: italic; > > } > > > > #traceback { > > color: red; > > } > > </style> > > </head> > > <body> > > <h2>503 Service Unavailable</h2> > > <p>No MON connection</p> > > <pre id="traceback">Traceback (most recent call

last):

> > File > "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py",

line 670,

> > in respond > > response.body = self.handler() > > File >

"/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line

> > 217, in __call__ > > self.body = self.oldhandler(*args, **kwargs) > > File > "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py",

line 61,

> > in __call__ > > return self.callable(*self.args, **self.kwargs) > > File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in > metrics > > return self._metrics(instance) > > File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in > _metrics > > raise cherrypy.HTTPError(503, 'No MON connection') > > HTTPError: (503, 'No MON connection') > > </pre> > > <div id="powered_by"> > > <span> > > Powered by <a href="http://www.cherrypy.org">CherryPy > 3.5.0</a> > > </span> > > </div> > > </body> > > </html> > > > > On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi(a)nuro.ai

<mailto:pchoi@nuro.ai>

> <mailto:pchoi@nuro.ai <mailto:pchoi@nuro.ai>>> wrote: > > > >> Hello, > >> > >> We are running Mimic 13.2.8 with our cluster, and since > upgrading to > >> 13.2.8 the Prometheus plugin seems to hang a lot. It used to > respond under > >> 10s but now it often hangs. Restarting the mgr processes

helps

> temporarily > >> but within minutes it gets stuck again. > >> > >> The active mgr doesn't exit when doing `systemctl stop > ceph-mgr.target" > >> and needs to > >> be kill -9'ed. > >> > >> Is there anything I can do to address this issue, or at

least

> get better > >> visibility into the issue? > >> > >> We only have a few plugins enabled: > >> $ ceph mgr module ls > >> { > >> "enabled_modules": [ > >> "balancer", > >> "prometheus", > >> "zabbix" > >> ], > >> > >> 3 mgr processes, but it's a pretty large cluster (near 4000 > OSDs) and it's > >> a busy one with lots of rebalancing. (I don't know if a busy > cluster would > >> seriously affect the mgr's performance, but just throwing it > out there) > >> > >> services: > >> mon: 5 daemons, quorum > >> woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1 > >> mgr: woodenbox2(active), standbys: woodenbox0,

woodenbox1

> >> mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 > up:standby-replay > >> osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs > >> rgw: 4 daemons active > >> > >> Thanks in advance for your help, > >> > >> -Paul Choi > >> > > _______________________________________________ > > ceph-users mailing list -- ceph-users(a)ceph.io

<mailto:ceph-users@ceph.io>

> <mailto:ceph-users@ceph.io <mailto:ceph-users@ceph.io>> > > To unsubscribe send an email to ceph-users-leave(a)ceph.io

<mailto:ceph-users-leave@ceph.io>

> <mailto:ceph-users-leave@ceph.io

<mailto:ceph-users-leave@ceph.io>>

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io

<mailto:ceph-users@ceph.io>

To unsubscribe send an email to ceph-users-leave(a)ceph.io

<mailto:ceph-users-leave@ceph.io>

Paul Choi

5:10 p.m.

...

If there is actually a connection, then it's no wonder our MDS kept crashing. Our Ceph has 9.2PiB of available space at the moment. On 26/03/2020 17:32, Paul Choi wrote: I can't quite explain what happened, but the Prometheus endpoint became stable after the free disk space for the largest pool went substantially lower than 1PB. I wonder if there's some metric that exceeds the maximum size for some int, double, etc? -Paul On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff < janek.bevendorff(a)uni-weimar.de> wrote:

time.

problem

> has existed before, but the new version has made it way worse. My

MGRs

long

> > enough, I get this - says "No MON connection". But the mons are > health and > > the cluster is functioning fine. > > That said, the mons' rocksdb sizes are fairly big because > there's lots of > > rebalancing going on. The Prometheus endpoint hanging seems to > happen > > regardless of the mon size anyhow. > > > > mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB) > > mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB) > > mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB) > > mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB) > > mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB) > > > > # fg > > curl -H "Connection: close" http://localhost:9283/metrics > > <!DOCTYPE html PUBLIC > > "-//W3C//DTD XHTML 1.0 Transitional//EN" > > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> > > <html> > > <head> > > <meta http-equiv="Content-Type" content="text/html; > > charset=utf-8"></meta> > > <title>503 Service Unavailable</title> > > <style type="text/css"> > > #powered_by { > > margin-top: 20px; > > border-top: 2px solid black; > > font-style: italic; > > } > > > > #traceback { > > color: red; > > } > > </style> > > </head> > > <body> > > <h2>503 Service Unavailable</h2> > > <p>No MON connection</p> > > <pre id="traceback">Traceback (most recent call last): > > File > "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line

670,

> > in respond > > response.body = self.handler() > > File > "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line > > 217, in __call__ > > self.body = self.oldhandler(*args, **kwargs) > > File > "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line

61,

> in __call__ > return self.callable(*self.args, **self.kwargs) > File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics > return self._metrics(instance) > File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics > raise cherrypy.HTTPError(503, 'No MON connection') > HTTPError: (503, 'No MON connection') > </pre> > <div id="powered_by"> > <span> > Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a> > </span> > </div> > </body> > </html> > > On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi(a)nuro.ai <mailto:pchoi@nuro.ai>> wrote: > >> Hello, >> >> We are running Mimic 13.2.8 with our cluster, and since upgrading to >> 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond under >> 10s but now it often hangs. Restarting the mgr processes helps temporarily >> but within minutes it gets stuck again. >> >> The active mgr doesn't exit when doing `systemctl stop ceph-mgr.target" >> and needs to >> be kill -9'ed. >> >> Is there anything I can do to address this issue, or at least get better >> visibility into the issue? >> >> We only have a few plugins enabled: >> $ ceph mgr module ls >> { >> "enabled_modules": [ >> "balancer", >> "prometheus", >> "zabbix" >> ], >> >> 3 mgr processes, but it's a pretty large cluster (near 4000 OSDs) and it's >> a busy one with lots of rebalancing. (I don't know if a busy cluster would >> seriously affect the mgr's performance, but just throwing it out there) >> >> services: >> mon: 5 daemons, quorum >> woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1 >> mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1 >> mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay >> osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs >> rgw: 4 daemons active >> >> Thanks in advance for your help, >> >> -Paul Choi >> > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io <mailto:ceph-users@ceph.io> > To unsubscribe send an email to ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io>

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Janek Bevendorff

27 Mar 27 Mar

8:25 a.m.

Sorry, I meant MGR of course. MDS are fine for me. But the MGRs were failing constantly due to the prometheus module doing something funny. On 26/03/2020 18:10, Paul Choi wrote:

...

I won't speculate more into the MDS's stability, but I do wonder about the same thing. There is one file served by the MDS that would cause the ceph-fuse client to hang. It was a file that many people in the company relied on for data updates, so very noticeable. The only fix was to fail over the MDS. Since the free disk space dropped, I haven't heard anyone complain... <shrug> On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff <janek.bevendorff(a)uni-weimar.de <mailto:janek.bevendorff@uni-weimar.de>> wrote: If there is actually a connection, then it's no wonder our MDS kept crashing. Our Ceph has 9.2PiB of available space at the moment. On 26/03/2020 17:32, Paul Choi wrote: > I can't quite explain what happened, but the Prometheus endpoint > became stable after the free disk space for the largest pool went > substantially lower than 1PB. > I wonder if there's some metric that exceeds the maximum size for > some int, double, etc? > > -Paul > > On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff > <janek.bevendorff(a)uni-weimar.de > <mailto:janek.bevendorff@uni-weimar.de>> wrote: > > I haven't seen any MGR hangs so far since I disabled the > prometheus > module. It seems like the module is not only slow, but kills > the whole > MGR when the cluster is sufficiently large, so these two > issues are most > likely connected. The issue has become much, much worse with > 14.2.8. > > > On 23/03/2020 09:00, Janek Bevendorff wrote: > > I am running the very latest version of Nautilus. I will > try setting up > > an external exporter today and see if that fixes anything. > Our cluster > > is somewhat large-ish with 1248 OSDs, so I expect stat > collection to > > take "some" time, but it definitely shouldn't crush the > MGRs all the time. > > > > On 21/03/2020 02:33, Paul Choi wrote: > >> Hi Janek, > >> > >> What version of Ceph are you using? > >> We also have a much smaller cluster running Nautilus, with > no MDS. No > >> Prometheus issues there. > >> I won't speculate further than this but perhaps Nautilus > doesn't have > >> the same issue as Mimic? > >> > >> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff > >> <janek.bevendorff(a)uni-weimar.de > <mailto:janek.bevendorff@uni-weimar.de> > >> <mailto:janek.bevendorff@uni-weimar.de > <mailto:janek.bevendorff@uni-weimar.de>>> wrote: > >> > >> I think this is related to my previous post to this > list about MGRs > >> failing regularly and being overall quite slow to > respond. The problem > >> has existed before, but the new version has made it > way worse. My MGRs > >> keep dyring every few hours and need to be restarted. > the Promtheus > >> plugin works, but it's pretty slow and so is the > dashboard. > >> Unfortunately, nobody seems to have a solution for > this and I > >> wonder why > >> not more people are complaining about this problem. > >> > >> > >> On 20/03/2020 19:30, Paul Choi wrote: > >> > If I "curl http://localhost:9283/metrics" and wait > sufficiently long > >> > enough, I get this - says "No MON connection". But > the mons are > >> health and > >> > the cluster is functioning fine. > >> > That said, the mons' rocksdb sizes are fairly big > because > >> there's lots of > >> > rebalancing going on. The Prometheus endpoint > hanging seems to > >> happen > >> > regardless of the mon size anyhow. > >> > > >> > mon.woodenbox0 is 41 GiB >= mon_data_size_warn > (15 GiB) > >> > mon.woodenbox2 is 26 GiB >= mon_data_size_warn > (15 GiB) > >> > mon.woodenbox4 is 42 GiB >= mon_data_size_warn > (15 GiB) > >> > mon.woodenbox3 is 43 GiB >= mon_data_size_warn > (15 GiB) > >> > mon.woodenbox1 is 38 GiB >= mon_data_size_warn > (15 GiB) > >> > > >> > # fg > >> > curl -H "Connection: close" > http://localhost:9283/metrics > >> > <!DOCTYPE html PUBLIC > >> > "-//W3C//DTD XHTML 1.0 Transitional//EN" > >> > > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> > >> > <html> > >> > <head> > >> > <meta http-equiv="Content-Type" content="text/html; > >> > charset=utf-8"></meta> > >> > <title>503 Service Unavailable</title> > >> > <style type="text/css"> > >> > #powered_by { > >> > margin-top: 20px; > >> > border-top: 2px solid black; > >> > font-style: italic; > >> > } > >> > > >> > #traceback { > >> > color: red; > >> > } > >> > </style> > >> > </head> > >> > <body> > >> > <h2>503 Service Unavailable</h2> > >> > <p>No MON connection</p> > >> > <pre id="traceback">Traceback (most recent > call last): > >> > File > >> > "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", > line 670, > >> > in respond > >> > response.body = self.handler() > >> > File > >> > "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", > line > >> > 217, in __call__ > >> > self.body = self.oldhandler(*args, **kwargs) > >> > File > >> > "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", > line 61, > >> > in __call__ > >> > return self.callable(*self.args, **self.kwargs) > >> > File "/usr/lib/ceph/mgr/prometheus/module.py", > line 704, in > >> metrics > >> > return self._metrics(instance) > >> > File "/usr/lib/ceph/mgr/prometheus/module.py", > line 721, in > >> _metrics > >> > raise cherrypy.HTTPError(503, 'No MON connection') > >> > HTTPError: (503, 'No MON connection') > >> > </pre> > >> > <div id="powered_by"> > >> > <span> > >> > Powered by <a > href="http://www.cherrypy.org">CherryPy > >> 3.5.0</a> > >> > </span> > >> > </div> > >> > </body> > >> > </html> > >> > > >> > On Fri, Mar 20, 2020 at 6:33 AM Paul Choi > <pchoi(a)nuro.ai <mailto:pchoi@nuro.ai> > >> <mailto:pchoi@nuro.ai <mailto:pchoi@nuro.ai>>> wrote: > >> > > >> >> Hello, > >> >> > >> >> We are running Mimic 13.2.8 with our cluster, and since > >> upgrading to > >> >> 13.2.8 the Prometheus plugin seems to hang a lot. > It used to > >> respond under > >> >> 10s but now it often hangs. Restarting the mgr > processes helps > >> temporarily > >> >> but within minutes it gets stuck again. > >> >> > >> >> The active mgr doesn't exit when doing `systemctl stop > >> ceph-mgr.target" > >> >> and needs to > >> >> be kill -9'ed. > >> >> > >> >> Is there anything I can do to address this issue, > or at least > >> get better > >> >> visibility into the issue? > >> >> > >> >> We only have a few plugins enabled: > >> >> $ ceph mgr module ls > >> >> { > >> >> "enabled_modules": [ > >> >> "balancer", > >> >> "prometheus", > >> >> "zabbix" > >> >> ], > >> >> > >> >> 3 mgr processes, but it's a pretty large cluster > (near 4000 > >> OSDs) and it's > >> >> a busy one with lots of rebalancing. (I don't know > if a busy > >> cluster would > >> >> seriously affect the mgr's performance, but just > throwing it > >> out there) > >> >> > >> >> services: > >> >> mon: 5 daemons, quorum > >> >> woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1 > >> >> mgr: woodenbox2(active), standbys: woodenbox0, > woodenbox1 > >> >> mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 > >> up:standby-replay > >> >> osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs > >> >> rgw: 4 daemons active > >> >> > >> >> Thanks in advance for your help, > >> >> > >> >> -Paul Choi > >> >> > >> > _______________________________________________ > >> > ceph-users mailing list -- ceph-users(a)ceph.io > <mailto:ceph-users@ceph.io> > >> <mailto:ceph-users@ceph.io <mailto:ceph-users@ceph.io>> > >> > To unsubscribe send an email to > ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io> > >> <mailto:ceph-users-leave@ceph.io > <mailto:ceph-users-leave@ceph.io>> > >> > > _______________________________________________ > > ceph-users mailing list -- ceph-users(a)ceph.io > <mailto:ceph-users@ceph.io> > > To unsubscribe send an email to ceph-users-leave(a)ceph.io > <mailto:ceph-users-leave@ceph.io> >

shubjero

3:47 p.m.

I've reported stability problems with ceph-mgr w/ prometheus plugin enabled on all versions we ran in production which were several versions of Luminous and Mimic. Our solution was to disable the prometheus exporter. I am using Zabbix instead. Our cluster is 1404 OSD's in size with about 9PB raw with around 35% utilization. On Fri, Mar 27, 2020 at 4:26 AM Janek Bevendorff <janek.bevendorff(a)uni-weimar.de> wrote: > > Sorry, I meant MGR of course. MDS are fine for me. But the MGRs were > failing constantly due to the prometheus module doing something funny. > > > On 26/03/2020 18:10, Paul Choi wrote: > > I won't speculate more into the MDS's stability, but I do wonder about > > the same thing. > > There is one file served by the MDS that would cause the ceph-fuse > > client to hang. It was a file that many people in the company relied > > on for data updates, so very noticeable. The only fix was to fail over > > the MDS. > > > > Since the free disk space dropped, I haven't heard anyone complain... > > <shrug> > > > > On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff > > <janek.bevendorff(a)uni-weimar.de > > <mailto:janek.bevendorff@uni-weimar.de>> wrote: > > > > If there is actually a connection, then it's no wonder our MDS > > kept crashing. Our Ceph has 9.2PiB of available space at the moment. > > > > > > On 26/03/2020 17:32, Paul Choi wrote: > >> I can't quite explain what happened, but the Prometheus endpoint > >> became stable after the free disk space for the largest pool went > >> substantially lower than 1PB. > >> I wonder if there's some metric that exceeds the maximum size for > >> some int, double, etc? > >> > >> -Paul > >> > >> On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff > >> <janek.bevendorff(a)uni-weimar.de > >> <mailto:janek.bevendorff@uni-weimar.de>> wrote: > >> > >> I haven't seen any MGR hangs so far since I disabled the > >> prometheus > >> module. It seems like the module is not only slow, but kills > >> the whole > >> MGR when the cluster is sufficiently large, so these two > >> issues are most > >> likely connected. The issue has become much, much worse with > >> 14.2.8. > >> > >> > >> On 23/03/2020 09:00, Janek Bevendorff wrote: > >> > I am running the very latest version of Nautilus. I will > >> try setting up > >> > an external exporter today and see if that fixes anything. > >> Our cluster > >> > is somewhat large-ish with 1248 OSDs, so I expect stat > >> collection to > >> > take "some" time, but it definitely shouldn't crush the > >> MGRs all the time. > >> > > >> > On 21/03/2020 02:33, Paul Choi wrote: > >> >> Hi Janek, > >> >> > >> >> What version of Ceph are you using? > >> >> We also have a much smaller cluster running Nautilus, with > >> no MDS. No > >> >> Prometheus issues there. > >> >> I won't speculate further than this but perhaps Nautilus > >> doesn't have > >> >> the same issue as Mimic? > >> >> > >> >> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff > >> >> <janek.bevendorff(a)uni-weimar.de > >> <mailto:janek.bevendorff@uni-weimar.de> > >> >> <mailto:janek.bevendorff@uni-weimar.de > >> <mailto:janek.bevendorff@uni-weimar.de>>> wrote: > >> >> > >> >> I think this is related to my previous post to this > >> list about MGRs > >> >> failing regularly and being overall quite slow to > >> respond. The problem > >> >> has existed before, but the new version has made it > >> way worse. My MGRs > >> >> keep dyring every few hours and need to be restarted. > >> the Promtheus > >> >> plugin works, but it's pretty slow and so is the > >> dashboard. > >> >> Unfortunately, nobody seems to have a solution for > >> this and I > >> >> wonder why > >> >> not more people are complaining about this problem. > >> >> > >> >> > >> >> On 20/03/2020 19:30, Paul Choi wrote: > >> >> > If I "curl http://localhost:9283/metrics" and wait > >> sufficiently long > >> >> > enough, I get this - says "No MON connection". But > >> the mons are > >> >> health and > >> >> > the cluster is functioning fine. > >> >> > That said, the mons' rocksdb sizes are fairly big > >> because > >> >> there's lots of > >> >> > rebalancing going on. The Prometheus endpoint > >> hanging seems to > >> >> happen > >> >> > regardless of the mon size anyhow. > >> >> > > >> >> > mon.woodenbox0 is 41 GiB >= mon_data_size_warn > >> (15 GiB) > >> >> > mon.woodenbox2 is 26 GiB >= mon_data_size_warn > >> (15 GiB) > >> >> > mon.woodenbox4 is 42 GiB >= mon_data_size_warn > >> (15 GiB) > >> >> > mon.woodenbox3 is 43 GiB >= mon_data_size_warn > >> (15 GiB) > >> >> > mon.woodenbox1 is 38 GiB >= mon_data_size_warn > >> (15 GiB) > >> >> > > >> >> > # fg > >> >> > curl -H "Connection: close" > >> http://localhost:9283/metrics > >> >> > <!DOCTYPE html PUBLIC > >> >> > "-//W3C//DTD XHTML 1.0 Transitional//EN" > >> >> > > >> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> > >> >> > <html> > >> >> > <head> > >> >> > <meta http-equiv="Content-Type" content="text/html; > >> >> > charset=utf-8"></meta> > >> >> > <title>503 Service Unavailable</title> > >> >> > <style type="text/css"> > >> >> > #powered_by { > >> >> > margin-top: 20px; > >> >> > border-top: 2px solid black; > >> >> > font-style: italic; > >> >> > } > >> >> > > >> >> > #traceback { > >> >> > color: red; > >> >> > } > >> >> > </style> > >> >> > </head> > >> >> > <body> > >> >> > <h2>503 Service Unavailable</h2> > >> >> > <p>No MON connection</p> > >> >> > <pre id="traceback">Traceback (most recent > >> call last): > >> >> > File > >> >> > >> "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", > >> line 670, > >> >> > in respond > >> >> > response.body = self.handler() > >> >> > File > >> >> > >> "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", > >> line > >> >> > 217, in __call__ > >> >> > self.body = self.oldhandler(*args, **kwargs) > >> >> > File > >> >> > >> "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", > >> line 61, > >> >> > in __call__ > >> >> > return self.callable(*self.args, **self.kwargs) > >> >> > File "/usr/lib/ceph/mgr/prometheus/module.py", > >> line 704, in > >> >> metrics > >> >> > return self._metrics(instance) > >> >> > File "/usr/lib/ceph/mgr/prometheus/module.py", > >> line 721, in > >> >> _metrics > >> >> > raise cherrypy.HTTPError(503, 'No MON connection') > >> >> > HTTPError: (503, 'No MON connection') > >> >> > </pre> > >> >> > <div id="powered_by"> > >> >> > <span> > >> >> > Powered by <a > >> href="http://www.cherrypy.org">CherryPy > >> >> 3.5.0</a> > >> >> > </span> > >> >> > </div> > >> >> > </body> > >> >> > </html> > >> >> > > >> >> > On Fri, Mar 20, 2020 at 6:33 AM Paul Choi > >> <pchoi(a)nuro.ai <mailto:pchoi@nuro.ai> > >> >> <mailto:pchoi@nuro.ai <mailto:pchoi@nuro.ai>>> wrote: > >> >> > > >> >> >> Hello, > >> >> >> > >> >> >> We are running Mimic 13.2.8 with our cluster, and since > >> >> upgrading to > >> >> >> 13.2.8 the Prometheus plugin seems to hang a lot. > >> It used to > >> >> respond under > >> >> >> 10s but now it often hangs. Restarting the mgr > >> processes helps > >> >> temporarily > >> >> >> but within minutes it gets stuck again. > >> >> >> > >> >> >> The active mgr doesn't exit when doing `systemctl stop > >> >> ceph-mgr.target" > >> >> >> and needs to > >> >> >> be kill -9'ed. > >> >> >> > >> >> >> Is there anything I can do to address this issue, > >> or at least > >> >> get better > >> >> >> visibility into the issue? > >> >> >> > >> >> >> We only have a few plugins enabled: > >> >> >> $ ceph mgr module ls > >> >> >> { > >> >> >> "enabled_modules": [ > >> >> >> "balancer", > >> >> >> "prometheus", > >> >> >> "zabbix" > >> >> >> ], > >> >> >> > >> >> >> 3 mgr processes, but it's a pretty large cluster > >> (near 4000 > >> >> OSDs) and it's > >> >> >> a busy one with lots of rebalancing. (I don't know > >> if a busy > >> >> cluster would > >> >> >> seriously affect the mgr's performance, but just > >> throwing it > >> >> out there) > >> >> >> > >> >> >> services: > >> >> >> mon: 5 daemons, quorum > >> >> >> woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1 > >> >> >> mgr: woodenbox2(active), standbys: woodenbox0, > >> woodenbox1 > >> >> >> mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 > >> >> up:standby-replay > >> >> >> osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs > >> >> >> rgw: 4 daemons active > >> >> >> > >> >> >> Thanks in advance for your help, > >> >> >> > >> >> >> -Paul Choi > >> >> >> > >> >> > _______________________________________________ > >> >> > ceph-users mailing list -- ceph-users(a)ceph.io > >> <mailto:ceph-users@ceph.io> > >> >> <mailto:ceph-users@ceph.io <mailto:ceph-users@ceph.io>> > >> >> > To unsubscribe send an email to > >> ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io> > >> >> <mailto:ceph-users-leave@ceph.io > >> <mailto:ceph-users-leave@ceph.io>> > >> >> > >> > _______________________________________________ > >> > ceph-users mailing list -- ceph-users(a)ceph.io > >> <mailto:ceph-users@ceph.io> > >> > To unsubscribe send an email to ceph-users-leave(a)ceph.io > >> <mailto:ceph-users-leave@ceph.io> > >> > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io

Jarett DeAngelis

3:51 p.m.

I’m actually very curious how well this is performing for you as I’ve definitely not seen a deployment this large. How do you use it?

...

On Mar 27, 2020, at 11:47 AM, shubjero <shubjero(a)gmail.com> wrote: I've reported stability problems with ceph-mgr w/ prometheus plugin enabled on all versions we ran in production which were several versions of Luminous and Mimic. Our solution was to disable the prometheus exporter. I am using Zabbix instead. Our cluster is 1404 OSD's in size with about 9PB raw with around 35% utilization. On Fri, Mar 27, 2020 at 4:26 AM Janek Bevendorff <janek.bevendorff(a)uni-weimar.de> wrote:

Sorry, I meant MGR of course. MDS are fine for me. But the MGRs were failing constantly due to the prometheus module doing something funny. On 26/03/2020 18:10, Paul Choi wrote:

I won't speculate more into the MDS's stability, but I do wonder about the same thing. There is one file served by the MDS that would cause the ceph-fuse client to hang. It was a file that many people in the company relied on for data updates, so very noticeable. The only fix was to fail over the MDS. Since the free disk space dropped, I haven't heard anyone complain... <shrug> On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff <janek.bevendorff(a)uni-weimar.de <mailto:janek.bevendorff@uni-weimar.de>> wrote: If there is actually a connection, then it's no wonder our MDS kept crashing. Our Ceph has 9.2PiB of available space at the moment. On 26/03/2020 17:32, Paul Choi wrote: > I can't quite explain what happened, but the Prometheus endpoint > became stable after the free disk space for the largest pool went > substantially lower than 1PB. > I wonder if there's some metric that exceeds the maximum size for > some int, double, etc? > > -Paul > > On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff > <janek.bevendorff(a)uni-weimar.de > <mailto:janek.bevendorff@uni-weimar.de>> wrote: > > I haven't seen any MGR hangs so far since I disabled the > prometheus > module. It seems like the module is not only slow, but kills > the whole > MGR when the cluster is sufficiently large, so these two > issues are most > likely connected. The issue has become much, much worse with > 14.2.8. > > > On 23/03/2020 09:00, Janek Bevendorff wrote: >> I am running the very latest version of Nautilus. I will > try setting up >> an external exporter today and see if that fixes anything. > Our cluster >> is somewhat large-ish with 1248 OSDs, so I expect stat > collection to >> take "some" time, but it definitely shouldn't crush the > MGRs all the time. >> >> On 21/03/2020 02:33, Paul Choi wrote: >>> Hi Janek, >>> >>> What version of Ceph are you using? >>> We also have a much smaller cluster running Nautilus, with > no MDS. No >>> Prometheus issues there. >>> I won't speculate further than this but perhaps Nautilus > doesn't have >>> the same issue as Mimic? >>> >>> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff >>> <janek.bevendorff(a)uni-weimar.de > <mailto:janek.bevendorff@uni-weimar.de> >>> <mailto:janek.bevendorff@uni-weimar.de > <mailto:janek.bevendorff@uni-weimar.de>>> wrote: >>> >>> I think this is related to my previous post to this > list about MGRs >>> failing regularly and being overall quite slow to > respond. The problem >>> has existed before, but the new version has made it > way worse. My MGRs >>> keep dyring every few hours and need to be restarted. > the Promtheus >>> plugin works, but it's pretty slow and so is the > dashboard. >>> Unfortunately, nobody seems to have a solution for > this and I >>> wonder why >>> not more people are complaining about this problem. >>> >>> >>> On 20/03/2020 19:30, Paul Choi wrote: >>>> If I "curl http://localhost:9283/metrics" and wait > sufficiently long >>>> enough, I get this - says "No MON connection". But > the mons are >>> health and >>>> the cluster is functioning fine. >>>> That said, the mons' rocksdb sizes are fairly big > because >>> there's lots of >>>> rebalancing going on. The Prometheus endpoint > hanging seems to >>> happen >>>> regardless of the mon size anyhow. >>>> >>>> mon.woodenbox0 is 41 GiB >= mon_data_size_warn > (15 GiB) >>>> mon.woodenbox2 is 26 GiB >= mon_data_size_warn > (15 GiB) >>>> mon.woodenbox4 is 42 GiB >= mon_data_size_warn > (15 GiB) >>>> mon.woodenbox3 is 43 GiB >= mon_data_size_warn > (15 GiB) >>>> mon.woodenbox1 is 38 GiB >= mon_data_size_warn > (15 GiB) >>>> >>>> # fg >>>> curl -H "Connection: close" > http://localhost:9283/metrics >>>> <!DOCTYPE html PUBLIC >>>> "-//W3C//DTD XHTML 1.0 Transitional//EN" >>>> > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> >>>> <html> >>>> <head> >>>> <meta http-equiv="Content-Type" content="text/html; >>>> charset=utf-8"></meta> >>>> <title>503 Service Unavailable</title> >>>> <style type="text/css"> >>>> #powered_by { >>>> margin-top: 20px; >>>> border-top: 2px solid black; >>>> font-style: italic; >>>> } >>>> >>>> #traceback { >>>> color: red; >>>> } >>>> </style> >>>> </head> >>>> <body> >>>> <h2>503 Service Unavailable</h2> >>>> <p>No MON connection</p> >>>> <pre id="traceback">Traceback (most recent > call last): >>>> File >>> > "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", > line 670, >>>> in respond >>>> response.body = self.handler() >>>> File >>> > "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", > line >>>> 217, in __call__ >>>> self.body = self.oldhandler(*args, **kwargs) >>>> File >>> > "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", > line 61, >>>> in __call__ >>>> return self.callable(*self.args, **self.kwargs) >>>> File "/usr/lib/ceph/mgr/prometheus/module.py", > line 704, in >>> metrics >>>> return self._metrics(instance) >>>> File "/usr/lib/ceph/mgr/prometheus/module.py", > line 721, in >>> _metrics >>>> raise cherrypy.HTTPError(503, 'No MON connection') >>>> HTTPError: (503, 'No MON connection') >>>> </pre> >>>> <div id="powered_by"> >>>> <span> >>>> Powered by <a > href="http://www.cherrypy.org">CherryPy >>> 3.5.0</a> >>>> </span> >>>> </div> >>>> </body> >>>> </html> >>>> >>>> On Fri, Mar 20, 2020 at 6:33 AM Paul Choi > <pchoi(a)nuro.ai <mailto:pchoi@nuro.ai> >>> <mailto:pchoi@nuro.ai <mailto:pchoi@nuro.ai>>> wrote: >>>> >>>>> Hello, >>>>> >>>>> We are running Mimic 13.2.8 with our cluster, and since >>> upgrading to >>>>> 13.2.8 the Prometheus plugin seems to hang a lot. > It used to >>> respond under >>>>> 10s but now it often hangs. Restarting the mgr > processes helps >>> temporarily >>>>> but within minutes it gets stuck again. >>>>> >>>>> The active mgr doesn't exit when doing `systemctl stop >>> ceph-mgr.target" >>>>> and needs to >>>>> be kill -9'ed. >>>>> >>>>> Is there anything I can do to address this issue, > or at least >>> get better >>>>> visibility into the issue? >>>>> >>>>> We only have a few plugins enabled: >>>>> $ ceph mgr module ls >>>>> { >>>>> "enabled_modules": [ >>>>> "balancer", >>>>> "prometheus", >>>>> "zabbix" >>>>> ], >>>>> >>>>> 3 mgr processes, but it's a pretty large cluster > (near 4000 >>> OSDs) and it's >>>>> a busy one with lots of rebalancing. (I don't know > if a busy >>> cluster would >>>>> seriously affect the mgr's performance, but just > throwing it >>> out there) >>>>> >>>>> services: >>>>> mon: 5 daemons, quorum >>>>> woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1 >>>>> mgr: woodenbox2(active), standbys: woodenbox0, > woodenbox1 >>>>> mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 >>> up:standby-replay >>>>> osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs >>>>> rgw: 4 daemons active >>>>> >>>>> Thanks in advance for your help, >>>>> >>>>> -Paul Choi >>>>> >>>> _______________________________________________ >>>> ceph-users mailing list -- ceph-users(a)ceph.io > <mailto:ceph-users@ceph.io> >>> <mailto:ceph-users@ceph.io <mailto:ceph-users@ceph.io>> >>>> To unsubscribe send an email to > ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io> >>> <mailto:ceph-users-leave@ceph.io > <mailto:ceph-users-leave@ceph.io>> >>> >> _______________________________________________ >> ceph-users mailing list -- ceph-users(a)ceph.io > <mailto:ceph-users@ceph.io> >> To unsubscribe send an email to ceph-users-leave(a)ceph.io > <mailto:ceph-users-leave@ceph.io> >

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Janek Bevendorff

1 Apr 1 Apr

9:30 a.m.

...

I’m actually very curious how well this is performing for you as I’ve definitely not seen a deployment this large. How do you use it?

What exactly do you mean? Our cluster has 11PiB capacity of which about 15% are used at the moment (web-scale corpora and such). We have deployed 5 MONs and 5MGRs (both on the same hosts) and it works totally fine overall. We have some MDS performance issues here and there, but that's not too bad anymore after a few upstream patches and then we have this annoying Prometheus MGR problem, which kills our MGRs reliably after a few hours. > >> On Mar 27, 2020, at 11:47 AM, shubjero <shubjero(a)gmail.com> wrote: >> >> I've reported stability problems with ceph-mgr w/ prometheus plugin >> enabled on all versions we ran in production which were several >> versions of Luminous and Mimic. Our solution was to disable the >> prometheus exporter. I am using Zabbix instead. Our cluster is 1404 >> OSD's in size with about 9PB raw with around 35% utilization. >> >> On Fri, Mar 27, 2020 at 4:26 AM Janek Bevendorff >> <janek.bevendorff(a)uni-weimar.de> wrote: >>> Sorry, I meant MGR of course. MDS are fine for me. But the MGRs were >>> failing constantly due to the prometheus module doing something funny. >>> >>> >>> On 26/03/2020 18:10, Paul Choi wrote: >>>> I won't speculate more into the MDS's stability, but I do wonder about >>>> the same thing. >>>> There is one file served by the MDS that would cause the ceph-fuse >>>> client to hang. It was a file that many people in the company relied >>>> on for data updates, so very noticeable. The only fix was to fail over >>>> the MDS. >>>> >>>> Since the free disk space dropped, I haven't heard anyone complain... >>>> <shrug> >>>> >>>> On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff >>>> <janek.bevendorff(a)uni-weimar.de >>>> <mailto:janek.bevendorff@uni-weimar.de>> wrote: >>>> >>>> If there is actually a connection, then it's no wonder our MDS >>>> kept crashing. Our Ceph has 9.2PiB of available space at the moment. >>>> >>>> >>>> On 26/03/2020 17:32, Paul Choi wrote: >>>>> I can't quite explain what happened, but the Prometheus endpoint >>>>> became stable after the free disk space for the largest pool went >>>>> substantially lower than 1PB. >>>>> I wonder if there's some metric that exceeds the maximum size for >>>>> some int, double, etc? >>>>> >>>>> -Paul >>>>> >>>>> On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff >>>>> <janek.bevendorff(a)uni-weimar.de >>>>> <mailto:janek.bevendorff@uni-weimar.de>> wrote: >>>>> >>>>> I haven't seen any MGR hangs so far since I disabled the >>>>> prometheus >>>>> module. It seems like the module is not only slow, but kills >>>>> the whole >>>>> MGR when the cluster is sufficiently large, so these two >>>>> issues are most >>>>> likely connected. The issue has become much, much worse with >>>>> 14.2.8. >>>>> >>>>> >>>>> On 23/03/2020 09:00, Janek Bevendorff wrote: >>>>>> I am running the very latest version of Nautilus. I will >>>>> try setting up >>>>>> an external exporter today and see if that fixes anything. >>>>> Our cluster >>>>>> is somewhat large-ish with 1248 OSDs, so I expect stat >>>>> collection to >>>>>> take "some" time, but it definitely shouldn't crush the >>>>> MGRs all the time. >>>>>> On 21/03/2020 02:33, Paul Choi wrote: >>>>>>> Hi Janek, >>>>>>> >>>>>>> What version of Ceph are you using? >>>>>>> We also have a much smaller cluster running Nautilus, with >>>>> no MDS. No >>>>>>> Prometheus issues there. >>>>>>> I won't speculate further than this but perhaps Nautilus >>>>> doesn't have >>>>>>> the same issue as Mimic? >>>>>>> >>>>>>> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff >>>>>>> <janek.bevendorff(a)uni-weimar.de >>>>> <mailto:janek.bevendorff@uni-weimar.de> >>>>>>> <mailto:janek.bevendorff@uni-weimar.de >>>>> <mailto:janek.bevendorff@uni-weimar.de>>> wrote: >>>>>>> I think this is related to my previous post to this >>>>> list about MGRs >>>>>>> failing regularly and being overall quite slow to >>>>> respond. The problem >>>>>>> has existed before, but the new version has made it >>>>> way worse. My MGRs >>>>>>> keep dyring every few hours and need to be restarted. >>>>> the Promtheus >>>>>>> plugin works, but it's pretty slow and so is the >>>>> dashboard. >>>>>>> Unfortunately, nobody seems to have a solution for >>>>> this and I >>>>>>> wonder why >>>>>>> not more people are complaining about this problem. >>>>>>> >>>>>>> >>>>>>> On 20/03/2020 19:30, Paul Choi wrote: >>>>>>>> If I "curl http://localhost:9283/metrics" and wait >>>>> sufficiently long >>>>>>>> enough, I get this - says "No MON connection". But >>>>> the mons are >>>>>>> health and >>>>>>>> the cluster is functioning fine. >>>>>>>> That said, the mons' rocksdb sizes are fairly big >>>>> because >>>>>>> there's lots of >>>>>>>> rebalancing going on. The Prometheus endpoint >>>>> hanging seems to >>>>>>> happen >>>>>>>> regardless of the mon size anyhow. >>>>>>>> >>>>>>>> mon.woodenbox0 is 41 GiB >= mon_data_size_warn >>>>> (15 GiB) >>>>>>>> mon.woodenbox2 is 26 GiB >= mon_data_size_warn >>>>> (15 GiB) >>>>>>>> mon.woodenbox4 is 42 GiB >= mon_data_size_warn >>>>> (15 GiB) >>>>>>>> mon.woodenbox3 is 43 GiB >= mon_data_size_warn >>>>> (15 GiB) >>>>>>>> mon.woodenbox1 is 38 GiB >= mon_data_size_warn >>>>> (15 GiB) >>>>>>>> # fg >>>>>>>> curl -H "Connection: close" >>>>> http://localhost:9283/metrics >>>>>>>> <!DOCTYPE html PUBLIC >>>>>>>> "-//W3C//DTD XHTML 1.0 Transitional//EN" >>>>>>>> >>>>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> >>>>>>>> <html> >>>>>>>> <head> >>>>>>>> <meta http-equiv="Content-Type" content="text/html; >>>>>>>> charset=utf-8"></meta> >>>>>>>> <title>503 Service Unavailable</title> >>>>>>>> <style type="text/css"> >>>>>>>> #powered_by { >>>>>>>> margin-top: 20px; >>>>>>>> border-top: 2px solid black; >>>>>>>> font-style: italic; >>>>>>>> } >>>>>>>> >>>>>>>> #traceback { >>>>>>>> color: red; >>>>>>>> } >>>>>>>> </style> >>>>>>>> </head> >>>>>>>> <body> >>>>>>>> <h2>503 Service Unavailable</h2> >>>>>>>> <p>No MON connection</p> >>>>>>>> <pre id="traceback">Traceback (most recent >>>>> call last): >>>>>>>> File >>>>> "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", >>>>> line 670, >>>>>>>> in respond >>>>>>>> response.body = self.handler() >>>>>>>> File >>>>> "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", >>>>> line >>>>>>>> 217, in __call__ >>>>>>>> self.body = self.oldhandler(*args, **kwargs) >>>>>>>> File >>>>> "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", >>>>> line 61, >>>>>>>> in __call__ >>>>>>>> return self.callable(*self.args, **self.kwargs) >>>>>>>> File "/usr/lib/ceph/mgr/prometheus/module.py", >>>>> line 704, in >>>>>>> metrics >>>>>>>> return self._metrics(instance) >>>>>>>> File "/usr/lib/ceph/mgr/prometheus/module.py", >>>>> line 721, in >>>>>>> _metrics >>>>>>>> raise cherrypy.HTTPError(503, 'No MON connection') >>>>>>>> HTTPError: (503, 'No MON connection') >>>>>>>> </pre> >>>>>>>> <div id="powered_by"> >>>>>>>> <span> >>>>>>>> Powered by <a >>>>> href="http://www.cherrypy.org">CherryPy >>>>>>> 3.5.0</a> >>>>>>>> </span> >>>>>>>> </div> >>>>>>>> </body> >>>>>>>> </html> >>>>>>>> >>>>>>>> On Fri, Mar 20, 2020 at 6:33 AM Paul Choi >>>>> <pchoi(a)nuro.ai <mailto:pchoi@nuro.ai> >>>>>>> <mailto:pchoi@nuro.ai <mailto:pchoi@nuro.ai>>> wrote: >>>>>>>>> Hello, >>>>>>>>> >>>>>>>>> We are running Mimic 13.2.8 with our cluster, and since >>>>>>> upgrading to >>>>>>>>> 13.2.8 the Prometheus plugin seems to hang a lot. >>>>> It used to >>>>>>> respond under >>>>>>>>> 10s but now it often hangs. Restarting the mgr >>>>> processes helps >>>>>>> temporarily >>>>>>>>> but within minutes it gets stuck again. >>>>>>>>> >>>>>>>>> The active mgr doesn't exit when doing `systemctl stop >>>>>>> ceph-mgr.target" >>>>>>>>> and needs to >>>>>>>>> be kill -9'ed. >>>>>>>>> >>>>>>>>> Is there anything I can do to address this issue, >>>>> or at least >>>>>>> get better >>>>>>>>> visibility into the issue? >>>>>>>>> >>>>>>>>> We only have a few plugins enabled: >>>>>>>>> $ ceph mgr module ls >>>>>>>>> { >>>>>>>>> "enabled_modules": [ >>>>>>>>> "balancer", >>>>>>>>> "prometheus", >>>>>>>>> "zabbix" >>>>>>>>> ], >>>>>>>>> >>>>>>>>> 3 mgr processes, but it's a pretty large cluster >>>>> (near 4000 >>>>>>> OSDs) and it's >>>>>>>>> a busy one with lots of rebalancing. (I don't know >>>>> if a busy >>>>>>> cluster would >>>>>>>>> seriously affect the mgr's performance, but just >>>>> throwing it >>>>>>> out there) >>>>>>>>> services: >>>>>>>>> mon: 5 daemons, quorum >>>>>>>>> woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1 >>>>>>>>> mgr: woodenbox2(active), standbys: woodenbox0, >>>>> woodenbox1 >>>>>>>>> mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 >>>>>>> up:standby-replay >>>>>>>>> osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs >>>>>>>>> rgw: 4 daemons active >>>>>>>>> >>>>>>>>> Thanks in advance for your help, >>>>>>>>> >>>>>>>>> -Paul Choi >>>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io >>>>> <mailto:ceph-users@ceph.io> >>>>>>> <mailto:ceph-users@ceph.io <mailto:ceph-users@ceph.io>> >>>>>>>> To unsubscribe send an email to >>>>> ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io> >>>>>>> <mailto:ceph-users-leave@ceph.io >>>>> <mailto:ceph-users-leave@ceph.io>> >>>>>> _______________________________________________ >>>>>> ceph-users mailing list -- ceph-users(a)ceph.io >>>>> <mailto:ceph-users@ceph.io> >>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >>>>> <mailto:ceph-users-leave@ceph.io> >>>>> >>> _______________________________________________ >>> ceph-users mailing list -- ceph-users(a)ceph.io >>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >> _______________________________________________ >> ceph-users mailing list -- ceph-users(a)ceph.io >> To unsubscribe send an email to ceph-users-leave(a)ceph.io

1485

days inactive

1497

days old

ceph-users@ceph.io

Manage subscription

13 comments

4 participants

tags (0)

participants (4)

Janek Bevendorff
Jarett DeAngelis
Paul Choi
shubjero