On Nov 2, 2019, at 11:24 AM, Oliver Freyermuth <freyermuth(a)physik.uni-bonn.de> wrote:
Hi Thomas,
indeed, I also had the dashboard open at these times - but right now, after disabling device health metrics, I cannot retrigger it even when playing wildly on the dashboard.
So I'll now re-enable health metrics and try to retrigger the issue with cranked-up debug levels, as Sage suggested.
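For reference, what I plan to use to crank up the logging (the exact knobs and levels are just my guess at what will be useful here):

   ceph config set mgr debug_mgr 20   # example verbosity, adjust as needed
   ceph config set mgr debug_ms 1

and "ceph config rm mgr debug_mgr" / "ceph config rm mgr debug_ms" to revert afterwards.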
Maybe in your case, if you can tolerate mgr failures, it would also be interesting to do the same, to help get the dashboard issue debugged?
Cheers,
Oliver
On 02.11.19 at 08:23, Thomas wrote:
Hi Oliver,
I experienced a situation where the MGR "went crazy", meaning the MGR was active but not working.
In the logs of the standby MGR nodes, I found an error (after restarting the service) that pointed to the Ceph Dashboard.
Since disabling the dashboard, my MGRs are stable again.
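For reference - assuming the standard mgr module commands are what was used here - disabling and later re-enabling the dashboard would be:

   ceph mgr module disable dashboard
   ceph mgr module enable dashboard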
Regards
Thomas
On 02.11.2019 at 02:48, Oliver Freyermuth wrote:
Dear Cephers,
interestingly, after:
ceph device monitoring off
the mgrs seem to be stable now. The active one still went silent a few minutes later, but the standby took over and was stable, and after restarting the broken one, it has now been stable for an hour, too.
So probably a restart of the mgr is needed after disabling device monitoring to get things stable again.
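For reference, the restart can presumably be done per host like this (assuming a systemd-based deployment; the unit name depends on how the daemons were set up):

   systemctl restart ceph-mgr@$(hostname -s)

or, to just force a failover to a standby:

   ceph mgr fail <mgr-name>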
So it seems to be caused by a problem with the device health metrics. In case this is a red herring and the mgrs become unstable again in the next few days,
I'll let you know.
Cheers,
Oliver
On 01.11.19 at 23:09, Oliver Freyermuth wrote:
> Dear Cephers,
>
> this is a 14.2.4 cluster with device health metrics enabled - for about a day now, all mgr daemons have been going "silent" on me after a few hours, i.e. "ceph -s" shows:
>
>   cluster:
>     id:     269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
>     health: HEALTH_WARN
>             no active mgr
>             1/3 mons down, quorum mon001,mon002
>   services:
>     mon: 3 daemons, quorum mon001,mon002 (age 57m), out of quorum: mon003
>     mgr: no daemons active (since 56m)
>     ...
> (the third mon has a planned outage and will come back in a few days)
>
> Checking the logs of the mgr daemons, I find some "reset" messages at the time when it goes "silent", first for the first mgr:
>
> 2019-11-01 21:34:40.286 7f2df6a6b700 0 log_channel(cluster) log [DBG] : pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
> 2019-11-01 21:34:41.458 7f2e0d59b700 0 client.0 ms_handle_reset on v2:10.160.16.1:6800/401248
> 2019-11-01 21:34:42.287 7f2df6a6b700 0 log_channel(cluster) log [DBG] : pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>
> and a bit later, on the standby mgr:
>
> 2019-11-01 22:18:14.892 7f7bcc8ae700 0 log_channel(cluster) log [DBG] : pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
> 2019-11-01 22:18:16.022 7f7be9e72700 0 client.0 ms_handle_reset on v2:10.160.16.2:6800/352196
> 2019-11-01 22:18:16.893 7f7bcc8ae700 0 log_channel(cluster) log [DBG] : pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>
> Interestingly, the dashboard still works, but presents outdated information - for example, it shows zero I/O going on.
> I believe this started to happen mainly after the third mon went into the known downtime, but I am not fully sure whether this was the trigger, since the cluster is still growing.
> It may also have been the addition of 24 more OSDs.
>
>
> I also find other messages in the mgr logs which seem problematic, but I am not sure they are related:
> ------------------------------
> 2019-11-01 21:17:09.849 7f2df4266700 0 mgr[devicehealth] Error reading OMAP: [errno 22] Failed to operate read op for oid
> Traceback (most recent call last):
>   File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in put_device_metrics
>     ioctx.operate_read_op(op, devid)
>   File "rados.pyx", line 516, in rados.requires.wrapper.validate_func (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
>   File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
> InvalidArgumentError: [errno 22] Failed to operate read op for oid
> ------------------------------
> or:
> ------------------------------
> 2019-11-01 21:33:53.977 7f7bd38bc700 0 mgr[devicehealth] Fail to parse JSON result from daemon osd.51 ()
> 2019-11-01 21:33:53.978 7f7bd38bc700 0 mgr[devicehealth] Fail to parse JSON result from daemon osd.52 ()
> 2019-11-01 21:33:53.979 7f7bd38bc700 0 mgr[devicehealth] Fail to parse JSON result from daemon osd.53 ()
> ------------------------------
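> For reference - assuming the devicehealth commands behave as I expect - the raw SMART scrape for one of these OSDs can be checked directly with e.g.
>
>    ceph device query-daemon-health-metrics osd.51
>
> to see what the OSD actually returns and why the JSON parsing might fail.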
>
> The reason why I am cautious about the health metrics is that I observed a crash when trying to query them:
> ------------------------------
> 2019-11-01 20:21:23.661 7fa46314a700 0 log_channel(audit) log [DBG] : from='client.174136 -' entity='client.admin' cmd=[{"prefix": "device get-health-metrics", "devid": "osd.11", "target": ["mgr", ""]}]: dispatch
> 2019-11-01 20:21:23.661 7fa46394b700 0 mgr[devicehealth] handle_command
> 2019-11-01 20:21:23.663 7fa46394b700 -1 *** Caught signal (Segmentation fault) **
> in thread 7fa46394b700 thread_name:mgr-fin
>
> ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
> 1: (()+0xf5f0) [0x7fa488cee5f0]
> 2: (PyEval_EvalFrameEx()+0x1a9) [0x7fa48aeb50f9]
> 3: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
> 4: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
> 5: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
> 6: (PyEval_EvalCodeEx()+0x7ed) [0x7fa48aebe08d]
> 7: (()+0x709c8) [0x7fa48ae479c8]
> 8: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
> 9: (()+0x5aaa5) [0x7fa48ae31aa5]
> 10: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
> 11: (()+0x4bb95) [0x7fa48ae22b95]
> 12: (PyObject_CallMethod()+0xbb) [0x7fa48ae22ecb]
> 13: (ActivePyModule::handle_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::buffer::v14_2_0::list const&, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*)+0x20e) [0x55c3c1fefc5e]
> 14: (()+0x16c23d) [0x55c3c204023d]
> 15: (FunctionContext::finish(int)+0x2c) [0x55c3c2001eac]
> 16: (Context::complete(int)+0x9) [0x55c3c1ffe659]
> 17: (Finisher::finisher_thread_entry()+0x156) [0x7fa48b439cc6]
> 18: (()+0x7e65) [0x7fa488ce6e65]
> 19: (clone()+0x6d) [0x7fa48799488d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> ------------------------------
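> For reference, the query that triggered this (as seen in the audit line above) was the CLI equivalent of
>
>    ceph device get-health-metrics osd.11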
>
> I have issued:
> ceph device monitoring off
> for now and will keep waiting to see if the mgrs go silent again. If there are any better ideas, or if this issue is already known, please let me know.
>
> Cheers,
> Oliver
>
>
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io