<https://tracker.ceph.com/issues/39264#change-149689>
On 02/11/2019 17:34, Oliver Freyermuth wrote:
Dear Reed,
yes, the balancer is also on for me - but the instabilities vanished as soon as I turned
off device health metrics.
Cheers,
Oliver
Am 02.11.19 um 17:31 schrieb Reed Dier:
Do you also have the balancer module on?
I experienced extremely bad stability issues where the MGRs would silently die with the
balancer module on.
And by on, I mean `active: true` by way of `ceph balancer on`.
Once I disabled the automatic balancer, it seemed to become much more stable.
I can still manually run the balancer without issues (except for one pool), but the
balancer is what appeared to be my big driver of instability.
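For reference, a rough sketch of the commands in question - checking and disabling the automatic balancer, then driving it manually instead. The plan name `myplan` is just an example:

```shell
# Check whether the automatic balancer is active ("active": true)
ceph balancer status

# Turn the automatic balancer off
ceph balancer off

# Run the balancer manually instead: score the current distribution,
# build a plan, inspect and score it, and execute it if it looks sane
ceph balancer eval
ceph balancer optimize myplan
ceph balancer show myplan
ceph balancer eval myplan
ceph balancer execute myplan
```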
Reed
> On Nov 2, 2019, at 11:24 AM, Oliver Freyermuth <freyermuth(a)physik.uni-bonn.de> wrote:
>
> Hi Thomas,
>
> indeed, I also had the dashboard open at these times - but right now, after disabling device health metrics,
> I cannot retrigger it even when playing wildly on the dashboard.
>
> So I'll now re-enable health metrics and try to retrigger the issue with cranked-up debug levels as Sage suggested.
> Maybe in your case, if you can stand mgr failures, this would also be interesting in order to get the dashboard issue debugged?
>
> Cheers,
> Oliver
>
> Am 02.11.19 um 08:23 schrieb Thomas:
>> Hi Oliver,
>>
>> I experienced a situation where the MGR "went crazy", i.e. it was active but not working.
>> In the logs of the standby MGR nodes I found an error (after restarting the service) that pointed to the Ceph Dashboard.
>>
>> Since disabling the dashboard, my MGRs have been stable again.
>>
>> Regards
>> Thomas
>>
>> Am 02.11.2019 um 02:48 schrieb Oliver Freyermuth:
>>> Dear Cephers,
>>>
>>> interestingly, after:
>>> ceph device monitoring off
>>> the mgrs seem to be stable now. The active one still went silent a few minutes later, but the standby took over and was stable, and after restarting the broken one, it has now been stable for an hour, too.
>>> So a restart of the mgr is probably needed after disabling device monitoring to get things stable again.
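In other words, the sequence that got things stable here was roughly the following; the systemd unit name is an example (using one of the mon hostnames from below) and depends on your deployment:

```shell
# Turn off device health metrics collection
ceph device monitoring off

# Restart the mgr daemon(s) so the change fully takes effect;
# the instance name (here mon001) depends on your hosts
systemctl restart ceph-mgr@mon001
```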
>>>
>>> So it seems to be caused by a problem with the device health metrics. In case this is a red herring and the mgrs become unstable again in the next days, I'll let you know.
>>>
>>> Cheers,
>>> Oliver
>>>
>>> Am 01.11.19 um 23:09 schrieb Oliver Freyermuth:
>>>> Dear Cephers,
>>>>
this is a 14.2.4 cluster with device health metrics enabled - for about a day now, all mgr daemons have been going "silent" on me after a few hours, i.e. "ceph -s" shows:
>>>>
>>>>   cluster:
>>>>     id:     269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
>>>>     health: HEALTH_WARN
>>>>             no active mgr
>>>>             1/3 mons down, quorum mon001,mon002
>>>>
>>>>   services:
>>>>     mon: 3 daemons, quorum mon001,mon002 (age 57m), out of quorum: mon003
>>>>     mgr: no daemons active (since 56m)
>>>>   ...
>>>> (the third mon has a planned outage and will come back in a few days)
>>>>
>>>> Checking the logs of the mgr daemons, I find some "reset" messages at the time when a mgr goes "silent", first for the first mgr:
>>>>
>>>> 2019-11-01 21:34:40.286 7f2df6a6b700 0 log_channel(cluster) log [DBG] : pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>> 2019-11-01 21:34:41.458 7f2e0d59b700 0 client.0 ms_handle_reset on v2:10.160.16.1:6800/401248
>>>> 2019-11-01 21:34:42.287 7f2df6a6b700 0 log_channel(cluster) log [DBG] : pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>>
>>>> and a bit later, on the standby mgr:
>>>>
>>>> 2019-11-01 22:18:14.892 7f7bcc8ae700 0 log_channel(cluster) log [DBG] : pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>> 2019-11-01 22:18:16.022 7f7be9e72700 0 client.0 ms_handle_reset on v2:10.160.16.2:6800/352196
>>>> 2019-11-01 22:18:16.893 7f7bcc8ae700 0 log_channel(cluster) log [DBG] : pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>>
>>>> Interestingly, the dashboard still works, but presents outdated information, for example showing zero I/O going on.
>>>> I believe this started to happen mainly after the third mon went into the known downtime, but I am not fully sure if this was the trigger, since the cluster is still growing.
>>>> It may also have been the addition of 24 more OSDs.
>>>>
>>>>
>>>> I also find other messages in the mgr logs which seem problematic, but I am not sure they are related:
>>>> ------------------------------
>>>> 2019-11-01 21:17:09.849 7f2df4266700 0 mgr[devicehealth] Error reading OMAP: [errno 22] Failed to operate read op for oid
>>>> Traceback (most recent call last):
>>>>   File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in put_device_metrics
>>>>     ioctx.operate_read_op(op, devid)
>>>>   File "rados.pyx", line 516, in rados.requires.wrapper.validate_func (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
>>>>   File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
>>>> InvalidArgumentError: [errno 22] Failed to operate read op for oid
>>>> ------------------------------
>>>> or:
>>>> ------------------------------
>>>> 2019-11-01 21:33:53.977 7f7bd38bc700 0 mgr[devicehealth] Fail to parse JSON result from daemon osd.51 ()
>>>> 2019-11-01 21:33:53.978 7f7bd38bc700 0 mgr[devicehealth] Fail to parse JSON result from daemon osd.52 ()
>>>> 2019-11-01 21:33:53.979 7f7bd38bc700 0 mgr[devicehealth] Fail to parse JSON result from daemon osd.53 ()
>>>> ------------------------------
>>>>
>>>> The reason why I am cautious about the health metrics is that I observed a crash when trying to query them:
>>>> ------------------------------
>>>> 2019-11-01 20:21:23.661 7fa46314a700 0 log_channel(audit) log [DBG] : from='client.174136 -' entity='client.admin' cmd=[{"prefix": "device get-health-metrics", "devid": "osd.11", "target": ["mgr", ""]}]: dispatch
>>>> 2019-11-01 20:21:23.661 7fa46394b700 0 mgr[devicehealth] handle_command
>>>> 2019-11-01 20:21:23.663 7fa46394b700 -1 *** Caught signal (Segmentation fault) **
>>>> in thread 7fa46394b700 thread_name:mgr-fin
>>>>
>>>> ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
>>>> 1: (()+0xf5f0) [0x7fa488cee5f0]
>>>> 2: (PyEval_EvalFrameEx()+0x1a9) [0x7fa48aeb50f9]
>>>> 3: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
>>>> 4: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
>>>> 5: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
>>>> 6: (PyEval_EvalCodeEx()+0x7ed) [0x7fa48aebe08d]
>>>> 7: (()+0x709c8) [0x7fa48ae479c8]
>>>> 8: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
>>>> 9: (()+0x5aaa5) [0x7fa48ae31aa5]
>>>> 10: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
>>>> 11: (()+0x4bb95) [0x7fa48ae22b95]
>>>> 12: (PyObject_CallMethod()+0xbb) [0x7fa48ae22ecb]
>>>> 13: (ActivePyModule::handle_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::buffer::v14_2_0::list const&, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*)+0x20e) [0x55c3c1fefc5e]
>>>> 14: (()+0x16c23d) [0x55c3c204023d]
>>>> 15: (FunctionContext::finish(int)+0x2c) [0x55c3c2001eac]
>>>> 16: (Context::complete(int)+0x9) [0x55c3c1ffe659]
>>>> 17: (Finisher::finisher_thread_entry()+0x156) [0x7fa48b439cc6]
>>>> 18: (()+0x7e65) [0x7fa488ce6e65]
>>>> 19: (clone()+0x6d) [0x7fa48799488d]
>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>> ------------------------------
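As an aside: `device get-health-metrics` expects a device id as listed by `ceph device ls` (typically model and serial), not an OSD name, so the `"devid": "osd.11"` in the dispatched command above may be related to the crash (cf. the tracker issue linked at the top). A sketch with a made-up devid:

```shell
# List known devices and which daemons they back; the first column
# is the devid that get-health-metrics expects
ceph device ls

# Query health metrics for one device (the devid below is made up)
ceph device get-health-metrics ST4000NM0023_Z1Z0ABCD
```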
>>>>
>>>> I have issued:
>>>> ceph device monitoring off
>>>> for now and will keep waiting to see if mgrs go silent again. If there are any better ideas or this issue is known, let me know.
>>>>
>>>> Cheers,
>>>> Oliver
>>>>
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>>>