mgr daemons becoming unresponsive - ceph-users

1 Nov 2019

Dear Cephers,

this is a 14.2.4 cluster with device health metrics enabled - since about a day, all mgr
daemons go "silent" on me after a few hours, i.e. "ceph -s" shows:

  cluster:
    id:     269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
    health: HEALTH_WARN
            no active mgr
            1/3 mons down, quorum mon001,mon002

  services:
    mon:        3 daemons, quorum mon001,mon002 (age 57m), out of quorum: mon003
    mgr:        no daemons active (since 56m)
    ...
(the third mon has a planned outage and will come back in a few days)

Checking the logs of the mgr daemons, I find some "reset" messages at the time
when it goes "silent", first for the first mgr:

2019-11-01 21:34:40.286 7f2df6a6b700  0 log_channel(cluster) log [DBG] : pgmap v1798: 1585
pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
2019-11-01 21:34:41.458 7f2e0d59b700  0 client.0 ms_handle_reset on
v2:10.160.16.1:6800/401248
2019-11-01 21:34:42.287 7f2df6a6b700  0 log_channel(cluster) log [DBG] : pgmap v1799: 1585
pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail

and a bit later, on the standby mgr:

2019-11-01 22:18:14.892 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : pgmap v1798: 1585
pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB
data, 2.3 TiB used, 136 TiB / 138 TiB avail
2019-11-01 22:18:16.022 7f7be9e72700  0 client.0 ms_handle_reset on
v2:10.160.16.2:6800/352196
2019-11-01 22:18:16.893 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : pgmap v1799: 1585
pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB
data, 2.3 TiB used, 136 TiB / 138 TiB avail

Interestingly, the dashboard still works, but presents outdated information, and for
example zero I/O going on. 
I believe this started to happen mainly after the third mon went into the known downtime,
but I am not fully sure if this was the trigger, since the cluster is still growing. 
It may also have been the addition of 24 more OSDs. 

I also find other messages in the mgr logs which seem problematic, but I am not sure they
are related:
------------------------------
2019-11-01 21:17:09.849 7f2df4266700  0 mgr[devicehealth] Error reading OMAP: [errno 22]
Failed to operate read op for oid 
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in
put_device_metrics
    ioctx.operate_read_op(op, devid)
  File "rados.pyx", line 516, in rados.requires.wrapper.validate_func
(/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUIL
D/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
  File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op
(/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
InvalidArgumentError: [errno 22] Failed to operate read op for oid 
------------------------------
or:
------------------------------
2019-11-01 21:33:53.977 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON result from
daemon osd.51 ()
2019-11-01 21:33:53.978 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON result from
daemon osd.52 ()
2019-11-01 21:33:53.979 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON result from
daemon osd.53 ()
------------------------------

The reason why I am cautious about the health metrics is that I observed a crash when
trying to query them:
------------------------------
2019-11-01 20:21:23.661 7fa46314a700  0 log_channel(audit) log [DBG] :
from='client.174136 -' entity='client.admin' cmd=[{"prefix":
"device get-health-metrics", "devid": "osd.11",
"target": ["mgr", ""]}]: dispatch
2019-11-01 20:21:23.661 7fa46394b700  0 mgr[devicehealth] handle_command
2019-11-01 20:21:23.663 7fa46394b700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fa46394b700 thread_name:mgr-fin

 ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
 1: (()+0xf5f0) [0x7fa488cee5f0]
 2: (PyEval_EvalFrameEx()+0x1a9) [0x7fa48aeb50f9]
 3: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
 4: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
 5: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
 6: (PyEval_EvalCodeEx()+0x7ed) [0x7fa48aebe08d]
 7: (()+0x709c8) [0x7fa48ae479c8]
 8: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
 9: (()+0x5aaa5) [0x7fa48ae31aa5]
 10: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
 11: (()+0x4bb95) [0x7fa48ae22b95]
 12: (PyObject_CallMethod()+0xbb) [0x7fa48ae22ecb]
 13: (ActivePyModule::handle_command(std::map<std::string,
boost::variant<std::string, bool, long, double, std::vector<std::string,
std::allocator<std::string> >, std::vector<long, std::allocator<long>
>, std::vector<double, std::allocator<double> > >,
std::less<void>, std::allocator<std::pair<std::string const,
boost::variant<std::string, bool, long, double, std::vector<std::string,
std::allocator<std::string> >, std::vector<long, std::allocator<long>
>, std::vector<double, std::allocator<double> > > > > >
const&, ceph::buffer::v14_2_0::list const&, std::basic_stringstream<char,
std::char_traits<char>, std::allocator<char> >*,
std::basic_stringstream<char, std::char_traits<char>, std::allocator<char>
>*)+0x20e) [0x55c3c1fefc5e]
 14: (()+0x16c23d) [0x55c3c204023d]
 15: (FunctionContext::finish(int)+0x2c) [0x55c3c2001eac]
 16: (Context::complete(int)+0x9) [0x55c3c1ffe659]
 17: (Finisher::finisher_thread_entry()+0x156) [0x7fa48b439cc6]
 18: (()+0x7e65) [0x7fa488ce6e65]
 19: (clone()+0x6d) [0x7fa48799488d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
------------------------------

I have issued:
ceph device monitoring off
for now and will keep waiting to see if mgrs go silent again. If there are any better
ideas or this issue is known, let me know. 

Cheers,
	Oliver