Well, even after restarting the MGR service, the relevant log is flooded
with these error messages:
2019-11-06 17:46:22.363 7f81ffdcc700 0 auth: could not find secret_id=3865
2019-11-06 17:46:22.363 7f81ffdcc700 0 cephx: verify_authorizer could
not get service secret for service mgr secret_id=3865
As you can see, the secret_id changes.
However, I have no idea which service this secret_id belongs to.
And in my opinion these errors are preventing the MGR from doing its
job: bringing the cluster back to a healthy state.
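One thing that might be worth checking here - this is just a hunch, not a
confirmed diagnosis - is clock synchronisation: the cephx service secrets
rotate on a schedule, and skewed clocks can make a daemon present a
secret_id the mons have already expired. A quick check:
ceph time-sync-status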
On 06.11.2019 at 17:41, Mac Wynkoop wrote:
I actually just had some unresponsive mgr daemons; restarting them
fixed the issue. If it happens again, I'll check whether it's the same
error.
Mac Wynkoop
On Wed, Nov 6, 2019 at 8:43 AM Thomas Schneider <74cmonty@gmail.com> wrote:
Hi,
does anybody get these error messages in the MGR log?
2019-11-06 15:41:44.765 7f10db740700 0 auth: could not find secret_id=3863
2019-11-06 15:41:44.765 7f10db740700 0 cephx: verify_authorizer could not get service secret for service mgr secret_id=3863
THX
On 06.11.2019 at 10:43, Oliver Freyermuth wrote:
Hi all,
interestingly, now that the third mon has been missing for almost a week
(those planned interventions always take longer than expected...),
we are seeing mgr failovers (but without crashes).
In the mgr log, I find:
2019-11-06 07:50:05.409 7fce8a0dc700 0 client.0 ms_handle_reset on v2:10.160.16.1:6800/618072
...
... the normal churning ...
...
2019-11-06 07:52:44.113 7fce8a0dc700 -1 mgr handle_mgr_map I was active but no longer am
2019-11-06 07:52:44.113 7fce8a0dc700 1 mgr respawn e: '/usr/bin/ceph-mgr'
In the mon log, I see:
...
2019-11-06 07:44:11.565 7f1f44453700 4 rocksdb: [db/db_impl_files.cc:356] [JOB 225] Try to delete WAL files size 10830909, prev total WAL file size 10839895, number of live WAL files 2.
2019-11-06 07:44:11.565 7f1f3a43f700 4 rocksdb: [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting
2019-11-06 07:44:11.565 7f1f44c54700 4 rocksdb: (Original Log Time 2019/11/06-07:44:11.565802) [db/db_impl_compaction_flush.cc:2374] [default] Manual compaction from level-0 to level-6 from 'mgrstat .. 'mgrstat; will stop at (end)
...
2019-11-06 07:50:36.734 7f1f3a43f700 4 rocksdb: [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting
2019-11-06 07:52:27.046 7f1f4144d700 0 log_channel(cluster) log [INF] : Manager daemon mon001 is unresponsive, replacing it with standby daemon mon002
...
There's a lot of compaction going on (probably due to the prolonged
HEALTH_WARN state, so not really unexpected),
so I wonder whether the actual cause for marking the mgr as
"unresponsive" is the heavy compaction on the mons.
It will be interesting to see what happens when we finally have the
third mon back and the cluster becomes healthy again...
Has anybody seen something similar after running Nautilus for a week
or more on old and slow hardware?
Cheers,
Oliver
On 02.11.19 at 18:20, Oliver Freyermuth wrote:
> Dear Sage,
>
> good news - it happened again, with debug logs!
> There's nothing obvious to my eye, it's uploaded as:
> 0b2d0c09-46f3-4126-aa27-e2d2e8572741
> It seems the failure roughly coincided with my attempt to access
> the dashboard. It must have happened within the last ~5-10 minutes of
> the log.
>
> I'll now go back to "stable operation", in case you need anything
> else, just let me know.
>
> Cheers and all the best,
> Oliver
>
> On 02.11.19 at 17:38, Oliver Freyermuth wrote:
>> Dear Sage,
>>
>> at least for the simple case:
>> ceph device get-health-metrics osd.11
>> => mgr crashes (but in that case, it crashes fully, i.e. the process
>> is gone)
>> I have now uploaded a verbose log as:
>> ceph-post-file: e3bd60ad-cbce-4308-8b07-7ebe7998572e
>>
>> One potential cause of this (and maybe of the other issues) might be
>> that some of our OSDs sit behind non-JBOD controllers, where each
>> disk is exposed as a single-disk RAID 0,
>> so a plain smartctl on the device will not work (smartctl
>> -d megaraid,<number> would be needed).
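>> To illustrate (the slot number here is made up for the example):
>> smartctl -a /dev/sda                # plain query fails behind the controller
>> smartctl -a -d megaraid,0 /dev/sda  # reaches the physical disk in slot 0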
>>
>> Now I have both mgrs active again, debug logging on, device health
>> metrics on again,
>> and am waiting for them to become silent again. Let's hope the issue
>> reappears before the disks run full of logs ;-).
>>
>> Cheers,
>> Oliver
>>
>> On 02.11.19 at 02:56, Sage Weil wrote:
>>> On Sat, 2 Nov 2019, Oliver Freyermuth wrote:
>>>> Dear Cephers,
>>>>
>>>> interestingly, after:
>>>> ceph device monitoring off
>>>> the mgrs seem to be stable now - the active one still went silent
>>>> a few minutes later,
>>>> but the standby took over and stayed stable, and after restarting
>>>> the broken one, it has now been stable for an hour, too,
>>>> so a restart of the mgr is probably needed after disabling device
>>>> monitoring to get things stable again.
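>>>> For reference, the restart itself can be done via a failover,
>>>> roughly like this (daemon names as in my setup):
>>>> ceph mgr fail mon001
>>>> systemctl restart ceph-mgr@mon001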
>>>>
>>>> So it seems to be caused by a problem with the device health
>>>> metrics. In case this is a red herring and the mgrs become unstable
>>>> again in the next days,
>>>> I'll let you know.
>>>
>>> If this seems to stabilize things, and you can tolerate inducing the
>>> failure again, reproducing the problem with mgr logs cranked up
>>> (debug_mgr = 20, debug_ms = 1) would probably give us a good idea of
>>> why the mgr is hanging. Let us know!
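>>> Something like this should do it (and can be reverted with
>>> 'ceph config rm' afterwards):
>>> ceph config set mgr debug_mgr 20
>>> ceph config set mgr debug_ms 1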
>>>
>>> Thanks,
>>> sage
>>>
>>>> Cheers,
>>>> Oliver
>>>>
>>>> On 01.11.19 at 23:09, Oliver Freyermuth wrote:
>>>>> Dear Cephers,
>>>>>
>>>>> this is a 14.2.4 cluster with device health metrics enabled -
>>>>> for about a day now, all mgr daemons go "silent" on me after a few
>>>>> hours, i.e. "ceph -s" shows:
>>>>>
>>>>>   cluster:
>>>>>     id:     269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
>>>>>     health: HEALTH_WARN
>>>>>             no active mgr
>>>>>             1/3 mons down, quorum mon001,mon002
>>>>>   services:
>>>>>     mon: 3 daemons, quorum mon001,mon002 (age 57m), out of quorum: mon003
>>>>>     mgr: no daemons active (since 56m)
>>>>>     ...
>>>>> (the third mon has a planned outage and will come back in a few
>>>>> days)
>>>>>
>>>>> Checking the logs of the mgr daemons, I find some "reset"
>>>>> messages at the time when it goes "silent", first for the first mgr:
>>>>>
>>>>> 2019-11-01 21:34:40.286 7f2df6a6b700 0 log_channel(cluster) log [DBG] : pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>>> 2019-11-01 21:34:41.458 7f2e0d59b700 0 client.0 ms_handle_reset on v2:10.160.16.1:6800/401248
>>>>> 2019-11-01 21:34:42.287 7f2df6a6b700 0 log_channel(cluster) log [DBG] : pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>>>
>>>>> and a bit later, on the standby mgr:
>>>>>
>>>>> 2019-11-01 22:18:14.892 7f7bcc8ae700 0 log_channel(cluster) log [DBG] : pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>>> 2019-11-01 22:18:16.022 7f7be9e72700 0 client.0 ms_handle_reset on v2:10.160.16.2:6800/352196
>>>>> 2019-11-01 22:18:16.893 7f7bcc8ae700 0 log_channel(cluster) log [DBG] : pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>>>
>>>>> Interestingly, the dashboard still works, but presents outdated
>>>>> information, for example showing zero I/O going on.
>>>>> I believe this started to happen mainly after the third mon went
>>>>> into the known downtime, but I am not fully sure that this was the
>>>>> trigger, since the cluster is still growing.
>>>>> It may also have been the addition of 24 more OSDs.
>>>>>
>>>>>
>>>>> I also find other messages in the mgr logs which seem
>>>>> problematic, but I am not sure they are related:
>>>>> ------------------------------
>>>>> 2019-11-01 21:17:09.849 7f2df4266700 0 mgr[devicehealth] Error reading OMAP: [errno 22] Failed to operate read op for oid
>>>>> Traceback (most recent call last):
>>>>>   File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in put_device_metrics
>>>>>     ioctx.operate_read_op(op, devid)
>>>>>   File "rados.pyx", line 516, in rados.requires.wrapper.validate_func (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
>>>>>   File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
>>>>> InvalidArgumentError: [errno 22] Failed to operate read op for oid
>>>>> ------------------------------
>>>>> or:
>>>>> ------------------------------
>>>>> 2019-11-01 21:33:53.977 7f7bd38bc700 0 mgr[devicehealth] Fail to parse JSON result from daemon osd.51 ()
>>>>> 2019-11-01 21:33:53.978 7f7bd38bc700 0 mgr[devicehealth] Fail to parse JSON result from daemon osd.52 ()
>>>>> 2019-11-01 21:33:53.979 7f7bd38bc700 0 mgr[devicehealth] Fail to parse JSON result from daemon osd.53 ()
>>>>> ------------------------------
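>>>>> In case it helps: the raw OMAP data the module fails to read can
>>>>> be inspected directly with the rados CLI (assuming the default
>>>>> device_health_metrics pool):
>>>>> rados -p device_health_metrics ls
>>>>> rados -p device_health_metrics listomapvals <devid>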
>>>>>
>>>>> The reason why I am cautious about the health metrics is that I
>>>>> observed a crash when trying to query them:
>>>>> ------------------------------
>>>>> 2019-11-01 20:21:23.661 7fa46314a700 0 log_channel(audit) log [DBG] : from='client.174136 -' entity='client.admin' cmd=[{"prefix": "device get-health-metrics", "devid": "osd.11", "target": ["mgr", ""]}]: dispatch
>>>>> 2019-11-01 20:21:23.661 7fa46394b700 0 mgr[devicehealth] handle_command
>>>>> 2019-11-01 20:21:23.663 7fa46394b700 -1 *** Caught signal (Segmentation fault) **
>>>>> in thread 7fa46394b700 thread_name:mgr-fin
>>>>>
>>>>> ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
>>>>> 1: (()+0xf5f0) [0x7fa488cee5f0]
>>>>> 2: (PyEval_EvalFrameEx()+0x1a9) [0x7fa48aeb50f9]
>>>>> 3: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
>>>>> 4: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
>>>>> 5: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
>>>>> 6: (PyEval_EvalCodeEx()+0x7ed) [0x7fa48aebe08d]
>>>>> 7: (()+0x709c8) [0x7fa48ae479c8]
>>>>> 8: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
>>>>> 9: (()+0x5aaa5) [0x7fa48ae31aa5]
>>>>> 10: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
>>>>> 11: (()+0x4bb95) [0x7fa48ae22b95]
>>>>> 12: (PyObject_CallMethod()+0xbb) [0x7fa48ae22ecb]
>>>>> 13: (ActivePyModule::handle_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::buffer::v14_2_0::list const&, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*)+0x20e) [0x55c3c1fefc5e]
>>>>> 14: (()+0x16c23d) [0x55c3c204023d]
>>>>> 15: (FunctionContext::finish(int)+0x2c) [0x55c3c2001eac]
>>>>> 16: (Context::complete(int)+0x9) [0x55c3c1ffe659]
>>>>> 17: (Finisher::finisher_thread_entry()+0x156) [0x7fa48b439cc6]
>>>>> 18: (()+0x7e65) [0x7fa488ce6e65]
>>>>> 19: (clone()+0x6d) [0x7fa48799488d]
>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>> ------------------------------
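>>>>> If it helps with interpreting the trace: on CentOS 7 that would be
>>>>> roughly (assuming the debuginfo package is available):
>>>>> debuginfo-install ceph-mgr
>>>>> objdump -rdS /usr/bin/ceph-mgr > ceph-mgr.objdump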
>>>>>
>>>>> I have issued:
>>>>> ceph device monitoring off
>>>>> for now and will keep waiting to see if the mgrs go silent again.
>>>>> If there are any better ideas or this issue is known, let me know.
>>>>>
>>>>> Cheers,
>>>>> Oliver
>>>>>
>>>>>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io