[ceph-users] Re: mgr's stop responding, dropping out of cluster with _check_auth_rotating

11 Dec 2020

On 11/12/2020 00:12, David Orman wrote:
...
  Hi Janek,

 We realize this; we referenced that issue in our initial email. We do want
 the metrics exposed by Ceph internally, and would prefer to work towards a
 fix upstream. We appreciate the suggestion for a workaround, however!

 Again, we're happy to provide whatever information we can that would be of
 assistance. If there's some debug setting that is preferred, we are happy
 to implement it, as this is currently a test cluster for us to work through
 issues such as this one.

Have you tried disabling Prometheus just to see if this also fixes the 
issue for you?

Wido
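
For comparing mgr logs before and after toggling the module (the usual
commands are `ceph mgr module disable prometheus` and
`ceph mgr module enable prometheus`), here is a minimal Python sketch that
pulls the `_check_auth_rotating` warnings out of a log and reports how far
the "before" cutoff lies behind each log timestamp. The function name and
the regular expression are illustrative assumptions based only on the two
log lines quoted in this thread, and the sketch assumes one log entry per
line. It is not an official Ceph tool.

```python
import re
from datetime import datetime

# Timestamp shape seen in the quoted log, e.g. 2020-12-10T03:20:59.223+0000
_TS = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[+-]\d{4}"
_TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"

# Assumption: each warning is on a single line and follows the wording
# quoted in this thread.
_WARNING = re.compile(
    r"(?P<logged>" + _TS + r")\s.*_check_auth_rotating.*"
    r"\(before\s+(?P<cutoff>" + _TS + r")\)"
)


def rotating_key_gaps(log_text):
    """Return (logged_time, logged_time - cutoff) for each warning line."""
    gaps = []
    for line in log_text.splitlines():
        match = _WARNING.search(line)
        if match is None:
            continue  # not a _check_auth_rotating warning
        logged = datetime.strptime(match.group("logged"), _TS_FORMAT)
        cutoff = datetime.strptime(match.group("cutoff"), _TS_FORMAT)
        gaps.append((logged, logged - cutoff))
    return gaps
```

Run against the two warnings quoted in the original report, the gap comes
out just under one hour in both cases. An empty result after the module has
been disabled for a while would suggest the warnings have stopped, although
that alone does not prove the module was the cause.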

> David
> 
> On Thu, Dec 10, 2020 at 12:02 PM Janek Bevendorff <
> janek.bevendorff(a)uni-weimar.de> wrote:
> 
>> Do you have the prometheus module enabled? Turn that off; it's causing
>> issues. I replaced it with another Ceph exporter from GitHub and almost
>> forgot about it.
>>
>> Here's the relevant issue report:
>> https://tracker.ceph.com/issues/39264#change-179946
>>
>> On 10/12/2020 16:43, Welby McRoberts wrote:
>>> Hi Folks
>>>
>>> We've noticed that in a cluster of 21 nodes (5 mgrs & mons, and 504 OSDs
>>> with 24 per node) the mgrs are, after a non-specific period of time,
>>> dropping out of the cluster. The logs only show the following:
>>>
>>> debug 2020-12-10T02:02:50.409+0000 7f1005840700  0 log_channel(cluster) log [DBG] : pgmap v14163: 4129 pgs: 4129 active+clean; 10 GiB data, 31 TiB used, 6.3 PiB / 6.3 PiB avail
>>> debug 2020-12-10T03:20:59.223+0000 7f10624eb700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-12-10T02:20:59.226159+0000)
>>> debug 2020-12-10T03:21:00.223+0000 7f10624eb700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-12-10T02:21:00.226310+0000)
>>>
>>> The _check_auth_rotating message repeats approximately every second. The
>>> instances are all syncing their time with NTP and have no issues on that
>>> front. A restart of the mgr fixes the issue.
>>>
>>> It appears that this may be related to https://tracker.ceph.com/issues/39264.
>>> The suggestion seems to be to disable Prometheus metrics. However, this
>>> obviously isn't realistic for a production environment where metrics are
>>> critical for operations.
>>>
>>> Please let us know what additional information we can provide to assist in
>>> resolving this critical issue.
>>>
>>> Cheers
>>> Welby
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
