No. Since the responses we've seen on the mailing lists and in the bug report(s) indicated that disabling the module fixed the situation, we didn't go down that path ourselves (it seemed highly probable it would resolve things). If it would be of additional value, we can disable the module temporarily to confirm the problem no longer presents itself, but our intent is not to leave the module disabled; we'd rather work towards a resolution of the underlying issue.
Let us know if disabling this module would assist in troubleshooting, and we're happy to do so.
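For reference, toggling the module would be quick on our end. A minimal sketch using the standard mgr module commands (we'd time-box the test window and re-enable afterwards):

  # temporarily disable the prometheus mgr module
  ceph mgr module disable prometheus

  # ...observe whether the mgr stays responsive...

  # re-enable it once the test window is over
  ceph mgr module enable prometheus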
FWIW, we've also built a container with all of the debuginfo packages and gdb set up so we can inspect the unresponsive ceph-mgr process, but our understanding of Ceph's internals isn't deep enough to determine why it appears to be deadlocking. That said, we welcome any requests for additional information we can provide to help determine the cause or implement a solution.
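For context, our inspection so far is roughly the following (a sketch; the pid lookup is specific to our containerized setup and the output filename is just illustrative):

  # attach to the hung ceph-mgr and dump backtraces for every thread
  gdb --batch -p "$(pidof ceph-mgr)" -ex 'thread apply all bt' > ceph-mgr-threads.txt

We're happy to share the resulting thread dumps if that would be useful.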
David
On Fri, Dec 11, 2020 at 8:10 AM Wido den Hollander <wido(a)42on.com> wrote:
On 11/12/2020 00:12, David Orman wrote:
Hi Janek,
We realize this; we referenced that issue in our initial email. We do want the metrics exposed by Ceph internally and would prefer to work towards a fix upstream. We appreciate the suggestion for a workaround, however!
Again, we're happy to provide whatever information we can that would be of assistance. If there's a debug setting that is preferred, we are happy to implement it, as this is currently a test cluster for us to work through issues such as this one.
Have you tried disabling Prometheus just to see if this also fixes the
issue for you?
Wido
David
On Thu, Dec 10, 2020 at 12:02 PM Janek Bevendorff <janek.bevendorff(a)uni-weimar.de> wrote:
> Do you have the prometheus module enabled? Turn that off, it's causing
> issues. I replaced it with another ceph exporter from Github and almost
> forgot about it.
>
> Here's the relevant issue report:
>
> https://tracker.ceph.com/issues/39264#change-179946
>
> On 10/12/2020 16:43, Welby McRoberts wrote:
>> Hi Folks
>>
>> We've noticed that in a cluster of 21 nodes (5 mgrs & mons, and 504 OSDs with
>> 24 per node) the mgrs are, after a non-specific period of time, dropping out
>> of the cluster. The logs only show the following:
>>
>> debug 2020-12-10T02:02:50.409+0000 7f1005840700 0 log_channel(cluster) log [DBG] : pgmap v14163: 4129 pgs: 4129 active+clean; 10 GiB data, 31 TiB used, 6.3 PiB / 6.3 PiB avail
>> debug 2020-12-10T03:20:59.223+0000 7f10624eb700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-12-10T02:20:59.226159+0000)
>> debug 2020-12-10T03:21:00.223+0000 7f10624eb700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-12-10T02:21:00.226310+0000)
>>
>> The _check_auth_rotating message repeats approximately every second. The
>> instances are all syncing their time with NTP and have no issues on that
>> front. A restart of the mgr fixes the issue.
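>> (For reference, a sketch of the restart, assuming a cephadm/containerized
>> deployment; the daemon name below is illustrative:
>>   ceph orch daemon restart mgr.host01.abcdef
>> or, alternatively, failing over to a standby with 'ceph mgr fail <active-mgr-name>'.)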
>>
>> It appears that this may be related to https://tracker.ceph.com/issues/39264.
>> The suggestion seems to be to disable prometheus metrics; however, this
>> obviously isn't realistic for a production environment where metrics are
>> critical for operations.
>>
>> Please let us know what additional information we can provide to assist
>> in resolving this critical issue.
>>
>> Cheers
>> Welby
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io