mgr's stop responding, dropping out of cluster with _check_auth_rotating

List overview All Threads
Download

newer

older

Ceph with Firewall

Re: Setting up NFS with Octopus

Welby McRoberts

10 Dec 2020 10 Dec '20

7:13 p.m.

Hi Folks We've noticed that in a cluster of 21 nodes (5 mgrs&mons & 504 OSDs with 24 per node) that the mgr's are, after a non specific period of time, dropping out of the cluster. The logs only show the following: debug 2020-12-10T02:02:50.409+0000 7f1005840700 0 log_channel(cluster) log [DBG] : pgmap v14163: 4129 pgs: 4129 active+clean; 10 GiB data, 31 TiB used, 6.3 PiB / 6.3 PiB avail debug 2020-12-10T03:20:59.223+0000 7f10624eb700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-12-10T02:20:59.226159+0000) debug 2020-12-10T03:21:00.223+0000 7f10624eb700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-12-10T02:21:00.226310+0000) The _check_auth_rotating repeats approximately every second. The instances are all syncing their time with NTP and have no issues on that front. A restart of the mgr fixes the issue. It appears that this may be related to https://tracker.ceph.com/issues/39264. The suggestion seems to be to disable prometheus metrics, however, this obviously isn't realistic for a production environment where metrics are critical for operations. Please let us know what additional information we can provide to assist in resolving this critical issue. Cheers Welby

Show replies by date

Janek Bevendorff

10 Dec 10 Dec

9:31 p.m.

Do you have the prometheus module enabled? Turn that off, it's causing issues. I replaced it with another ceph exporter from Github and almost forgot about it. Here's the relevant issue report: https://tracker.ceph.com/issues/39264#change-179946 On 10/12/2020 16:43, Welby McRoberts wrote: > Hi Folks > > We've noticed that in a cluster of 21 nodes (5 mgrs&mons & 504 OSDs with 24 > per node) that the mgr's are, after a non specific period of time, dropping > out of the cluster. The logs only show the following: > > debug 2020-12-10T02:02:50.409+0000 7f1005840700 0 log_channel(cluster) log > [DBG] : pgmap v14163: 4129 pgs: 4129 active+clean; 10 GiB data, 31 TiB > used, 6.3 PiB / 6.3 PiB avail > debug 2020-12-10T03:20:59.223+0000 7f10624eb700 -1 monclient: > _check_auth_rotating possible clock skew, rotating keys expired way too > early (before 2020-12-10T02:20:59.226159+0000) > debug 2020-12-10T03:21:00.223+0000 7f10624eb700 -1 monclient: > _check_auth_rotating possible clock skew, rotating keys expired way too > early (before 2020-12-10T02:21:00.226310+0000) > > The _check_auth_rotating repeats approximately every second. The instances > are all syncing their time with NTP and have no issues on that front. A > restart of the mgr fixes the issue. > > It appears that this may be related to https://tracker.ceph.com/issues/39264. > The suggestion seems to be to disable prometheus metrics, however, this > obviously isn't realistic for a production environment where metrics are > critical for operations. > > Please let us know what additional information we can provide to assist in > resolving this critical issue. > > Cheers > Welby > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io

Janek Bevendorff

9:35 p.m.

FYI, this is the ceph-exporter we're using at the moment: https://github.com/digitalocean/ceph_exporter It's not as good, but it does the job mostly. Some more specific metrics are missing, but the majority is there. On 10/12/2020 19:01, Janek Bevendorff wrote: > Do you have the prometheus module enabled? Turn that off, it's causing > issues. I replaced it with another ceph exporter from Github and > almost forgot about it. > > Here's the relevant issue report: > https://tracker.ceph.com/issues/39264#change-179946 > > On 10/12/2020 16:43, Welby McRoberts wrote: >> Hi Folks >> >> We've noticed that in a cluster of 21 nodes (5 mgrs&mons & 504 OSDs >> with 24 >> per node) that the mgr's are, after a non specific period of time, >> dropping >> out of the cluster. The logs only show the following: >> >> debug 2020-12-10T02:02:50.409+0000 7f1005840700 0 >> log_channel(cluster) log >> [DBG] : pgmap v14163: 4129 pgs: 4129 active+clean; 10 GiB data, 31 TiB >> used, 6.3 PiB / 6.3 PiB avail >> debug 2020-12-10T03:20:59.223+0000 7f10624eb700 -1 monclient: >> _check_auth_rotating possible clock skew, rotating keys expired way too >> early (before 2020-12-10T02:20:59.226159+0000) >> debug 2020-12-10T03:21:00.223+0000 7f10624eb700 -1 monclient: >> _check_auth_rotating possible clock skew, rotating keys expired way too >> early (before 2020-12-10T02:21:00.226310+0000) >> >> The _check_auth_rotating repeats approximately every second. The >> instances >> are all syncing their time with NTP and have no issues on that front. A >> restart of the mgr fixes the issue. >> >> It appears that this may be related to >> https://tracker.ceph.com/issues/39264. >> The suggestion seems to be to disable prometheus metrics, however, this >> obviously isn't realistic for a production environment where metrics are >> critical for operations. >> >> Please let us know what additional information we can provide to >> assist in >> resolving this critical issue. >> >> Cheers >> Welby >> _______________________________________________ >> ceph-users mailing list -- ceph-users(a)ceph.io >> To unsubscribe send an email to ceph-users-leave(a)ceph.io > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io

David Orman

11 Dec 11 Dec

2:42 a.m.

Hi Janek, We realize this, we referenced that issue in our initial email. We do want the metrics exposed by Ceph internally, and would prefer to work towards a fix upstream. We appreciate the suggestion for a workaround, however! Again, we're happy to provide whatever information we can that would be of assistance. If there's some debug setting that is preferred, we are happy to implement it, as this is currently a test cluster for us to work through issues such as this one. David On Thu, Dec 10, 2020 at 12:02 PM Janek Bevendorff < janek.bevendorff(a)uni-weimar.de> wrote:

...

Hi Folks We've noticed that in a cluster of 21 nodes (5 mgrs&mons & 504 OSDs with

per node) that the mgr's are, after a non specific period of time,

dropping

out of the cluster. The logs only show the following: debug 2020-12-10T02:02:50.409+0000 7f1005840700 0 log_channel(cluster)

log

[DBG] : pgmap v14163: 4129 pgs: 4129 active+clean; 10 GiB data, 31 TiB used, 6.3 PiB / 6.3 PiB avail debug 2020-12-10T03:20:59.223+0000 7f10624eb700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-12-10T02:20:59.226159+0000) debug 2020-12-10T03:21:00.223+0000 7f10624eb700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-12-10T02:21:00.226310+0000) The _check_auth_rotating repeats approximately every second. The

instances

are all syncing their time with NTP and have no issues on that front. A restart of the mgr fixes the issue. It appears that this may be related to

https://tracker.ceph.com/issues/39264.

The suggestion seems to be to disable prometheus metrics, however, this obviously isn't realistic for a production environment where metrics are critical for operations. Please let us know what additional information we can provide to assist

resolving this critical issue. Cheers Welby _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Wido den Hollander

5:40 p.m.

On 11/12/2020 00:12, David Orman wrote:

...

Have you tried disabling Prometheus just to see if this also fixes the issue for you? Wido > David > > On Thu, Dec 10, 2020 at 12:02 PM Janek Bevendorff < > janek.bevendorff(a)uni-weimar.de> wrote: > >> Do you have the prometheus module enabled? Turn that off, it's causing >> issues. I replaced it with another ceph exporter from Github and almost >> forgot about it. >> >> Here's the relevant issue report: >> https://tracker.ceph.com/issues/39264#change-179946 >> >> On 10/12/2020 16:43, Welby McRoberts wrote: >>> Hi Folks >>> >>> We've noticed that in a cluster of 21 nodes (5 mgrs&mons & 504 OSDs with >> 24 >>> per node) that the mgr's are, after a non specific period of time, >> dropping >>> out of the cluster. The logs only show the following: >>> >>> debug 2020-12-10T02:02:50.409+0000 7f1005840700 0 log_channel(cluster) >> log >>> [DBG] : pgmap v14163: 4129 pgs: 4129 active+clean; 10 GiB data, 31 TiB >>> used, 6.3 PiB / 6.3 PiB avail >>> debug 2020-12-10T03:20:59.223+0000 7f10624eb700 -1 monclient: >>> _check_auth_rotating possible clock skew, rotating keys expired way too >>> early (before 2020-12-10T02:20:59.226159+0000) >>> debug 2020-12-10T03:21:00.223+0000 7f10624eb700 -1 monclient: >>> _check_auth_rotating possible clock skew, rotating keys expired way too >>> early (before 2020-12-10T02:21:00.226310+0000) >>> >>> The _check_auth_rotating repeats approximately every second. The >> instances >>> are all syncing their time with NTP and have no issues on that front. A >>> restart of the mgr fixes the issue. >>> >>> It appears that this may be related to >> https://tracker.ceph.com/issues/39264. >>> The suggestion seems to be to disable prometheus metrics, however, this >>> obviously isn't realistic for a production environment where metrics are >>> critical for operations. >>> >>> Please let us know what additional information we can provide to assist >> in >>> resolving this critical issue. >>> >>> Cheers >>> Welby >>> _______________________________________________ >>> ceph-users mailing list -- ceph-users(a)ceph.io >>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >> _______________________________________________ >> ceph-users mailing list -- ceph-users(a)ceph.io >> To unsubscribe send an email to ceph-users-leave(a)ceph.io >> > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io >

David Orman

6:40 p.m.

No, as the number of responses we've seen in the mailing lists and on the bug report(s) have indicated it fixed the situation, we didn't proceed down that path (it seemed highly probable it would resolve things). If it's of additional value, we can disable the module temporarily to see if the problem no longer presents itself, but our intent would not be to continue to leave the module disabled and instead work towards resolution of the issue at hand. Let us know if disabling this module would assist in troubleshooting, and we're happy to do so. FWIW - we've also built a container with all of the debuginfo packages and gdb setup to inspect the unresponsive ceph-mgr process, but our understanding of ceph's internal workings is not deep enough to determine why it appears to be deadlocking. That said, we welcome any requests for any additional information we can provide to assist in determining the cause/implementation of a solution. David On Fri, Dec 11, 2020 at 8:10 AM Wido den Hollander <wido(a)42on.com> wrote:

...

On 11/12/2020 00:12, David Orman wrote:

Hi Janek, We realize this, we referenced that issue in our initial email. We do

want

the metrics exposed by Ceph internally, and would prefer to work towards

fix upstream. We appreciate the suggestion for a workaround, however! Again, we're happy to provide whatever information we can that would be

assistance. If there's some debug setting that is preferred, we are happy to implement it, as this is currently a test cluster for us to work

through

issues such as this one.

Have you tried disabling Prometheus just to see if this also fixes the issue for you? Wido

David On Thu, Dec 10, 2020 at 12:02 PM Janek Bevendorff < janek.bevendorff(a)uni-weimar.de> wrote: > Do you have the prometheus module enabled? Turn that off, it's causing > issues. I replaced it with another ceph exporter from Github and almost > forgot about it. > > Here's the relevant issue report: > https://tracker.ceph.com/issues/39264#change-179946 > > On 10/12/2020 16:43, Welby McRoberts wrote: >> Hi Folks >> >> We've noticed that in a cluster of 21 nodes (5 mgrs&mons & 504 OSDs

with

> 24 >> per node) that the mgr's are, after a non specific period of time, > dropping >> out of the cluster. The logs only show the following: >> >> debug 2020-12-10T02:02:50.409+0000 7f1005840700 0 log_channel(cluster) > log >> [DBG] : pgmap v14163: 4129 pgs: 4129 active+clean; 10 GiB data, 31 TiB >> used, 6.3 PiB / 6.3 PiB avail >> debug 2020-12-10T03:20:59.223+0000 7f10624eb700 -1 monclient: >> _check_auth_rotating possible clock skew, rotating keys expired way too >> early (before 2020-12-10T02:20:59.226159+0000) >> debug 2020-12-10T03:21:00.223+0000 7f10624eb700 -1 monclient: >> _check_auth_rotating possible clock skew, rotating keys expired way too >> early (before 2020-12-10T02:21:00.226310+0000) >> >> The _check_auth_rotating repeats approximately every second. The > instances >> are all syncing their time with NTP and have no issues on that front. A >> restart of the mgr fixes the issue. >> >> It appears that this may be related to > https://tracker.ceph.com/issues/39264. >> The suggestion seems to be to disable prometheus metrics, however, this >> obviously isn't realistic for a production environment where metrics

are

critical for operations. Please let us know what additional information we can provide to assist

resolving this critical issue. Cheers Welby _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

David Orman

21 Dec 21 Dec

7:47 p.m.

We've got a PR in to fix this; we validated it resolves the issue in our larger clusters. We could use some help getting this moved forward since it seems to impact a number of users: https://github.com/ceph/ceph/pull/38677 On Fri, Dec 11, 2020 at 9:10 AM David Orman <ormandj(a)corenode.com> wrote:

...

On 11/12/2020 00:12, David Orman wrote:

Hi Janek, We realize this, we referenced that issue in our initial email. We do

want

the metrics exposed by Ceph internally, and would prefer to work

towards a

fix upstream. We appreciate the suggestion for a workaround, however! Again, we're happy to provide whatever information we can that would be

assistance. If there's some debug setting that is preferred, we are

happy

to implement it, as this is currently a test cluster for us to work

through

issues such as this one.

Have you tried disabling Prometheus just to see if this also fixes the issue for you? Wido

with

> 24 >> per node) that the mgr's are, after a non specific period of time, > dropping >> out of the cluster. The logs only show the following: >> >> debug 2020-12-10T02:02:50.409+0000 7f1005840700 0

log_channel(cluster)

> log >> [DBG] : pgmap v14163: 4129 pgs: 4129 active+clean; 10 GiB data, 31 TiB >> used, 6.3 PiB / 6.3 PiB avail >> debug 2020-12-10T03:20:59.223+0000 7f10624eb700 -1 monclient: >> _check_auth_rotating possible clock skew, rotating keys expired way

too

>> early (before 2020-12-10T02:20:59.226159+0000) >> debug 2020-12-10T03:21:00.223+0000 7f10624eb700 -1 monclient: >> _check_auth_rotating possible clock skew, rotating keys expired way

too

>> early (before 2020-12-10T02:21:00.226310+0000) >> >> The _check_auth_rotating repeats approximately every second. The > instances >> are all syncing their time with NTP and have no issues on that front.

>> restart of the mgr fixes the issue. >> >> It appears that this may be related to > https://tracker.ceph.com/issues/39264. >> The suggestion seems to be to disable prometheus metrics, however,

this

>> obviously isn't realistic for a production environment where metrics

are

>> critical for operations. >> >> Please let us know what additional information we can provide to

assist

in > resolving this critical issue. > > Cheers > Welby > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

1223

days inactive

1234

days old

ceph-users@ceph.io

Manage subscription

6 comments

4 participants

tags (0)

participants (4)

David Orman
Janek Bevendorff
Welby McRoberts
Wido den Hollander