Hi,
I have the same situation with some OSDs on Octopus 15.2.5 (Ubuntu 20.04),
but I have no problems with the MGR. Any clue about this?
Best regards,
Date: Tue, 9 Jun 2020 23:47:24 +0200
From: Wido den Hollander <wido(a)42on.com>
Subject: [ceph-users] Octopus OSDs dropping out of cluster:
_check_auth_rotating possible clock skew, rotating keys expired way
too early
To: "ceph-users(a)ceph.io" <ceph-users(a)ceph.io>
Message-ID: <be7aadc4-2142-ea31-caa8-28ca6db03d15(a)42on.com>
Hi,
On a recently deployed Octopus (15.2.2) cluster (240 OSDs) we are seeing
OSDs randomly drop out of the cluster.
Usually it's 2 to 4 OSDs spread out over different nodes. Each node has
16 OSDs and not all the failing OSDs are on the same node.
The OSDs are marked as down, and all they keep printing in their logs is:
monclient: _check_auth_rotating possible clock skew, rotating keys
expired way too early (before 2020-06-04T07:57:17.706529-0400)
Looking at their status through the admin socket:
{
    "cluster_fsid": "68653193-9b84-478d-bc39-1a811dd50836",
    "osd_fsid": "87231b5d-ae5f-4901-93c5-18034381e5ec",
    "whoami": 206,
    "state": "active",
    "oldest_map": 73697,
    "newest_map": 75795,
    "num_pgs": 19
}
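For reference, here is a minimal sketch of how one could pull that status
for every OSD on a node through the admin sockets. It assumes the default
socket location under /var/run/ceph and the stock "ceph" CLI, so adjust
both for your own deployment:

# Minimal sketch: poll each local OSD's status through its admin socket.
# Assumes the default socket path /var/run/ceph/ceph-osd.*.asok and the
# stock "ceph" CLI; adjust both for your own deployment.
import glob
import json
import subprocess

for sock in sorted(glob.glob("/var/run/ceph/ceph-osd.*.asok")):
    out = subprocess.run(["ceph", "daemon", sock, "status"],
                         capture_output=True, text=True, check=True).stdout
    st = json.loads(out)
    print("osd.%s: state=%s maps=%s..%s"
          % (st["whoami"], st["state"], st["oldest_map"], st["newest_map"]))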
The message brought me back to a ticket I created two years ago:
https://tracker.ceph.com/issues/23460
The first thing I checked was NTP/time, and I double- and triple-checked
it. All clocks on the cluster are in sync, so nothing is wrong there.
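For anyone who wants to run the same sanity check, here is a rough sketch
that compares the local clock against an NTP server. It assumes the
third-party ntplib package and a reachable pool server, both placeholders
for whatever your environment actually uses:

# Rough sketch: report this host's clock offset against an NTP server.
# Assumes the third-party "ntplib" package (pip install ntplib); the
# server below is a placeholder, not necessarily your cluster's pool.
import ntplib

resp = ntplib.NTPClient().request("pool.ntp.org", version=3)
print("clock offset: %+.6f s" % resp.offset)
# 50 ms is an arbitrary illustrative threshold, not a Ceph limit:
if abs(resp.offset) > 0.05:
    print("WARNING: offset larger than expected; check NTP/chrony")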
Again, it's not all the OSDs on a node that fail; just 1 or 2 drop out.
Restarting them brings them back right away, and then within 24 hours
some other OSDs drop out.
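Until the cause is found, a small watchdog along these lines can at least
flag the dropped OSDs early. This is just a sketch; it assumes the "ceph"
CLI with cluster access and parses the JSON from "ceph osd dump":

# Sketch: list OSDs that are still "in" the cluster but marked down,
# so they can be spotted (and restarted) quickly. Fields are taken
# from the JSON output of "ceph osd dump".
import json
import subprocess

dump = json.loads(subprocess.run(["ceph", "osd", "dump", "--format", "json"],
                                 capture_output=True, text=True,
                                 check=True).stdout)
down = [o["osd"] for o in dump["osds"] if o["in"] == 1 and o["up"] == 0]
if down:
    print("down but in:", ", ".join("osd.%d" % i for i in down))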
Has anybody seen this behavior with Octopus as well?
Wido