I have a server with 12 OSDs on it. Five of them are unable to start and give the
following error message in their logs:
2020-01-28 13:00:41.760 7f61fb490c80 0 monclient: wait_auth_rotating timed out after 30
2020-01-28 13:00:41.760 7f61fb490c80 -1 osd.178 411005 unable to obtain rotating service keys; retrying
These OSDs were up and running when they suddenly died. I tried to restart
them and they failed to come up. I rebooted the node and they did not recover. All 5 died
within a few hours, and all 5 were down by the time I started investigating. I previously had
this happen with 2 other OSDs, one on each of 2 servers, each with 12 OSDs. I ended up just
purging and recreating those OSDs. I would really like to find a fix for this
problem that does not involve purging the OSDs.
I have tried stopping and starting all monitors and managers, one at a time, and all at
the same time. Additionally, all servers in the cluster have been restarted over the past
couple of days for various other reasons.
I am on Ceph 14.2.6 on Debian buster, using the Debian packages. All of my servers are
kept in time sync via NTP, and I have verified multiple times that they remain in
sync.
I have googled the error message and tried all of the solutions I found, but
nothing has made any difference.
I would appreciate any constructive advice.
Thanks.
-- ray