Release: 16.2.7 (pacific)
Infra: 4 x nodes (4 x HDD OSDs each), 3 x nodes (mon/mds, 1 x NVMe OSD each)
We recently had a couple of nodes go offline unexpectedly, triggering a rebalance
which is still ongoing.
The OSDs on the restarted nodes are marked down, and their logs keep showing
`authenticate timed out`; after a period of time they get marked `autoout`.
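For reference, this is roughly how we have been tailing the logs on the affected hosts (osd.12 here is just a placeholder for one of our down OSDs; the fsid is the one from the status below):
```
# Last 100 journal lines for one OSD, via cephadm (osd.12 is an example ID)
cephadm logs --name osd.12 -- -n 100

# Equivalent systemd unit (cephadm names units ceph-<fsid>@<daemon>)
journalctl -u ceph-d5126e5a-882e-11ec-954e-90e2baec3d2c@osd.12 -n 100
```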
We tried setting `noout` on the cluster, which has stopped them from being marked
out, but they still never authenticate.
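Setting the flags was nothing unusual (we also have `nodown` set, as shown in the status below):
```
# Prevent down OSDs from being auto-marked out while we debug
ceph osd set noout
ceph osd set nodown
# To revert once this is resolved:
#   ceph osd unset noout
#   ceph osd unset nodown
```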
We can access all the Ceph tooling from those nodes, which indicates they can reach the mons.
The node keyrings and clocks are both in sync.
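In case it helps, this is roughly how we checked those (osd.12 again being a placeholder ID):
```
# Key the mons expect for this OSD
ceph auth get osd.12

# Key the daemon actually has on disk (cephadm layout)
cat /var/lib/ceph/d5126e5a-882e-11ec-954e-90e2baec3d2c/osd.12/keyring

# Clock sync on the host
chronyc tracking    # or: timedatectl
```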
We are at a loss as to why we cannot get the OSDs to authenticate.
Any help would be appreciated.
```
  cluster:
    id:     d5126e5a-882e-11ec-954e-90e2baec3d2c
    health: HEALTH_WARN
            7 failed cephadm daemon(s)
            2 stray daemon(s) not managed by cephadm
            insufficient standby MDS daemons available
            nodown,noout flag(s) set
            8 osds down
            2 hosts (8 osds) down
            Degraded data redundancy: 195930251/392039621 objects degraded (49.977%), 160 pgs degraded, 160 pgs undersized
            2 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum ceph5,ceph7,ceph6 (age 38h)
    mgr: ceph2.tofizp(active, since 9M), standbys: ceph1.vnkagp
    mds: 3/3 daemons up
    osd: 19 osds: 11 up (since 38h), 19 in (since 45h); 5 remapped pgs
         flags nodown,noout

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 257 pgs
    objects: 102.94M objects, 67 TiB
    usage:   68 TiB used, 50 TiB / 118 TiB avail
    pgs:     195930251/392039621 objects degraded (49.977%)
             3205811/392039621 objects misplaced (0.818%)
             155 active+undersized+degraded
             97  active+clean
             3   active+undersized+degraded+remapped+backfill_wait
             2   active+undersized+degraded+remapped+backfilling

  io:
    client:   511 B/s rd, 102 KiB/s wr, 0 op/s rd, 2 op/s wr
    recovery: 13 MiB/s, 16 objects/s
```