Release: 16.2.7 (pacific)
Infra: 4 x nodes (4 x HDD OSDs each), 3 x nodes (mon/mds, 1 x NVMe OSD each)
We recently had a couple of nodes go offline unexpectedly, triggering a rebalance
which is still ongoing.
The OSDs on the restarted nodes are marked down, and their logs keep showing
`authenticate timed out`; after a period of time they get marked `autoout`.
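For reference, this is roughly how we have been tailing the logs on the affected hosts (osd.12 here is just a placeholder for one of our down OSDs; the fsid is the one from the status below):
```
# Last 100 journal lines for one OSD, via cephadm (osd.12 is an example ID)
cephadm logs --name osd.12 -- -n 100

# Equivalent systemd unit (cephadm names units ceph-<fsid>@<daemon>)
journalctl -u ceph-d5126e5a-882e-11ec-954e-90e2baec3d2c@osd.12 -n 100
```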
We tried setting `noout` on the cluster, which has stopped them from being marked
out, but they still never authenticate.
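Setting the flags was nothing unusual (we also have `nodown` set, as shown in the status below):
```
# Prevent down OSDs from being auto-marked out while we debug
ceph osd set noout
ceph osd set nodown
# To revert once this is resolved:
#   ceph osd unset noout
#   ceph osd unset nodown
```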
We can access all the Ceph tooling from those nodes, which indicates they can reach the mons.
The node keyrings and clocks are both in sync.
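In case it helps, this is roughly how we checked those (osd.12 again being a placeholder ID):
```
# Key the mons expect for this OSD
ceph auth get osd.12

# Key the daemon actually has on disk (cephadm layout)
cat /var/lib/ceph/d5126e5a-882e-11ec-954e-90e2baec3d2c/osd.12/keyring

# Clock sync on the host
chronyc tracking    # or: timedatectl
```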
We are at a loss as to why we cannot get the OSDs to authenticate.
Any help would be appreciated.
```
  cluster:
    id:     d5126e5a-882e-11ec-954e-90e2baec3d2c
    health: HEALTH_WARN
            7 failed cephadm daemon(s)
            2 stray daemon(s) not managed by cephadm
            insufficient standby MDS daemons available
            nodown,noout flag(s) set
            8 osds down
            2 hosts (8 osds) down
            Degraded data redundancy: 195930251/392039621 objects degraded (49.977%), 160 pgs degraded, 160 pgs undersized
            2 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum ceph5,ceph7,ceph6 (age 38h)
    mgr: ceph2.tofizp(active, since 9M), standbys: ceph1.vnkagp
    mds: 3/3 daemons up
    osd: 19 osds: 11 up (since 38h), 19 in (since 45h); 5 remapped pgs
         flags nodown,noout

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 257 pgs
    objects: 102.94M objects, 67 TiB
    usage:   68 TiB used, 50 TiB / 118 TiB avail
    pgs:     195930251/392039621 objects degraded (49.977%)
             3205811/392039621 objects misplaced (0.818%)
             155 active+undersized+degraded
             97  active+clean
             3   active+undersized+degraded+remapped+backfill_wait
             2   active+undersized+degraded+remapped+backfilling

  io:
    client:   511 B/s rd, 102 KiB/s wr, 0 op/s rd, 2 op/s wr
    recovery: 13 MiB/s, 16 objects/s
```