OSDs will not start - ceph-users

25 Jan 2023

Good day all,

I've an issue with a few OSDs (in two different nodes) that attempt to
start but fail / crash quite quickly. They are all LVM disks.
I've tried upgrading software, health checks on the hardware (nodes and
disks) and there doesn't seem to be any issues there.

Recently I've had a few "other" disks physically fail in the cluster and
now have one PG down which is blocking some IO on CephFS.
I've added the output of the osd journalctl and the osd log below in case
it's helpful to identify anything obvious.
I also set debug bluefs = 20 , saw this in another post.
I recently manually upgraded this node to (17.2.0) before the problem
began, later to (17.2.5). - The other osds in this node start / run fine.

The other node (15.2.17) also has a few osds that will not start and some
that run without issue.
Could anyone point me in the right direction to investigate and solve my
osd issues.

https://pastebin.com/3PkCabdf
https://pastebin.com/BT9bnhSb

Production system mainly used for CephFS
OS: Ubuntu 20.04.5 LTS
Ceph versions: 15.2.17 - Octopus (one OSD node manually upgraded to 17.2.5
- Quincy)
Erasure data pool (K=4, M=2)  - The journal's for each osd are co-located
on each drive

Kind regards
Geoffrey Rhodes