Hey all,
I haven't managed to solve this issue yet.
To simplify things, I'm looking to restart one OSD which crashes shortly
after starting.
As mentioned before I've ruled out this being related to hardware.
I'm not a dev, but looking at the log the error occurs at this point in the code:
https://github.com/ceph/ceph/blob/quincy/src/os/bluestore/BlueFS.cc#L1419
Any suggestions on a way forward would be greatly appreciated.
I tried the *ceph-bluestore-tool* to repair / fsck / etc., but all of these fail with
the same BlueFS replay error.
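For reference, the invocations I tried look roughly like this (osd.12 and its
mount path are placeholders for the affected OSD; the daemon was stopped first):

  systemctl stop ceph-osd@12
  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-12
  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-12

Both abort with the same BlueFS replay error while opening the store.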
If I could use the *ceph-objectstore-tool* to export this OSD's shard of the PG
that's down I'd try that, but it also fails with the same BlueFS replay
error.
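Something along these lines is what I had in mind (7.1as2 stands in for the
down PG's shard id on this OSD, and the paths are placeholders):

  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
      --pgid 7.1as2 --op export --file /mnt/backup/7.1as2.export

The plan would then be to import that file into a healthy OSD with --op import
so the down PG can recover, but the export itself hits the BlueFS replay error,
presumably when the tool opens the store.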
I've added the output of the osd journalctl and the osd log below in case
it's helpful to identify anything obvious.
I set debug bluefs = 20, as suggested in another post.
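For completeness, this is how I turned the logging up (osd.12 again being a
placeholder for the crashing OSD):

  ceph config set osd.12 debug_bluefs 20
  # or equivalently in /etc/ceph/ceph.conf on that node:
  #   [osd.12]
  #   debug bluefs = 20

and then restarted the OSD to capture the logs linked below.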
https://pastebin.com/3PkCabdf
https://pastebin.com/BT9bnhSb
Kind regards
Geoffrey Rhodes
On Wed, 25 Jan 2023 at 12:44, Geoffrey Rhodes <geoffrey(a)rhodes.org.za>
wrote:
Good day all,
I have an issue with a few OSDs (on two different nodes) that attempt to
start but fail / crash quite quickly. They are all LVM disks.
I've tried upgrading the software and running health checks on the hardware
(nodes and disks), and there don't seem to be any issues there.
Recently I've had a few "other" disks physically fail in the cluster, and I
now have one PG down which is blocking some IO on CephFS.
I've added the output of the osd journalctl and the osd log below in case
it's helpful to identify anything obvious.
I also set debug bluefs = 20, as suggested in another post.
I recently manually upgraded this node to 17.2.0 before the problem
began, and later to 17.2.5. The other OSDs in this node start and run fine.
The other node (15.2.17) also has a few OSDs that will not start, and some
that run without issue.
Could anyone point me in the right direction to investigate and solve my
OSD issues?
https://pastebin.com/3PkCabdf
https://pastebin.com/BT9bnhSb
Production system mainly used for CephFS
OS: Ubuntu 20.04.5 LTS
Ceph versions: 15.2.17 - Octopus (one OSD node manually upgraded to
17.2.5 - Quincy)
Erasure-coded data pool (K=4, M=2) - The journals for each OSD are co-located
on each drive
Kind regards
Geoffrey Rhodes