Hi Alison,
I have observed exactly that with OSDs "converted" from ceph-disk to
ceph-volume. Someone thought it would be a great idea to store the /dev-device name in the
config instead of the uuid or any other stable device path:
# cat /etc/ceph/osd/287-2eaf591b-bced-4097-9499-5fda071c6161.json
{
    ...
    "block": {
        "path": "/dev/disk/by-partuuid/0c8a9f89-efa7-4c75-87ad-2f0d5aa2d649",
        "uuid": "0c8a9f89-efa7-4c75-87ad-2f0d5aa2d649"
    },
    ...
    "data": {
        "path": "/dev/sdm1",
        "uuid": "2eaf591b-bced-4097-9499-5fda071c6161"
    },
    ...
}
Funnily enough, the UUID is stored right next to it, but the /dev path is what is
actually used during activation. My "fix" is to regenerate the OSD JSON file just before
every start of a ceph-disk OSD.
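A minimal sketch of how one could detect such a stale entry before starting the OSD. It is not part of ceph-volume; the function name stale_data_path is made up for illustration. It assumes the JSON layout shown above ("data" with "path" and "uuid") and, as a further assumption, that the data UUID shows up as a symlink under /dev/disk/by-partuuid, the same way the "block" entry does:

```python
import json
import os


def stale_data_path(osd_json_path, by_partuuid_dir="/dev/disk/by-partuuid"):
    """Return (stored, actual) if the stored data path is stale, else None.

    Assumption: the "uuid" of the "data" entry exists as a symlink in
    by_partuuid_dir and points at the partition that really holds the data.
    """
    with open(osd_json_path) as fh:
        meta = json.load(fh)

    stored = meta["data"]["path"]
    link = os.path.join(by_partuuid_dir, meta["data"]["uuid"])
    if not os.path.exists(link):
        # No stable symlink to compare against; let the caller decide.
        raise FileNotFoundError(link)

    # Resolve both sides so that /dev/sdX1 and the symlink compare equal
    # when they refer to the same device node.
    actual = os.path.realpath(link)
    if os.path.realpath(stored) == actual:
        return None
    return stored, actual
```

A non-None result means the kernel name in the JSON file no longer matches the partition with that UUID, which is the situation in which regenerating the file helps.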
You seem to be using LVM OSDs already, so this is a bit weird (it can't be the exact same
issue). Still, I would not be surprised if you are bitten by something similar: some
stored config (cache) that overrides the actual drive location. It is really a blessing
that the developers implemented a check that a partition actually points to the data with
the correct OSD ID, otherwise our cluster would be wrecked by now.
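To look for this kind of mismatch across a whole host, one could scan the output of "ceph device ls-by-host" for kernel names that are reported for more than one physical device. A rough sketch, assuming the three-column layout of the lines quoted below (device ID, kernel name, daemon); the function name shared_dev_names is hypothetical:

```python
from collections import defaultdict


def shared_dev_names(listing):
    """Map kernel device name -> [(device_id, daemon), ...] for every name
    that is reported for more than one distinct device ID."""
    by_dev = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines and a header row, if the output has one.
        if len(fields) < 3 or fields[0] == "DEVICE":
            continue
        device_id, dev_name, daemon = fields[0], fields[1], fields[2]
        by_dev[dev_name].append((device_id, daemon))
    return {
        dev: entries
        for dev, entries in by_dev.items()
        if len({device_id for device_id, _ in entries}) > 1
    }
```

Two different serial numbers behind the same kernel name on one host cannot both be correct, so every entry returned is a candidate for a stale record.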
I would start by using low-level commands (ceph-volume) directly to see if the issue is
low-level or sits in some higher-level interface. Log in to the OSD node and check what
"ceph-volume inventory" says and whether you can manually activate/deactivate the OSD
on disk (be careful to include the --no-systemd option everywhere to avoid unintended
changes to persistent configuration).
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: apeisker(a)fnal.gov <apeisker(a)fnal.gov>
Sent: Friday, August 25, 2023 10:29 PM
To: ceph-users(a)ceph.io
Subject: [ceph-users] Re: A couple OSDs not starting after host reboot
Hi,
Thank you for your reply. I don’t think the device names changed, but ceph seems to be
confused about which device the OSD is on. It’s reporting that there are 2 OSDs on the
same device although this is not true.
ceph device ls-by-host <osd-node> | grep sdu
ATA_HGST_HUH728080ALN600_VJH4GLUX sdu osd.665
ATA_HGST_HUH728080ALN600_VJH60MAX sdu osd.657
The osd.665 is actually on device sdm. Could this be the cause of the issue? Is there a
way to correct it?
Thanks,
Alison