Hi,
I recently restarted a storage node in our Ceph cluster and had an
issue bringing one of the OSDs back online. This storage node has
multiple HDDs, each serving as a dedicated OSD for a data pool, and a
single NVMe drive with an LVM partition assigned as an OSD in a
metadata pool.
After rebooting the host, the OSD backed by the LVM partition did not
restart. When I try to start the OSD manually with systemctl, I can
follow the launch of a podman container and see an error message
before the container shuts down again:
Sep 23 14:02:06 X bash[30318]: Running command:
/usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev
/dev/boot/cephfs_meta --path /var/lib/ceph/osd/ceph-165
--no-mon-config
Sep 23 14:02:06 X bash[30318]: stderr: failed to read label for
/dev/boot/cephfs_meta: (2) No such file or directory
Sep 23 14:02:06 X bash[30318]: --> RuntimeError: command returned
non-zero exit status: 1
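For completeness, the start command I am using is along these lines
(the <fsid> placeholder stands in for our actual cluster fsid; the
unit name follows the usual cephadm naming convention):
  sudo systemctl start ceph-<fsid>@osd.165.service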
From the host itself, the device looks fine:
1. The /dev/boot/cephfs_meta symlink exists and points to the device
../dm-3
2. `lsblk` shows the LVM partition 'boot-cephfs_meta' under nvme0n1p3
3. `sudo lvscan --all` shows it as activated:
` ACTIVE '/dev/boot/cephfs_meta' [3.42 TiB] inherit`
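Since the error is about failing to read the label, I could presumably
also test reading the BlueStore label directly on the host, assuming
ceph-bluestore-tool is installed outside the container; a sketch:
  sudo ceph-bluestore-tool show-label --dev /dev/boot/cephfs_meta
If that succeeds on the host, the device is readable there and the
failure would seem specific to the container environment.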
This is on a CentOS 8 system, with ceph version 15.2.1
(9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable).
Related issues I have found include:
1. https://github.com/rook/rook/issues/2591
2. https://github.com/rook/rook/issues/3289
The suggested fix in both of those issues was to install the LVM2
package. I did that with `sudo dnf install lvm2`, then rebooted the
system and restarted the container, but this did not resolve the
problem for the LVM-partition-based OSD.
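Since ceph-volume is running inside the podman container, I also
wonder whether the device node is visible from in there at all; if I
understand `cephadm shell` correctly, a check along these lines would
tell (sketch, untested):
  sudo cephadm shell -- ls -l /dev/boot/cephfs_meta /dev/dm-3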
This LVM-based OSD was initially created with a `ceph-volume` command:
`ceph-volume lvm create --bluestore --data /dev/sd<x> --block.db /dev/nvme0n1<partition-nr>`
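One workaround I have been considering (but have not tried) is to
re-activate the OSD manually with ceph-volume; my understanding is
that under cephadm this would be roughly:
  sudo cephadm ceph-volume -- lvm list
  sudo cephadm ceph-volume -- lvm activate 165 <osd-fsid>
where <osd-fsid> is the per-OSD fsid reported by `lvm list`. I am not
certain this is the right invocation for a containerized deployment.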
Is there a workaround for this problem, where the container process
is unable to read the label of the LVM partition and therefore fails
to start the OSD?
Thanks,
Matt
--
Matt Larson, PhD
Madison, WI 53705 U.S.A.