I have a cluster (14.2.11) with BlueStore OSDs that have the block.db on an SSD partition
separate from the primary OSD device. On some of my storage servers, the OSD
processes fail to start at boot because the permissions on the block.db device are
not being changed from root:root, or are being reset by udev after ceph-volume-systemd
has run successfully. The problem only occurs on a couple of the storage servers, even
though all of them are configured identically and run the same software versions.
I suspect a race condition or a conflict among the udev rules, but I have not been able
to pin down where the problem lies, and udev is a complete nightmare to debug and
diagnose.
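If a udev rule really is resetting the ownership, one way to fight back at the same layer is a custom rule that re-asserts ceph:ceph whenever udev (re)processes the DB partition. This is only a sketch, not something I have deployed: the file name is my choice and the PARTUUID is a placeholder you would fill in from `blkid` on the affected host. Running `udevadm test /sys/class/block/sdX1` on a bad host can also show which rules match and what OWNER/GROUP they set.

```
# /etc/udev/rules.d/99-ceph-blockdb.rules  (hypothetical file name)
# Match the DB partition by its GPT partition UUID, taken from
# "blkid -o value -s PARTUUID /dev/sdX1", and force ceph ownership
# every time udev processes the device.
ENV{ID_PART_ENTRY_UUID}=="replace-with-your-partuuid", OWNER="ceph", GROUP="ceph", MODE="0660"
```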
One workaround is to update /usr/lib/ceph/ceph-osd-prestart.sh so that it checks
(and corrects) the permissions on the block.db device before the OSD starts. That
script looks like it hasn't been updated to support BlueStore, so I
added some lines to address the problem, which works for me.
Has anyone else seen a similar issue and found a different solution?
Here is the code I added to the ceph-osd-prestart.sh script:
...
blockdb="$data/block.db"
if [ -L "$blockdb" ] && [ -e "$blockdb" ]; then
    dev_db=$(readlink -f "$blockdb")
    owner=$(stat -c %U "$dev_db")
    if [ "$owner" != 'ceph' ]; then
        echo "ceph-osd(${cluster:-ceph}-$id): bluestore DB ($dev_db) has incorrect permissions, fixing." 1>&2
        chown ceph:ceph "$dev_db"
    fi
fi
...
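Another angle I considered, sketch only and untested here, is a systemd drop-in for ceph-osd@.service, which avoids patching the packaged prestart script (a package update would overwrite the patch). The "+" prefix (systemd >= 231) runs the command as root even though the unit drops to the ceph user, and "-" tolerates failure on OSDs that have no block.db; chown without -h follows the symlink and changes the underlying device node.

```
# /etc/systemd/system/ceph-osd@.service.d/fix-blockdb-perms.conf
# (hypothetical drop-in; %i is the OSD id)
[Service]
ExecStartPre=+-/bin/chown ceph:ceph /var/lib/ceph/osd/ceph-%i/block.db
```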
thanks,
Wyllys Ingersoll