Greetings community,
We have a setup comprising 6 servers running a CentOS 8 minimal installation with Ceph
Quincy version 18.2.2, backed by 20 Gbps fiber-optic NICs and dual Intel Xeon processors.
The cluster was bootstrapped on the first node and then expanded to the others using
cephadm, with monitor daemons deployed on 5 of the nodes and manager daemons on 3. Each
server has an NVMe boot disk as well as a 1 TB SATA SSD on which the OSD is deployed. An
EC profile was created with k=3 and m=3, serving a CephFS filesystem on top with NFS
exports to other servers. Up to this point the setup was quite stable, in the sense that
after an emergency reboot or a network connection failure the OSDs did not fail and
started normally again after reboot.
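For reference, the EC profile and filesystem were created roughly along the following
lines (profile and pool names here are placeholders, not necessarily the exact ones we
used):
ceph osd erasure-code-profile set ec33 k=3 m=3 crush-failure-domain=host
ceph osd pool create cephfs_data erasure ec33
ceph osd pool set cephfs_data allow_ec_overwrites true
ceph osd pool create cephfs_metadata replicated
ceph fs new cephfs cephfs_metadata cephfs_data --force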
At a certain point in our project we needed to enable the multipathd service, adding the
boot drive and the Ceph SSD to its blacklist so that they would not be claimed and set up
as mpath devices. The blacklist entries look like this:
boot blacklist:
===============
blacklist {
wwid "eui.<drive_id>"
}
SATA SSD blacklist:
===================
blacklist {
wwid "naa.<drive_id>"
}
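For completeness, after editing /etc/multipath.conf the blacklist can be sanity-checked
with something along these lines (a sketch, not the exact commands from our notes):
systemctl reload multipathd
multipathd show config | grep -A 3 blacklist
multipath -ll    # should not list the boot NVMe or the OSD SSD as mpath devices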
The above blacklist configuration ensures that both the boot disk and Ceph's OSD function
properly; the lsblk output looks as follows:
NAME                                  MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                                     8:0    0 894.3G  0 disk
└─ceph--<id>-osd--block--<block_id>   252:3    0 894.3G  0 lvm
nvme0n1                               259:0    0 238.5G  0 disk
├─nvme0n1p1                           259:1    0   600M  0 part /boot/efi
├─nvme0n1p2                           259:2    0     1G  0 part /boot
└─nvme0n1p3                           259:3    0 236.9G  0 part
  ├─centos-root                       252:0    0   170G  0 lvm  /
  ├─centos-swap                       252:1    0  23.4G  0 lvm  [SWAP]
  ├─centos-var_log_audit              252:2    0   7.5G  0 lvm  /var/log/audit
  ├─centos-home                       252:4    0    26G  0 lvm  /home
  └─centos-var_log                    252:5    0    10G  0 lvm  /var/log
In addition to the above multipathd configuration, we have use_devicesfile=1 in
/etc/lvm/lvm.conf, with the /etc/lvm/devices/system.devices file looking like the
following; the PVID values are taken from the output of the pvdisplay command and the
IDNAME values from the output of "ls -lha /dev/disk/by-id":
VERSION=1.1.1
IDTYPE=sys_wwid IDNAME=eui.<drive_id> DEVNAME=/dev/nvme0n1p3 PVID=<pvid> PART=3
IDTYPE=sys_wwid IDNAME=naa.<drive_id> DEVNAME=/dev/sda PVID=<pvid>
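As a side note, entries like these can also be generated by LVM itself instead of being
edited by hand; a minimal sketch, assuming the same two devices:
lvmdevices --adddev /dev/nvme0n1p3
lvmdevices --adddev /dev/sda
lvmdevices    # lists the resulting system.devices entries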
Issues started when we performed certain tests of the system's integrity, the most
important being an emergency shutdown and reboot of all the nodes. Afterwards the OSDs
are not started automatically and their respective LVM volumes do not show up (except on
a single node, for some reason), so the lsblk output changes to the snippet below. We
then have to reboot the nodes one by one until all the OSDs are back online:
NAME                      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                         8:0    0 894.3G  0 disk
nvme0n1                   259:0    0 238.5G  0 disk
├─nvme0n1p1               259:1    0   600M  0 part /boot/efi
├─nvme0n1p2               259:2    0     1G  0 part /boot
└─nvme0n1p3               259:3    0 236.9G  0 part
  ├─centos-root           252:0    0   170G  0 lvm  /
  ├─centos-swap           252:1    0  23.4G  0 lvm  [SWAP]
  ├─centos-var_log_audit  252:2    0   7.5G  0 lvm  /var/log/audit
  ├─centos-home           252:4    0    26G  0 lvm  /home
  └─centos-var_log        252:5    0    10G  0 lvm  /var/log
Without the LVM configuration and without the multipathd service enabled everything works
fine; this behavior only started after those changes. Attempting to restart the OSDs from
a manager node with "ceph orch daemon restart osd.n" leaves them in an error state, and
even when starting the OSD manually on each node via
"bash /var/lib/ceph/<fsid>/osd.0/unit.run" we receive the following error:
--> Failed to activate via raw: did not find any matching OSD to activate
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev
/dev/ceph-<id>/osd-block-<block_id> --path /var/lib/ceph/osd/ceph-0
--no-mon-config
stderr: failed to read label for /dev/ceph-<id>/osd-block-<block_id>: (2) No
such file or directory
2024-03-30T12:42:54.014+0000 7f845296a980 -1
bluestore(/dev/ceph-<id>/osd-block-<block_id>) _read_bdev_label failed to open
/dev/ceph-<id>/osd-block-<block_id>: (2) No such file or directory
--> Failed to activate via LVM: command returned non-zero exit status: 1
--> Failed to activate via simple: 'Namespace' object has no attribute
'json_config'
--> Failed to activate any OSD(s)
For comparison, a successful run of the same command produces the following output:
/bin/bash /var/lib/ceph/<fsid>/osd.0/unit.run
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0
Running command: /usr/bin/ceph-bluestore-tool prime-osd-dir --path
/var/lib/ceph/osd/ceph-0 --no-mon-config --dev
/dev/mapper/ceph--<id>-osd--block--<block_id>
Running command: /usr/bin/chown -h ceph:ceph
/dev/mapper/ceph--<id>-osd--block--<block_id>
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-5
Running command: /usr/bin/ln -s /dev/mapper/ceph--<id>-osd--block--<block_id>
/var/lib/ceph/osd/ceph-0/block
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0
--> ceph-volume raw activate successful for osd ID: 0
ceph-<fsid>-osd-0
4361e2f166bcdeee6e9020dcbb153d3d7eec04e71d5b0b250440d4a3a0833f2c
It seems to us as if, in the failure case, the OSD logical volume is not even detected by
the device-mapper at boot, which is odd; it also does not appear in the output of
"dmsetup ls" on the affected nodes. What could we be missing here? What could be the
conflict between the Ceph OSDs and the multipathd service, or the LVM configuration?
Should the system.devices entries be different from what we set? Is the multipath
blacklist configuration missing something? We have been doing trial-and-error experiments
for more than a week now and have gone through the lvm2 and multipathd logs (we can
provide them on request), but to no avail: nothing indicates any error, just normal logs
with the only difference being the missing Ceph OSD LVM volume.
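For reference, the kind of checks run on an affected node after a failed boot look
roughly like the following (a sketch; the OSD VG name is a placeholder):
dmsetup ls                      # the ceph OSD mapping is missing here
pvs; vgs; lvs                   # check whether LVM sees the PV/VG/LV on /dev/sda at all
vgchange -ay ceph-<id>          # manually activate the OSD volume group
journalctl -b -u multipathd -u lvm2-monitor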
Best regards