Ceph-ansible: add a new HDD to an already provisioned WAL device - ceph-users

17 Jan 2023

Hello all,

we’ve set up a new Ceph cluster with a number of nodes which are all identically
configured.
There is one device vda which should act as WAL device for all other devices.
Additionally, there are four other devices vdb, vdc, vdd, vde which use vda as WAL.
The whole cluster was set up using ceph-ansible (branch stable-7.0) and Ceph version
17.2.0.
Device configuration in osds.yml looks as follows:
   devices: [/dev/vdb, /dev/vdc, /dev/vdd, /dev/vde]
   bluestore_wal_devices: [/dev/vda]
As expected vda contains four logical volumes for WAL each 1/4 of the overall vda disk
size (‘ceph-ansible/group_vars/all.yml’ has default ‘block_db_size: -1’).

After the initial setup, we’ve added an additional device vdf which should become a new
OSD. The new OSD should use vda for WAL as well. This means the previous four WAL LVs have
to be resized down to 1/5 and a new LV has to be added.

Is it possible to retroactively add a new device to an already provisioned WAL device?

We suspect that this is not possible because the ceph-bluestore-tool does not provide any
way to shrink an existing BlueFS device. Only expanding is currently possible
(https://docs.ceph.com/en/quincy/man/8/ceph-bluestore-tool/).
Simply adding the new device to the devices list and rerunning the playbook does nothing.
And so does only setting “devices: [/dev/vdf]” and “bluestore_wal_devices: [/dev/vda]”. In
both cases vda is rejected because “Insufficient space (<10 extents) on vgs” which
makes sense because vda is already fully used by the previous four OSD WALs.

Thanks for the help and kind regards.

Additional notes:
- We’re testing pre-production on an emulated cluster hence the device names vdx and
unusually small device sizes.
- The output of `lsblk` after the initial setup looks as follows:
```
vda                                                                                       
           252:0    0    8G  0 disk 
├─ceph--36607c7f--e51c--452e--a44a--225d8d0b0aa8-osd--wal--3677c354--8d7d--4db9--a2b7--68aeb8248d40
  253:2    0    2G  0 lvm  
├─ceph--36607c7f--e51c--452e--a44a--225d8d0b0aa8-osd--wal--52d71122--b573--4077--9633--968c178612fd
  253:4    0    2G  0 lvm  
├─ceph--36607c7f--e51c--452e--a44a--225d8d0b0aa8-osd--wal--2d7eb467--cfb1--4a00--8a45--273932036599
  253:6    0    2G  0 lvm  
└─ceph--36607c7f--e51c--452e--a44a--225d8d0b0aa8-osd--wal--d7b13b79--219c--4002--9e92--370dff7a5376
  253:8    0    2G  0 lvm  
vdb                                                                                       
           252:16   0    8G  0 disk 
└─ceph--49ddaa8b--5d8f--4267--85f9--5cac608ce53d-osd--block--861a53c7--ee57--4c5f--9546--1dd7cb0185ef
253:1    0    8G  0 lvm  
vdc                                                                                       
           252:32   0    5G  0 disk 
└─ceph--1ed9ee91--e071--4ea4--9703--d56d84d9ae0a-osd--block--8aacb66a--e29b--4b7a--8ad5--a9fb1f81c6d6
253:3    0    5G  0 lvm  
vdd                                                                                       
           252:48   0    5G  0 disk 
└─ceph--554cdd8b--e722--41a9--8f64--c09c857cc0dc-osd--block--4dee3e1b--b50d--4154--b2ff--80cadb67e2a0
253:5    0    5G  0 lvm  
vde                                                                                       
           252:64   0    5G  0 disk 
└─ceph--5d58de32--ca55--4895--8ac7--af94ee07672e-osd--block--3f563f40--0c1e--4cca--9325--d9534cceb711
253:7    0    5G  0 lvm  
vdf                                                                                       
           252:80   0    5G  0 disk
```
- Ceph status is happy and healthy:
```
  cluster:
    id:     ff043ce8-xxxx-xxxx-xxxx-e98d073c9d09
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim

  services:
    mon: 3 daemons, quorum baloo-1,baloo-2,baloo-3 (age 13m)
    mgr: baloo-2(active, since 5m), standbys: baloo-3, baloo-1
    mds: 1/1 daemons up, 1 standby
    osd: 24 osds: 24 up (since 4m), 24 in (since 5m)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   7 pools, 177 pgs
    objects: 213 objects, 584 KiB
    usage:   98 MiB used, 138 GiB / 138 GiB avail
    pgs:     177 active+clean
```