In case anyone was wondering, I figured out the problem. It is this nasty bug
in Pacific 16.2.10:
https://tracker.ceph.com/issues/56031 - I believe it is
fixed in the upcoming 16.2.11 release and in Quincy.
This bug causes the computed maximum size of the bluestore DB partition to be much
smaller than it should be, so if you request a reasonable size that is larger than the
incorrectly computed maximum, the DB creation fails.
Our problem was that we added 3 new SSDs that were considered "unused" by the
system, giving us a total of 8 (5 used, 3 unused). When the orchestrator issues a
"ceph-volume lvm batch" command, it passes 40 data devices and 8 db devices.
Normally, you would expect it to divide them into 5 slots per DB device (40/8). But when
it computes the size of the slots, that is where the problem occurs.
ceph-volume first sees the 3 unused devices in a group and incorrectly decides that the
number of slots needed is 3 * 5 = 15, then divides the size of a single DB device by 15,
making the maximum DB size 3x smaller than it should be. If the code had used the
combined size of all the devices in the group when computing the maximum, it would have
been fine, but it only accounts for the size of the first DB device in the group.
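To make the arithmetic concrete, here is a minimal sketch of the computation (the sizes are illustrative, based on our 1.7TB SSDs and the 40-data / 8-db layout):

```python
# Illustrative numbers: one 1.7 TB DB SSD, 40 data devices across 8 DB devices.
db_dev_size_gb = 1700
slots_per_db_dev = 40 // 8      # 5 slots per DB device
unused_in_group = 3             # the 3 new "unused" SSDs grouped together

# Buggy (16.2.10): slot count is multiplied by the group size, but only
# ONE device's size is divided by it.
buggy_max_gb = db_dev_size_gb / (slots_per_db_dev * unused_in_group)

# Correct: one device's size divided by its own slot count.
correct_max_gb = db_dev_size_gb / slots_per_db_dev

print(round(buggy_max_gb), correct_max_gb)   # 113 vs 340.0
```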
The workaround is to trick Ceph into treating each DB device as its own group of 1 by
putting a minimal VG with a unique name on each of the unused SSDs. When ceph-volume
then computes the sizing, it sees groups of 1 and doesn't multiply the slot count
incorrectly. I used "vgcreate bug1 -s 1M /dev/xyz" to create a bogus VG on each of the
unused SSDs, and now I have properly sized DB devices on the new SSDs (the "bugX" VGs
can be removed once there are legitimate DB VGs on the device).
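Scripted, the workaround looks like the loop below. The device names are placeholders for your actual unused SSDs, and the commands are only echoed as a dry run - drop the "echo" to really create the VGs:

```shell
# Placeholder device names - substitute your actual unused SSDs.
# 'echo' makes this a dry run; remove it to actually create the VGs.
i=1
for dev in /dev/sdx /dev/sdy /dev/sdz; do
    echo vgcreate "bug$i" -s 1M "$dev"
    i=$((i + 1))
done
# Later, once legitimate DB VGs exist on a device:
#   vgremove bugN
```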
Question - Because our cluster was initially laid out using the buggy ceph-volume
(16.2.10), we now have hundreds of DB devices that are far smaller than they should be
(far less than the recommended 1-4% of the data device size). Is it possible to resize
the DB devices without destroying and recreating the OSD itself?
What are the implications of having bluestore DB devices that are far smaller than they
should be?
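(For reference, the approach I would try first - untested, and I'd welcome corrections: grow the DB LV with lvextend, then tell BlueFS about the new space with ceph-bluestore-tool. The VG/LV names and OSD id below are placeholders, this assumes free extents exist in the VG, and the commands are echoed as a dry run.)

```shell
# Dry run (echo): grow the DB LV, then expand BlueFS into the new space.
# Placeholders: VG ceph-db-vg, LV db-lv, OSD id 12. The OSD must be stopped,
# and with cephadm the OSD data path may differ from the one shown here.
OSD_ID=12
echo systemctl stop ceph-osd@"$OSD_ID"
echo lvextend -L 300G /dev/ceph-db-vg/db-lv
echo ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-"$OSD_ID"
echo systemctl start ceph-osd@"$OSD_ID"
```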
thanks,
Wyllys Ingersoll
________________________________
From: Wyll Ingersoll <wyllys.ingersoll(a)keepertech.com>
Sent: Friday, January 13, 2023 4:35 PM
To: ceph-users(a)ceph.io <ceph-users(a)ceph.io>
Subject: [ceph-users] ceph orch osd spec questions
Ceph Pacific 16.2.9
We have a storage server with multiple 1.7TB SSDs dedicated to the bluestore DB usage.
The osd spec originally was misconfigured slightly and had set the "limit"
parameter on the db_devices to 5 (there are 8 SSDs available) and did not specify a
block_db_size. Ceph laid out the original 40 OSDs and put 8 DBs on each of 5 of the SSDs
(because of the limit param). Ceph seems to have auto-sized the bluestore DB partitions to be
about 45GB, which is far less than the recommended 1-4% (using 10TB drives). How does
ceph-volume determine the size of the bluestore DB/WAL partitions when it is not specified
in the spec?
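For comparison, the commonly cited 1-4% guideline works out to the following for our 10TB data drives (quick back-of-the-envelope, using decimal GB):

```python
# 1-4% of a 10 TB data device, in GB (taking 1 TB = 1000 GB).
data_gb = 10 * 1000
low_gb = data_gb * 0.01     # lower end of the guideline: 100 GB
high_gb = data_gb * 0.04    # upper end of the guideline: 400 GB
observed_gb = 45            # what ceph-volume actually allocated

print(low_gb, high_gb, observed_gb < low_gb)   # 100.0 400.0 True
```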
We updated the spec and specified a block_db_size of 300G and removed the
"limit" value. Now we can see in the cephadm.log that the ceph-volume command
being issued is using the correct list of SSD devices (all 8) as options to the lvm batch
(--db-devices ...), but it keeps failing to create the new OSD because we are asking for
300G and it thinks there is only 44G available even though the last 3 SSDs in the list are
empty (1.7T). So, it appears that somehow the orchestrator is ignoring the last 3 SSDs.
I have verified that these SSDs are wiped clean, have no partitions or LVM, and no label
(sgdisk -Z, wipefs -a). They appear as available in the inventory and not locked or
otherwise in use.
Also, the "db_slots" spec parameter is ignored in Pacific due to a bug, so there is no
way to tell the orchestrator how many DB slots to use. Adding "block_db_slots" to the
spec alongside "block_db_size" fails since it is not recognized.
Any help figuring out why these SSDs are being ignored would be much appreciated.
Our spec for this host looks like this:
---
spec:
  data_devices:
    rotational: 1
    size: '3TB:'
  db_devices:
    rotational: 0
    size: ':2T'
    vendor: 'SEAGATE'
  block_db_size: 300G
---
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io