Ah, our old friend the P5316.
A few things to remember about these:
* 64KB IU means that you'll burn through endurance if you do a lot of writes smaller
than that. The firmware will try to coalesce smaller writes, especially if they're
sequential. You probably want to keep your RGW / CephFS index / metadata pools on other
media.
* With Quincy or later and a reasonably recent kernel you can set
bluestore_use_optimal_io_size_for_min_alloc_size to true and OSDs deployed on these should
automatically be created with a 64KB min_alloc_size. If you're writing a lot of
objects smaller than, say, 256KB -- especially if using EC -- a more nuanced approach may
be warranted. ISTR that your data are large sequential files, so probably you can exploit
this. For sure you want these OSDs to not have the default 4KB min_alloc_size; that would
result in lowered write performance and especially endurance burn. The min_alloc_size
cannot be changed after an OSD is created; instead one would need to destroy and
recreate.
cf.
https://github.com/ceph/ceph/pulls?q=is%3Apr+author%3Acurtbruns
https://www.youtube.com/watch?v=w91e0EjWD6E
Optimizing RGW Object Storage Mixed Media through Storage Classes and Lua Scripting
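To put rough numbers on the IU point above, here's a minimal shell sketch. It assumes the worst case, where the firmware can't coalesce anything and every write smaller than the IU costs a full IU of NAND writes. The `worst_case_waf` helper and the `DRY_RUN` wrapper are made up for illustration; only the `ceph config set` option name comes from the discussion above.

```shell
#!/usr/bin/env bash
# Sketch only: prints commands by default; set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Worst-case NAND bytes written per host byte (integer ratio, rounded down)
# for a write of $1 bytes on a drive whose indirection unit is $2 bytes.
# Assumption: no coalescing, so each write is rounded up to whole IUs.
worst_case_waf() {
    local write_size=$1 iu=$2
    local nand_bytes=$(( (write_size + iu - 1) / iu * iu ))
    echo $(( nand_bytes / write_size ))
}

worst_case_waf 4096 65536     # 4KB write on a 64KB IU: prints 16
worst_case_waf 65536 65536    # IU-sized write: prints 1

# Let newly created OSDs derive min_alloc_size from the device's optimal I/O size.
run ceph config set osd bluestore_use_optimal_io_size_for_min_alloc_size true
```

The option only affects OSDs created after it is set. As far as I understand, the value comes from what the kernel reports in `/sys/block/<dev>/queue/optimal_io_size`, so it's worth checking that file on one of these drives before deploying.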
On Oct 24, 2023, at 11:42, Matt Larson
<larsonmattr(a)gmail.com> wrote:
I am looking to create a new pool that would be backed by a particular set
of drives that are larger nVME SSDs (Intel SSDPF2NV153TZ, 15TB drives).
Particularly, I am wondering about what is the best way to move devices
from one pool and to direct them to be used in a new pool to be created. In
this case, the documentation suggests I may want to assign them to a new
device-class and have a placement rule that targets that device-class in
the new pool.
If you're using cephadm / ceph orch you can craft an OSD spec that uses or ignores
drives based on size or model.
Multiple pools can share OSDs, though for your use case you probably don't want them to.
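A minimal sketch of such an OSD spec, written out from the shell. The `service_id`, the file name, the `host_pattern`, and the `qlc` class are my assumptions; the model string is the one from your message, and you should compare it against what `ceph orch device ls` reports for these drives.

```shell
#!/usr/bin/env bash
# Sketch only: writes a spec file and prints the command to preview it.
spec_file="${SPEC_FILE:-osd_spec_big_nvme.yaml}"

cat > "$spec_file" <<'EOF'
service_type: osd
service_id: big_nvme            # hypothetical name
placement:
  host_pattern: '*'             # assumption: all hosts; narrow as needed
spec:
  data_devices:
    model: SSDPF2NV153TZ        # model string from the question
    # size: '10TB:'             # alternative filter: drives of 10TB and up
  crush_device_class: qlc       # assumption: custom class set at OSD creation
EOF

echo "Preview with: ceph orch apply -i $spec_file --dry-run"
```

A spec like this governs OSDs that cephadm creates from then on. OSDs that already exist keep whatever device class they have until you change it by hand.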
Currently the Ceph cluster has two device classes 'hdd' and 'ssd', and the
larger 15TB drives were automatically assigned to the 'ssd' device class
that is in use by a different pool. The `ssd` device class is used in a
placement rule targeting that class.
The names of device classes are actually semi-arbitrary. The above distinction is made on
the basis of whether or not the kernel believes a given device to rotate.
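If you're curious what the kernel believes about each drive, the flag lives in sysfs. A small sketch follows; the `default_class_for` helper is made up, and it simplifies what the OSD does at startup down to the rotational flag alone.

```shell
#!/usr/bin/env bash
# Map the kernel's rotational flag (1 = rotating, 0 = not) to a class name.
# Simplification: only the flag is considered here.
default_class_for() {
    local flag_file=$1
    if [ "$(cat "$flag_file")" = "1" ]; then
        echo hdd
    else
        echo ssd
    fi
}

# Report every block device on this host that exposes the flag.
for f in /sys/block/*/queue/rotational; do
    [ -r "$f" ] || continue
    echo "$f -> $(default_class_for "$f")"
done
```

This is why a 15TB NVMe drive lands in 'ssd' alongside much smaller SATA SSDs: neither rotates.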
The documentation describes that I could set a device class for an OSD with
a command like:
`ceph osd crush set-device-class CLASS OSD_ID [OSD_ID ..]`
Class names can be arbitrary strings like "big_nvme" or "qlc".
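Once at least one OSD carries the new class, a rule and pool that target it could look roughly like this. The rule name, pool name, CRUSH root, and failure domain are all placeholders I picked; substitute your own. I'd expect the rule creation to fail if no OSD has the class yet, so do the reclassification first.

```shell
#!/usr/bin/env bash
# Sketch only: prints commands by default; set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

new_class="qlc"           # class name, as suggested above
rule_name="qlc_rule"      # hypothetical rule name
pool_name="bulk_qlc"      # hypothetical pool name
crush_root="default"      # assumption: the usual default root
failure_domain="host"     # assumption: host-level failure domain

# Replicated rule restricted to OSDs of the new device class.
run ceph osd crush rule create-replicated "$rule_name" "$crush_root" "$failure_domain" "$new_class"

# Create the pool and point it at the rule.
run ceph osd pool create "$pool_name"
run ceph osd pool set "$pool_name" crush_rule "$rule_name"
```

For an EC pool the device class goes into the erasure-code profile instead (`crush-device-class=...`), and the rule is generated from the profile.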
Before setting a new
device class on an OSD that already has an assigned device class, I should
use `ceph osd crush rm-device-class osd.XX`.
Yep. I suspect that's a guardrail to prevent inadvertently trampling an existing assignment.
Can I proceed to directly remove these OSDs from the current device class
and assign to a new device class?
Carpe NAND!
Should they be moved one by one? What is
the way to safely protect data from the existing pool that they are mapped
to?
Are there other SSDs in said existing pool? If you reassign all of these, will there be
enough survivors to meet replication policy and hold all the data?
One by one would be safe. Doing more than one might be faster and more efficient,
depending on your hardware and topology. For sure you don't want to reassign OSDs in more
than one CRUSH failure domain at a time (host, rack, depends on your setup). If your
topology, RAM, and clients are amenable, you could do all OSDs in a single failure domain
at once, then proceed to the next only after all PGs are active+clean.
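A rough sketch of doing one failure domain's worth in a batch. The OSD IDs and the `qlc` class are placeholders, and the `reclass_osds` helper and `DRY_RUN` wrapper are made up for illustration; the two `ceph osd crush` subcommands are the ones discussed above.

```shell
#!/usr/bin/env bash
# Sketch only: prints commands by default; set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Move the listed OSDs to class $1: clear the old class, then set the new one.
reclass_osds() {
    local new_class=$1
    shift
    local osd
    for osd in "$@"; do
        run ceph osd crush rm-device-class "$osd"
        run ceph osd crush set-device-class "$new_class" "$osd"
    done
}

# Hypothetical example: the three OSDs on one host, i.e. one failure domain.
reclass_osds qlc osd.10 osd.11 osd.12

# Check PG state; move to the next failure domain only once all PGs
# report active+clean.
run ceph pg stat
```

Running it with the default `DRY_RUN=1` first lets you read the exact commands before anything touches the CRUSH map.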
Thanks,
Matt
--
Matt Larson, PhD
Madison, WI 53705 U.S.A.
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io