Do you have any data on the reliability of QLC NVMe drives?
They were my job for a year, so yes, I do. The published specs are accurate. A QLC drive
built from the same NAND as a TLC drive will have more capacity, but less endurance.
Depending on the model, you may wish to enable
`bluestore_use_optimal_io_size_for_min_alloc_size` when creating your OSDs. The
Intel/Solidigm P5316, for example, has a 64KB IU size, so performance and endurance will
benefit from aligning the OSD `min_alloc_size` to that value. Note that this is baked in at
creation; you cannot change it on a given OSD after the fact, but you can redeploy the OSD
and let it recover.
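A minimal sketch of why that alignment matters (the 64KB IU figure is the P5316's; the function is illustrative only, not Ceph code):

```python
# Sketch: count how many indirection units (IUs) a single write touches.
# An allocation size aligned to the IU keeps writes from straddling IU
# boundaries, which would otherwise force read-modify-write inside the drive.

IU = 64 * 1024  # P5316 indirection unit, in bytes

def ius_touched(offset: int, length: int, iu: int = IU) -> int:
    """Number of IUs spanned by a write at (offset, length)."""
    first = offset // iu
    last = (offset + length - 1) // iu
    return last - first + 1
```

A 64KB write at offset 0 touches one IU; the same write shifted to a 4KB offset touches two, roughly doubling the drive's internal work for that write.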
Other SKUs have 8KB or 16KB IU sizes; some have 4KB, which requires no specific
`min_alloc_size`. Note that QLC is a good fit for workloads where writes tend to be
sequential, reasonably large on average, and infrequent. I know of successful QLC RGW
clusters that see 0.01 DWPD. Yes, that decimal point is in the correct place. Millions
of 1KB files overwritten once an hour aren't a good workload for QLC. Backups,
archives, even something like an OpenStack Glance pool are good fits. I'm about to
trial QLC as Prometheus LTS as well. Read-mostly workloads are good fits, as the read
performance is in the ballpark of TLC. Write performance is still going to be way better
than any HDD, and you aren't stuck with legacy SATA slots. You also don't have to
buy or manage a fussy HBA.
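To put that 0.01 DWPD figure in perspective, DWPD is just daily bytes written divided by drive capacity. A quick sketch (the drive size and daily write volume below are illustrative assumptions, not numbers from this thread):

```python
def dwpd(bytes_written_per_day: float, capacity_bytes: float) -> float:
    """Drive writes per day: the fraction of full capacity written daily."""
    return bytes_written_per_day / capacity_bytes

# Assumed example: a 30.72 TB QLC drive ingesting ~300 GB of backups per day
rate = dwpd(300e9, 30.72e12)  # roughly 0.01 DWPD
```

At that rate the drive writes its own capacity about once every hundred days, which is why even modest QLC endurance ratings go a long way for archive-style workloads.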
How old is your deep archive cluster, how many NVMes does it have, and how many did you
have to replace?
I don't personally have one at the moment.
Even with TLC, endurance is, dare I say, overrated. 99% of enterprise SSDs never burn
more than 15% of their rated endurance. SSDs from at least some manufacturers have a
timed workload feature in firmware that will estimate drive lifetime when presented with a
real-world workload -- this is based on observed PE cycles.
Pretty much any SSD will report lifetime used or remaining, so whether TLC, QLC, or even
MLC or SLC, you should collect those metrics in your time-series DB and watch both for
drives nearing EOL and for their burn rates.
On Sun, Apr 21, 2024 at 11:06 PM Anthony D'Atri <anthony.datri(a)gmail.com>
wrote:
A deep archive cluster benefits from NVMe too. You can use QLC up to 60TB in size, 32 of
those in one RU makes for a cluster that doesn’t take up the whole DC.
On Apr 21, 2024, at 5:42 AM, Darren Soothill
<darren.soothill(a)croit.io> wrote:
Hi Niklaus,
Lots of questions here, but let me try and get through some of them.
Personally, unless a cluster is for deep archive, I would never suggest configuring or
deploying a cluster without RocksDB and WAL on NVMe.
There are a number of benefits to this in terms of performance and recovery. Small writes
go to the NVMe first before being written to the HDD, and it makes many recovery
operations far more efficient.
As to how much faster it makes things, that very much depends on the type of workload you
have on the system. Lots of small writes will see a significant difference; very large
writes, not as much.
Things like compactions of the RocksDB database are a lot faster, as they now run from
NVMe rather than from the HDD.
We normally work with up to a 1:12 ratio, i.e. 1 NVMe for every 12 HDDs. This assumes the
NVMes being used are good mixed-use enterprise NVMes with power-loss protection.
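The sizing that falls out of that ratio is simple arithmetic. A sketch (the 3.84 TB capacity below is an assumed example, not a recommendation from this thread):

```python
# Sketch: carve one shared NVMe into equal DB/WAL partitions for its HDD OSDs.
nvme_capacity_gb = 3840        # assumed mixed-use enterprise NVMe size
hdds_per_nvme = 12             # the 1:12 ratio described above
db_wal_per_osd_gb = nvme_capacity_gb / hdds_per_nvme  # capacity per OSD
```

Whether the resulting per-OSD slice is large enough for your RocksDB depends on your object count and workload, so treat this as a starting point for planning, not a rule.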
As to failures: yes, a failure of the NVMe would mean the loss of 12 OSDs, but this is no
worse than the failure of an entire node, which is something Ceph is designed to handle.
I certainly wouldn't be thinking about putting the NVMes into RAID sets, as that would
degrade their performance when the whole point is to get better performance.
Darren Soothill
Looking for help with your Ceph cluster? Contact us at
https://croit.io/
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web:
https://croit.io/ | YouTube:
https://goo.gl/PGE1Bx
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io
--
Alexander E. Patrakov