We are starting to use 18TB spindles and have loads of cold data with only a thin layer of hot
data. One 4/8TB NVMe drive as a cache in front of 6x18TB should deliver close to, or even
matching, SSD performance for the hot data at a reasonable extra cost per TB of storage. My
plan is to wait another 1-2 years for PCIe NVMe prices to drop and then start using this
method. A second advantage is that one can continue to deploy colocated HDD OSDs, since the
WAL/DB will certainly land and stay in the cache. The cache can be added to existing OSDs
without redeployment. In addition, dm-cache uses a hit-count method to decide on promotion
to the cache, which works very differently from promotion to Ceph cache pools. Dm-cache can
afford that due to its local nature. In particular, it doesn't promote on just a single
access, which means that a weekly or monthly backup will not flush the entire cache
every time.
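To illustrate why that matters, here is a minimal sketch in Python (a toy model, not dm-cache's actual smq policy; capacity, working-set size and block counts are made up): with promote-on-first-access a single-pass backup scan evicts the entire hot set, while a hit-count threshold of two leaves it untouched.

from collections import OrderedDict

class ToyCache:
    """LRU cache with a configurable promotion threshold (hits before admission)."""
    def __init__(self, capacity, promote_after):
        self.capacity = capacity
        self.promote_after = promote_after
        self.cached = OrderedDict()   # block -> None, order = recency
        self.hits = {}                # access counts for not-yet-cached blocks

    def access(self, block):
        if block in self.cached:
            self.cached.move_to_end(block)          # refresh recency on a hit
            return
        self.hits[block] = self.hits.get(block, 0) + 1
        if self.hits[block] >= self.promote_after:  # promote once threshold is reached
            if len(self.cached) >= self.capacity:
                self.cached.popitem(last=False)     # evict least recently used
            self.cached[block] = None
            del self.hits[block]

def survivors(promote_after):
    cache = ToyCache(capacity=100, promote_after=promote_after)
    hot = range(100)                     # hot working set, fits the cache exactly
    for _ in range(10):                  # warm the cache with repeated hot reads
        for b in hot:
            cache.access(b)
    for b in range(10_000, 20_000):      # one-pass "backup" scan over cold blocks
        cache.access(b)
    return sum(1 for b in hot if b in cache.cached)

print(survivors(promote_after=1))   # 0   -> the scan flushed the hot set
print(survivors(promote_after=2))   # 100 -> the single-pass scan never got promoted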
All-SSD pools for this data (CephFS on an EC pool on HDD) will be unaffordable for us for a
long time. Not to mention that these large SSDs are almost certainly QLC, which have much
lower sustained throughput than the 18TB helium drives (they do have higher IOPS, which is
not so relevant for our file-system workloads). The cache method will provide at least the
additional IOPS that WAL/DB devices would, but, due to its size, also data caching. We need
to go NVMe, because the largest-capacity configuration of the servers we plan to use
(R740xd2) is 24xHDD + 4x PCIe NVMe. You can choose either 2 extra drives or 4 PCIe NVMe,
but not both. So the NVMe drives cannot be swapped for fast SSDs, as those would eat drive
slots.
There were a few threads over the past 1-2 years where people dropped in some of these
observations, and I just took note of them. It is already used in production, and from what
I gather people are happy with it. It is much easier than WAL/DB partitions, and all the
sizing problems around the RocksDB levels (L0/L1/...) are trivially sorted. With NVMe sizes
growing rapidly beyond what WAL/DB devices can utilize, and since LVM is now the standard
for OSD devices, using LVM dm-cache seems to be the way forward for me.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Anthony D'Atri <anthony.datri(a)gmail.com>
Sent: 16 November 2020 03:00:38
To: Frank Schilder
Subject: Re: [ceph-users] Re: which of cpu frequency and number of threads servers osd
better?
Thanks. I’m curious how the economics for that compare with just using all SSDs:
* HDDs are cheaper
* But colo SSDs are operationally simpler
* And depending on configuration you can provision a cheaper HBA
On Nov 14, 2020, at 2:04 AM, Frank Schilder
<frans(a)dtu.dk> wrote:
My plan is to use at least 500GB of NVMe per HDD OSD. I have not started that yet, but there
are threads where other people share their experience. If you go beyond roughly 300GB per
OSD, apparently a WAL/DB device cannot really use the extra capacity. With dm-cache or the
like you would additionally start holding hot data in the cache.
Ideally, I can split a 4TB or even an 8TB NVMe over 6 OSDs.
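As a quick back-of-the-envelope check (Python, purely illustrative; the 2% metadata reserve is an assumption on my part, not a dm-cache rule), that split gives each OSD far more cache than a WAL/DB device could use:

def per_osd_cache_gb(nvme_tb, hdd_osds=6, metadata_reserve=0.02):
    # Keep a little space back for cache/pool metadata (assumed 2%).
    usable_gb = nvme_tb * 1000 * (1 - metadata_reserve)
    return usable_gb / hdd_osds

for nvme_tb in (4, 8):
    print(f"{nvme_tb} TB NVMe over 6 OSDs: ~{per_osd_cache_gb(nvme_tb):.0f} GB per OSD")
# 4 TB -> ~653 GB per OSD, 8 TB -> ~1307 GB per OSD,
# both well above the ~300 GB a pure WAL/DB device can make use of.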
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Anthony D'Atri <anthony.datri(a)gmail.com>
Sent: 14 November 2020 10:57:57
To: Frank Schilder
Subject: Re: [ceph-users] Re: which of cpu frequency and number of threads servers osd
better?
Guten Tag.
My plan for the future is to use dm-cache for LVM
OSDs instead of WAL/DB device.
Do you have any insights into the benefits of that approach compared with a WAL/DB device,
and of dm-cache vs bcache vs dm-writecache vs … ? And any on sizing the cache device and
handling failures? Presumably the DB will be active enough that it will persist in the
cache, so the sizing should at a minimum be enough to hold 2 copies of the DB to accommodate
compaction?
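Something like this crude sketch of that sizing idea (Python; the DB footprint, WAL size and hot-data figure are placeholder assumptions on my part)?

def min_cache_gb(db_gb, wal_gb=2, hot_data_gb=0):
    # Two copies of the DB so compaction can rewrite it without spilling,
    # plus WAL and whatever hot data should also stay resident.
    return 2 * db_gb + wal_gb + hot_data_gb

print(min_cache_gb(db_gb=60))                   # DB only: 122 GB
print(min_cache_gb(db_gb=60, hot_data_gb=400))  # plus hot data: 522 GB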
I have an existing RGW cluster on HDDs that utilizes a cache tier; the high water mark is
set fairly low so that it doesn’t fill up, something that apparently happened last
Christmas. I’ve been wanting to get a feel for OSD cache as an alternative to deprecated
and fussy cache tiering, as well as something like a Varnish cache on RGW load balancers
to short-circuit small requests.
— Anthony