Thanks for the feedback, everyone! It seems we have more to look into regarding NVMe
enterprise storage solutions. The workload doesn't demand NVMe performance, so SSD seems
to be the most cost-effective way to handle this. The performance discussion is very
interesting!
Regards,
Brent
-----Original Message-----
From: Stefan Kooman <stefan(a)bit.nl>
Sent: Wednesday, September 23, 2020 3:49 AM
To: Brent Kennedy <bkennedy(a)cfl.rr.com>; 'ceph-users' <ceph-users(a)ceph.io>
Subject: Re: [ceph-users] NVMe's
On 2020-09-23 07:39, Brent Kennedy wrote:
We currently run an SSD cluster and HDD clusters and are looking at possibly creating a
cluster for NVMe storage. For spinners and SSDs, the max recommended per OSD host server
seemed to be 16 OSDs (I know it depends on the CPUs and RAM, e.g. 1 CPU core and 2 GB of
memory per OSD).
Questions:
1. If we do a JBOD setup, the servers can hold 48 NVMe drives. If the servers were bought
with 48 cores and 100+ GB of RAM, would this make sense?
As always ... it depends :-). But I would not recommend it. For NVMe you want more like
10 GB per OSD (osd_memory_target) and some spare RAM for buffer cache. That amount of CPU
would be sufficient for normal use, but might not be enough in a recovery situation,
during RocksDB housekeeping, etc. It also depends on which Ceph features you want to use
(RBD won't use much OMAP/META, so you would be OK with that use case).
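As a minimal sketch, here is how that 10 GB figure translates into configuration (the
value is simply 10 GiB in bytes; size it to your actual RAM budget):

    # at runtime, via the centralized config store
    ceph config set osd osd_memory_target 10737418240

    # or statically in ceph.conf on each OSD host
    [osd]
    osd_memory_target = 10737418240  # 10 GiB per OSD daemon

With 48 OSDs per host that alone is ~480 GB, which is why 100+ GB of RAM doesn't stretch
far at that density.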
2. Should we just RAID 5 groups of NVMe drives instead (and buy less CPU/RAM)? There is a
reluctance to waste even a single drive on RAID, because redundancy is basically Ceph's
job.
Yeah, let Ceph handle the redundancy. You don't want to use hardware RAID controllers.
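To illustrate, redundancy is expressed at the pool level while each raw drive stays its
own OSD; a sketch with a hypothetical pool name and PG count:

    # replicated pool: keep 3 copies, keep serving I/O with at least 2
    ceph osd pool create rbd-nvme 512 512 replicated
    ceph osd pool set rbd-nvme size 3
    ceph osd pool set rbd-nvme min_size 2

An erasure-coded pool would be the closer analogue to RAID 5 if raw-capacity efficiency
is the goal, at the cost of more CPU and latency.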
3. The plan was to build this with Octopus (hopefully there are no issues we should know
about). Though I just saw one posted today, this is a few months off.
Should be OK, especially for new clusters. Test, test, test.
4. Any feedback on max OSDs?
I would recommend something like 10 NVMe drives per server. From a performance
perspective, more nodes are always better than denser nodes: the more nodes you have, the
smaller the impact when one node fails, and the faster recovery / backfill completes.
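A rough worked example (the drive size is hypothetical): with 48 x 4 TB NVMe per node,
one node failure puts ~192 TB into backfill at once; at 10 x 4 TB per node, the same
failure means only ~40 TB, spread across more surviving peers.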
5. Right now they run 10Gb everywhere with 80Gb uplinks. I was thinking this would need
at least 40Gb links to every node (the hope is to use these to speed up image processing
at the application layer locally in the DC).
Do you want to be able to fully utilize all the NVMe throughput? That will be an issue.
You will be limited by bandwidth when backfilling those OSDs (especially if you need to
backfill a whole node at once).
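Back-of-the-envelope, reusing the hypothetical 192 TB node from above: a 10 Gb/s link
moves at most ~1.25 GB/s, so refilling a whole node takes on the order of
192,000 GB / 1.25 GB/s ≈ 154,000 s, i.e. roughly 43 hours; even at 40 Gb/s (~5 GB/s) it
is still about 11 hours, ignoring protocol overhead and concurrent client I/O.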
I haven't spoken to the Dell engineers yet, but my concern with NVMe is that the RAID
controller would end up being the bottleneck (next in line after network connectivity).
Most probably, yes, plus increased latency. My standpoint is to not use hardware RAID
controllers for NVMe storage.
Gr. Stefan