## Requirements
* ~1 PB usable space for file storage, extensible in the future
* The files are mostly "hot" data, no cold storage
* Purpose: storage for big files, used mostly from Windows workstations (10G access)
* Performance: the faster, the better :)
## Global design
* 8+3 Erasure Coded pool
EC performance for RBD is going to be mediocre at best, esp. on spinners.
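As a back-of-the-envelope check (a sketch; only the 1 PB target and the 8+3 profile come from the thread itself), here is the raw capacity an 8+3 pool needs per usable PB:

```python
# Raw capacity needed for a given usable capacity under k+m erasure coding.
# With 8+3, every 8 data chunks carry 3 coding chunks, so the overhead
# factor is (k+m)/k = 11/8 = 1.375x.

def ec_raw_needed(usable_pb: float, k: int, m: int) -> float:
    """Raw PB required to provide `usable_pb` of usable space with k+m EC."""
    return usable_pb * (k + m) / k

print(ec_raw_needed(1.0, 8, 3))  # 1.375 PB raw for 1 PB usable
```

So the space efficiency of 8+3 is attractive; the catch, as noted above, is small-write performance, not capacity.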
* ZFS on RBD, exposed via Samba shares (clustered with failover)
Why ZFS? Mind you, I like ZFS, but layering it on top of RBD adds more overhead and complexity.
* 128 GB RAM
Nowhere near enough. You’re going to want 256 GB at the very least.
* Networking : 2 x Cisco N3K 3132Q or 3164Q
* 2 x 40G per server for ceph network (LACP/VPC for HA)
* 2 x 40G per server for public network (LACP/VPC for HA)
Don’t bother with a replication network.
* We're used to running mon and mgr daemons on a few of our OSD nodes, without any issue so far: is this a bad idea for a big cluster?
Contention for resources can lead to a vicious circle, and simultaneous failure/maintenance of a mon/mgr/OSD node can be ugly. Put your mons on something cheap: five of them, or three if you must.
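The reason for "5 of them or 3" is quorum: monitors need a strict majority alive, so five mons survive two failures while three survive only one. A quick sketch:

```python
# Ceph monitors form a quorum that requires a strict majority of mons
# to be up. This shows how many mon failures each cluster size tolerates.

def mon_failures_tolerated(n_mons: int) -> int:
    """How many monitors can fail while a majority quorum survives."""
    quorum = n_mons // 2 + 1
    return n_mons - quorum

for n in (3, 5):
    print(n, "mons tolerate", mon_failures_tolerated(n), "failures")
```

This is also why even mon counts buy nothing: four mons tolerate the same single failure as three.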
* We thought about using cache tiering on an SSD pool, but a large part of the PB is used on a daily basis, so we expect the cache to be not very effective and really expensive?
Cache tiering is deprecated at best. Not a good idea to invest in it. If you’re going to
use SSDs, there are better ways.
* Could a 2x10G network be enough ?
Yes.
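For scale (a rough sketch with assumed numbers, not figures from the thread): a 2x10G LACP bond gives on the order of 2 GB/s of usable aggregate throughput per server, which already exceeds what a node of HDD OSDs can stream:

```python
# Rough comparison of a 2x10G bond against what a node full of HDD OSDs
# can sustain. The 0.9 link efficiency and the 12 x 150 MB/s HDD figures
# are assumptions for illustration.

def bond_gbytes_per_s(links: int, gbit_per_link: float,
                      efficiency: float = 0.9) -> float:
    """Approximate usable GB/s of an LACP bond."""
    return links * gbit_per_link * efficiency / 8

hdd_node_gbytes = 12 * 0.150  # assumed: 12 HDD OSDs per node, ~150 MB/s each
print(bond_gbytes_per_s(2, 10.0))  # 2.25 GB/s
print(hdd_node_gbytes)             # 1.8 GB/s
```

In other words, spinners will bottleneck before a 2x10G bond does; the 40G links only start to matter with SSD OSDs.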
* ZFS on Ceph ? Any thoughts ?
ZFS is great, but unless you have a specific need, it sounds like a lot of overhead and
complexity.
* Hardware RAID with battery-backed write cache will allow the OSD to ack writes before they hit spinning rust.
Disagree. See my litany from a few months ago. Use a plain, IT-mode HBA. Take the $$
you save and put it toward building your cluster out of SSDs instead of HDDs. That way
you don’t have to mess with the management hassles of maintaining and allocating external
WAL+DB partitions too.
* 3x replication instead of EC
This. The performance of EC RBD volumes will likely disappoint you, especially on spinners. Having suffered 3R RBD on LFF spinners, I predict that you would also be unhappy unless your use-case is only archival/backups or some other cold, latency-tolerant workload.
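To make the trade-off concrete (a sketch; only the 1 PB usable target and the 8+3 profile come from the thread), 3x replication roughly doubles the raw space bill relative to 8+3 EC, which is the capacity price paid for its better latency:

```python
# Raw-space cost of 3x replication vs 8+3 erasure coding for 1 PB usable.

def raw_for_replication(usable_pb: float, replicas: int) -> float:
    """Raw PB needed under n-way replication."""
    return usable_pb * replicas

def raw_for_ec(usable_pb: float, k: int, m: int) -> float:
    """Raw PB needed under k+m erasure coding."""
    return usable_pb * (k + m) / k

print(raw_for_replication(1.0, 3))  # 3.0 PB raw
print(raw_for_ec(1.0, 8, 3))        # 1.375 PB raw
print(round(raw_for_replication(1.0, 3) / raw_for_ec(1.0, 8, 3), 2))  # 2.18
```

Whether that ~2.2x raw premium is worth it depends on how latency-sensitive the Samba/RBD workload really is.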