Rather than a cache tier, I would put an NVMe device in each OSD box
for Bluestore's DB and WAL. This will significantly improve small IOs.
14 nodes * 16 HDDs / 11 EC chunks ≈ 20 HDDs' worth of write IOPS. If
you expect these files to be written sequentially, this is probably OK.
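To make that arithmetic explicit, a quick sketch (node and drive counts taken from the proposal below):

```python
# Write IOPS estimate for an 8+3 erasure-coded pool: every client
# write must touch all k + m = 11 chunks, so the aggregate write
# IOPS of the pool is roughly the raw spindle count divided by 11.
nodes = 14
hdds_per_node = 16
k, m = 8, 3                          # the proposed 8+3 EC profile

total_hdds = nodes * hdds_per_node   # 224 spindles
effective_hdds = total_hdds / (k + m)
print(f"~{effective_hdds:.1f} HDDs' worth of write IOPS")
```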
Mons and mgrs on OSD nodes should be OK as far as I know (we are doing
it). However, if you use CephFS, you will want the fastest CPUs you
can get for the MDS, as it only scales to 2-3 cores in our tests. I
would avoid the complexity of CephFS if you do not need POSIX
semantics.
On Tue, Dec 3, 2019 at 3:07 PM Fabien Sirjean <fsirjean(a)eddie.fdn.fr> wrote:
>
> Hi Ceph users !
>
> After years of using Ceph, we plan to soon build a new cluster, bigger than
> what we've built in the past. As the project is still at the design stage, I'd
> like to have your thoughts on our planned design: any feedback is welcome :)
>
>
> ## Requirements
>
> * ~1 PB usable space for file storage, extensible in the future
> * The files are mostly "hot" data, no cold storage
> * Purpose : storage for big files, mostly used from Windows workstations
>   (10G access)
> * Performance : the faster, the better :)
>
>
> ## Global design
>
> * 8+3 Erasure Coded pool
> * ZFS on RBD, exposed via samba shares (cluster with failover)
>
>
> ## Hardware
>
> * 1 rack (multi-site would be better, of course...)
>
> * OSD nodes : 14 x supermicro servers
> * 24 usable bays in 2U rackspace
> * 16 x 10 TB nearline SAS HDD (8 bays for future needs)
> * 2 x Xeon Silver 4212 (12C/24T)
> * 128 GB RAM
> * 4 x 40G QSFP+
>
> * Networking : 2 x Cisco N3K 3132Q or 3164Q
> * 2 x 40G per server for ceph network (LACP/VPC for HA)
> * 2 x 40G per server for public network (LACP/VPC for HA)
> * QSFP+ DAC cables
>
>
> ## Sizing
>
> If we've done the maths well, we expect to have :
>
> * 2.24 PB of raw storage, extensible to 3.36 PB by adding HDD
> * 1.63 PB expected usable space with 8+3 EC, extensible to 2.44 PB
> * ~1 PB of usable space if we want to keep OSD utilisation under 66% to allow
>   losing nodes without problems, extensible to 1.6 PB (same condition)
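A quick sketch cross-checking the figures above (10 TB drives, 14 nodes, 16 bays filled now and 24 when fully populated, 8+3 EC, 66% fill target, all as stated):

```python
# Cross-check of the capacity maths: raw, post-EC usable, and "safe"
# capacity at <66% OSD utilisation, for the current 16 drives per
# node and the fully populated 24.
nodes, drive_tb = 14, 10
k, m = 8, 3
ec_ratio = k / (k + m)                       # usable fraction under 8+3 EC

for bays in (16, 24):
    raw_pb = nodes * bays * drive_tb / 1000  # raw capacity in PB
    usable_pb = raw_pb * ec_ratio            # after EC overhead
    safe_pb = usable_pb * 0.66               # keep OSDs under 66% full
    print(f"{bays} drives/node: raw {raw_pb:.2f} PB, "
          f"usable {usable_pb:.2f} PB, safe {safe_pb:.2f} PB")
```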
>
>
> ## Reflections
>
> * We're used to running mon and mgr daemons on a few of our OSD nodes, without
>   any issue so far : is this a bad idea for a big cluster ?
>
> * We thought about using cache tiering on an SSD pool, but a large part of the
>   PB is used on a daily basis, so we expect the cache to be not very effective
>   and really expensive ?
>
> * Could a 2x10G network be enough ?
>
> * ZFS on Ceph ? Any thoughts ?
>
> * What about CephFS ? We'd like to use RBD diff for backups, but it looks
>   impossible to use snapshot diffs with CephFS ?
>
>
> Thanks for reading, and sharing your experiences !
>
> F.
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io