Hello,

   * 2 x Xeon Silver 4212 (12C/24T)

I would choose single-CPU AMD EPYC systems instead: lower price and better performance. Supermicro has some good AMD systems as well.

   * 16 x 10 TB nearline SAS HDD (8 bays for future needs)

Don't waste money here either. There is no real gain; invest it in more or faster (SSD) disks instead.

   * 4 x 40G QSFP+

With 24x spinning media per node, even a single 40G link will be enough: 24 HDDs at roughly 150-200 MB/s sequential each top out around 4-5 GB/s, which is about what one 40G link carries. Again, no gain for a lot of money.

  * 2 x 40G per server for ceph network (LACP/VPC for HA)
  * 2 x 40G per server for public network (LACP/VPC for HA)

Use VLANs if you really want to separate the networks. Most of the time we see new customers coming in with problems on such configurations, and from our experience we don't suggest configuring Ceph that way.
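
If you do go the VLAN route, the separation is just two subnets in ceph.conf. A minimal sketch (the subnets below are placeholders):

   [global]
       # client-facing traffic
       public_network  = 192.168.10.0/24
       # replication / recovery traffic
       cluster_network = 192.168.20.0/24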

 * ZFS on RBD, exposed via samba shares (cluster with failover)

Maybe, just maybe, think about running Samba directly on top of CephFS to export the data. There is no need for all the overhead and possible bugs you would otherwise encounter.
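
A minimal sketch of such a share using Samba's vfs_ceph module (share name, path and cephx user are placeholders):

   [bigfiles]
       path = /
       vfs objects = ceph
       ceph:config_file = /etc/ceph/ceph.conf
       ceph:user_id = samba
       kernel share modes = no
       read only = no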

* We're used to running mon and mgr daemons on a few of our OSD nodes, without any issue so far: is this a bad idea for a big cluster?

We always do so and have never had a problem with it. Just make sure the MONs have enough resources for your workload.

* We thought about using cache tiering on an SSD pool, but a large part of the PB is used on a daily basis, so we expect the cache to be not very effective and really expensive?

Cache tiering tends to be error prone, and we have seen a lot of cluster meltdowns over the last 7 years because of it. Just go for an all-flash cluster, or use DB/WAL devices to improve performance.
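
As a rough sketch of the DB/WAL variant (device names are placeholders), each OSD gets its RocksDB/WAL placed on flash at creation time:

   ceph-volume lvm create --bluestore \
       --data /dev/sdb \
       --block.db /dev/nvme0n1p1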

 * Could a 2x10G network be enough ?

Yes ;), but under recovery workloads it might slow down the recovery a bit. However, I don't believe it will be a problem in the scenario you describe.

 * ZFS on Ceph ? Any thoughts ?

just don't ;)

 * What about CephFS? We'd like to use RBD diff for backups, but it looks impossible to use snapshot diffs with CephFS?

Please see https://docs.ceph.com/docs/master/dev/cephfs-snapshots/
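
As a rough sketch of how CephFS snapshots work (filesystem name, mount point and snapshot name are placeholders), you simply create a directory under .snap and can then run a file-level tool like rsync against it for backups:

   # allow snapshots on the filesystem (disabled by default on older releases)
   ceph fs set cephfs allow_new_snaps true

   # snapshot a directory by creating an entry under its hidden .snap directory
   mkdir /mnt/cephfs/projects/.snap/backup-2019-12-03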

If you have further questions or want some consulting to get the best Ceph cluster for the job, please feel free to contact us.
 
--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.verges@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Tue, 3 Dec 2019 at 21:07, Fabien Sirjean <fsirjean@eddie.fdn.fr> wrote:
Hi Ceph users !

After years of using Ceph, we plan to soon build a new cluster, bigger than what
we've done in the past. As the project is still at the planning stage, I'd like
to have your thoughts on our design: any feedback is welcome :)


## Requirements

 * ~1 PB usable space for file storage, extensible in the future
 * The files are mostly "hot" data, no cold storage
 * Purpose: storage for big files, mostly accessed from Windows workstations (10G access)
 * The more performance, the better :)


## Global design

 * 8+3 Erasure Coded pool
 * ZFS on RBD, exposed via samba shares (cluster with failover)


## Hardware

 * 1 rack (multi-site would be better, of course...)

 * OSD nodes : 14 x supermicro servers
   * 24 usable bays in 2U rackspace
   * 16 x 10 TB nearline SAS HDD (8 bays for future needs)
   * 2 x Xeon Silver 4212 (12C/24T)
   * 128 GB RAM
   * 4 x 40G QSFP+

 * Networking : 2 x Cisco N3K 3132Q or 3164Q
   * 2 x 40G per server for ceph network (LACP/VPC for HA)
   * 2 x 40G per server for public network (LACP/VPC for HA)
   * QSFP+ DAC cables


## Sizing

If we've done the maths right, we expect to have:

 * 2.24 PB of raw storage, extensible to 3.36 PB by adding HDD
 * 1.63 PB expected usable space with 8+3 EC, extensible to 2.44 PB
 * ~1 PB of usable space if we want to keep OSD usage under 66% to allow
   losing nodes without problems, extensible to 1.6 PB (same condition)


## Reflections

 * We're used to running mon and mgr daemons on a few of our OSD nodes, without
   any issue so far: is this a bad idea for a big cluster?

 * We thought about using cache tiering on an SSD pool, but a large part of the
   PB is used on a daily basis, so we expect the cache to be not very effective
   and really expensive?

 * Could a 2x10G network be enough ?

 * ZFS on Ceph ? Any thoughts ?

 * What about CephFS? We'd like to use RBD diff for backups, but it looks
   impossible to use snapshot diffs with CephFS?


Thanks for reading, and sharing your experiences !

F.

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io