After years of using Ceph, we are planning to build a new cluster, bigger than anything we've done in the past. As the project is still at the design stage, I'd like to have your thoughts on our planned design: any feedback is welcome :)
## Requirements
* ~1 PB usable space for file storage, extensible in the future
* The files are mostly "hot" data, no cold storage
* Purpose: storage for big files, used mostly from Windows workstations (10G access)
* The more performance, the better :)
## Global design
* 8+3 Erasure Coded pool
* ZFS on RBD, exposed via Samba shares (clustered, with failover)
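Since RBD image metadata (headers, omap) cannot live on an erasure-coded pool, the usual layout on recent Ceph releases is a small replicated pool for the image plus the 8+3 EC pool attached as the data pool, with allow_ec_overwrites enabled on it. A minimal sketch with the Python rados/rbd bindings; the pool and image names are made up for illustration:

```python
import rados
import rbd

# Connect using the local ceph.conf and default keyring.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    # Replicated pool that holds the RBD metadata (hypothetical name).
    ioctx = cluster.open_ioctx('rbd-meta')
    try:
        size_bytes = 200 * 1024 ** 4   # e.g. a 200 TiB thin-provisioned image
        rbd.RBD().create(
            ioctx,
            'zfs-img-01',              # hypothetical image name
            size_bytes,
            # Data objects go to the 8+3 EC pool (which needs
            # allow_ec_overwrites=true); pool name is made up.
            data_pool='rbd-ec83',
        )
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```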
## Hardware
* 1 rack (multi-site would be better, of course...)
* OSD nodes: 14 x Supermicro servers
* 24 usable bays in 2U rackspace
* 16 x 10 TB nearline SAS HDD (8 bays for future needs)
* 2 x Xeon Silver 4212 (12C/24T)
* 128 GB RAM
* 4 x 40G QSFP+
* Networking: 2 x Cisco N3K 3132Q or 3164Q
* 2 x 40G per server for the Ceph cluster network (LACP/vPC for HA)
* 2 x 40G per server for the public network (LACP/vPC for HA)
* QSFP+ DAC cables
## Sizing
If we've done the maths right, we expect to have (a quick sanity check follows the list):
* 2.24 PB of raw storage, extensible to 3.36 PB by adding HDD
* 1.63 PB expected usable space with 8+3 EC, extensible to 2.44 PB
* ~1 PB of usable space if we want to keep OSD utilisation under 66%, so that we can lose nodes without problems, extensible to 1.6 PB (same condition)
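Quick sanity check of those numbers in Python (pure arithmetic; TB and PB are decimal, as on the drive labels):

```python
# Sizing sanity check using the figures from the list above.
nodes = 14
drive_tb = 10
k, m = 8, 3                      # 8+3 erasure coding
ec_fraction = k / (k + m)        # ~0.727 of raw becomes usable
fill_target = 0.66               # keep OSD utilisation under ~66%

for label, drives_per_node in (("16 drives/node", 16), ("24 drives/node", 24)):
    raw_pb = nodes * drives_per_node * drive_tb / 1000
    usable_pb = raw_pb * ec_fraction
    safe_pb = usable_pb * fill_target
    print(f"{label}: raw {raw_pb:.2f} PB, usable {usable_pb:.2f} PB, "
          f"under 66% fill {safe_pb:.2f} PB")

# 16 drives/node: raw 2.24 PB, usable 1.63 PB, under 66% fill 1.08 PB
# 24 drives/node: raw 3.36 PB, usable 2.44 PB, under 66% fill 1.61 PB
```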
## Reflections
* We're used to running mon and mgr daemons on a few of our OSD nodes, without any issue so far: is this a bad idea for a big cluster?
* We considered cache tiering on an SSD pool, but a large part of the PB is used on a daily basis, so we expect the cache to be not very effective, and really expensive. Does that sound right?
* Could a 2x10G network be enough?
I would say yes, those slow disks will not deliver more anyway.
This is going to be a relatively "slow" setup with a limited amount of read caching: with 16 drives and 128 GB of memory there will only be a few GB per OSD for read caching, meaning that reads and writes will essentially all hit the slow drives underneath.
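Back-of-the-envelope, assuming BlueStore's default osd_memory_target of roughly 4 GiB per OSD:

```python
# Rough per-node memory budget (assumptions, not measurements).
node_ram_gib = 128
osds_per_node = 16
osd_memory_target_gib = 4                     # BlueStore default, ~4 GiB per OSD

per_osd_share = node_ram_gib / osds_per_node  # 8 GiB of RAM per OSD if split evenly
used_by_osds = osds_per_node * osd_memory_target_gib
left_for_os = node_ram_gib - used_by_osds

print(f"{per_osd_share:.0f} GiB RAM per OSD if split evenly; "
      f"{used_by_osds} GiB taken by the OSD daemons at the default target; "
      f"{left_for_os} GiB left for OS and page cache")
# Either way: only a few GiB of cache sitting in front of each 10 TB drive.
```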
And in a "doubly slow" fashion: a write hits 8+3 OSDs and waits for the sync acks back to the primary, and likewise a read hits 8+3 OSDs before returning to the client.
Depending on the workload this may just work for you, but it is definitely not fast.
Suggestions for improvements:
* Hardware RAID with a battery-backed write cache: lets the OSDs ack writes before they hit the spinning rust.
* More memory for OSD-level read caching.
* 3x replication instead of EC (see the quick capacity comparison below).
(We have all of the above in a "similar" setup: ~1 PB, 10 OSD hosts.)
* An SSD cache-tiering pool (haven't been there, but would like to test it out).
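On the capacity side of the EC-vs-replication trade-off, using the 2.24 PB raw figure from the original mail:

```python
# Usable space from 2.24 PB raw, before any fill-level headroom.
raw_pb = 2.24

ec_usable = raw_pb * 8 / (8 + 3)   # 8+3 EC: ~1.63 PB (1.375x overhead)
rep3_usable = raw_pb / 3           # 3x replication: ~0.75 PB (3x overhead)

print(f"8+3 EC: {ec_usable:.2f} PB usable, 3x replication: {rep3_usable:.2f} PB usable")
```

So hitting the ~1 PB target with 3x replication would need roughly 3 PB of raw disk; in exchange, each write only touches 3 OSDs instead of 11.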
--
Jesper