Phil, this would be an excellent contribution to the blog or the introductory
documentation. I’ve been using Ceph for over a year, and this brought together a lot of
concepts that I hadn’t seen related so succinctly before.
One of the things that I hadn’t really conceptualized well was “why size of 3?” I knew
that PGs went to read-only without a quorum of OSDs to write to, but this is a much
simpler way to think about it.
Something I have been experimenting with that might also be interesting to the discussion
is “when to use redundancy at all”. Kafka is a good example of “eventually
consistent” software that is designed to tolerate complete node failure while sustaining
extremely high performance. If Kafka is backed by a replicated pool, I’ve come to believe
that is suboptimal compared to running three Kafka instances, each backed by unreplicated
storage in Ceph.
The logical question is “why use Ceph at all then?” To me, this is about a centralized
management process: if I am building with Ceph in most places, using it everywhere creates
operational consistency. (Modifying the CRUSH map is the path to unreplicated storage that
is pinned to the specific machine that also hosts the Kafka instance.)
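Roughly, the shape of that setup looks like the following (host, pool, and rule names here
are made up, and newer releases may also require mon_allow_pool_size_one before accepting a
size of 1):

  # CRUSH rule that places data only under the host bucket "kafka-host-a"
  ceph osd crush rule create-replicated pin-kafka-a kafka-host-a osd
  # Unreplicated pool for that Kafka instance, using the pinning rule
  ceph osd pool create kafka-a 32 32 replicated pin-kafka-a
  ceph osd pool set kafka-a size 1 --yes-i-really-mean-it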
At any rate, eventually consistent software can provide additional options for meeting
top-level failure-domain requirements.
Brian
On May 29, 2020, at 10:48 AM,
<DHilsbos(a)performair.com> wrote:
Phil;
I like to refer to basic principles and design assumptions/choices when considering
things like this. I also like to refer to more broadly understood technologies. Finally,
I'm still relatively new to Ceph, so here goes...
TLDR: Ceph is (likes to be) double-redundant (like RAID-6), while dual power (n+1) is
single-redundant.
Like RAID, Ceph (or more precisely a Ceph pool) can be in, and moves through, the
following states:
Normal --> Partially Failed (degraded) --> Recovering --> Normal.
When talking about these systems, we often gloss over Recovery, acting as if it takes no
time. Recovery does take time though, and if anything ELSE happens while recovery is
ongoing, what can the software do?
Think RAID-5: what happens if a drive fails in a RAID-5 array, and during the rebuild an
unreadable block is found on another drive? That's the limit of single redundancy. With
RAID-6, the array falls back to its second level of redundancy and the rebuild continues.
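As a rough back-of-the-envelope illustration (using the spec-sheet figure of one
unrecoverable read error per 10^14 bits that many large SATA drives quote): a rebuild that
has to read 10 TB of surviving data reads about 8 x 10^13 bits, so you'd expect on the
order of 0.8 unreadable sectors per rebuild - close to a coin flip that a single-redundant
rebuild hits one.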
Because of the long recovery times expected of modern large hard drives, Ceph pushes for
double redundancy (3x replication, or erasure coding such as 5+2). Further, it degrades
availability step by step as redundancy is lost: when the first layer of redundancy is
compromised, writes are still allowed; when the second is lost, writes are disallowed but
reads are still allowed; only when all three layers are compromised are reads disallowed
as well.
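In pool terms, the knobs behind this behavior are size and min_size (the pool name here is
just an example):

  ceph osd pool set mypool size 3      # keep three copies of every object
  ceph osd pool set mypool min_size 2  # copies that must be up before a PG serves I/O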
Dual power feeds (n+1) are only single-redundant, so the system as a whole can't achieve
better than single redundancy. Depending on the reliability of the power and your service
guarantees, this may be acceptable.
If you add ATSs (automatic transfer switches), then you need to look at the failure rate
(MTBF, or similar) to determine whether your service guarantees are impacted.
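For a sense of scale (illustrative numbers only): if a single feed is available 99.9% of
the time (roughly 8.8 hours of downtime per year), two independent feeds are down
simultaneously only about 0.001^2 = 10^-6 of the time, on the order of 30 seconds per
year; the ATS itself sits in series, though, so its own failure rate and switchover time
get added back in.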
Dominic L. Hilsbos, MBA
Director – Information Technology
Perform Air International Inc.
DHilsbos(a)PerformAir.com
www.PerformAir.com
-----Original Message-----
From: Phil Regnauld [mailto:pr@x0.dk]
Sent: Friday, May 29, 2020 12:59 AM
To: Hans van den Bogert
Cc: ceph-users(a)ceph.io
Subject: [ceph-users] Re: CEPH failure domain - power considerations
Hans van den Bogert (hansbogert) writes:
I would second that, there's no winning in this case for your requirements and single
PSU nodes. If there were 3 feeds, then yes; you could make an extra layer in your
crushmap much like you would incorporate a rack topology in the crushmap.
I'm not fully up on coffee yet today, so I haven't yet worked out why
3 feeds would help? To have a 'tie breaker' of sorts, with hosts spread
across 3 rails?
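(For context, the “extra layer” suggested above could be expressed roughly like this -
bucket and rule names are invented, reusing the existing rack type to stand in for feeds:

  # One CRUSH bucket per power feed
  ceph osd crush add-bucket feed-a rack
  ceph osd crush move feed-a root=default
  ceph osd crush move node1 rack=feed-a
  # Rule that spreads replicas across feeds rather than just across hosts
  ceph osd crush rule create-replicated rep-per-feed default rack
)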
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io