I wonder about that:
when an OSD comes back from a power loss, all the data gets
scrubbed, and there are 2 other copies.
PLP is important mostly for block storage; Ceph should easily recover
from that situation.
That's why I don't understand why I should pay more for PLP and other
protections.
I'm no expert (or power user) at all, but my reasoning is: if something power-related
can take down one of my servers, it can just as easily take down *all* my Ceph servers at
once.
And that could just as easily render all three copies inaccessible.
Or even two. I’ve been through a protracted outage (not power related) that involved
widespread OSD flapping. Although we did not lose any OSDs in the end, somehow a single RADOS
object ended up lost, in an RBD head. Very much a corner case, but if we’d been using 2R
it would have been gruesome.
On another occasion I saw a power inductor / PSU failure take down power in an entire DC
row. Fortunately we were using redundant PSUs on different circuits. One node went down
nonetheless: the PSU on the surviving power feed had a previous issue that wasn’t caught
because PSUs weren’t monitored. As with active/passive network bonds, this showed the
importance of monitoring and addressing latent faults so you don’t find them at exactly
the wrong time.
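
To make the latent-fault point concrete, here is a rough Python sketch of the kind of check I mean. It is only an illustration: the Component record, the host names, and the latent_faults() helper are all made up for this example, and how you actually collect PSU or bond-member health (BMC, SNMP, your monitoring system) is left out entirely. The idea is just to flag nodes that are still up but have quietly lost redundancy.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One redundant part of a node (hypothetical inventory record)."""
    node: str      # host name, e.g. "ceph-01" (made up)
    kind: str      # redundancy group, e.g. "psu" or "bond-member"
    name: str      # component label, e.g. "psu2"
    healthy: bool  # health as reported by whatever monitoring you have


def latent_faults(components):
    """Return (node, kind, failed_names) for groups that have lost
    redundancy but still have at least one healthy member.

    Groups where every member has failed are skipped: that is already
    an outage, not a latent fault.
    """
    groups = {}
    for c in components:
        groups.setdefault((c.node, c.kind), []).append(c)

    findings = []
    for (node, kind), members in sorted(groups.items()):
        failed = sorted(m.name for m in members if not m.healthy)
        healthy_count = len(members) - len(failed)
        if failed and healthy_count >= 1:
            findings.append((node, kind, failed))
    return findings


if __name__ == "__main__":
    inventory = [
        Component("ceph-01", "psu", "psu1", True),
        Component("ceph-01", "psu", "psu2", False),  # silently degraded
        Component("ceph-02", "psu", "psu1", True),
        Component("ceph-02", "psu", "psu2", True),
        Component("ceph-01", "bond-member", "eth0", True),
        Component("ceph-01", "bond-member", "eth1", True),
    ]
    for node, kind, failed in latent_faults(inventory):
        print(f"{node}: {kind} redundancy degraded, failed: {', '.join(failed)}")
```

Anything this reports is a node that will go down the next time its remaining feed or link has a problem, which is exactly the case described above.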