On 04/02/2021 18:57, Adam Boyhan wrote:
All great input and points guys.
Helps me lean towards 3 copies a bit more.
I mean, honestly, NVMe cost per TB isn't that much more than SATA SSD now. I'm somewhat
surprised the salesmen aren't pitching 3x replication, as it makes them more money.
To add to this, I have seen real cases as a Ceph consultant where size=2
and min_size=1 on all-flash led to data loss.
Picture this:
- One node is down (maintenance, failure, etc.)
- An NVMe device in another node dies
- You lose data
Although you can bring back the node which was down but not broken,
you are still missing data: the data on its NVMe devices is outdated
and thus the PGs will not become active.
size=2 is only safe with min_size=2, but that doesn't really provide HA.
The same goes for ZFS in mirror, raidz1, etc. If you lose one device,
the chances are real that you lose another device before the array has
healed itself.
With Ceph it's slightly more complex, but the same principles apply.
No, even with NVMe I would still highly advise against using size=2, min_size=1.
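To make that concrete, a rough sketch of checking and raising the replication
settings on a pool (the pool name "rbd" is only an example; raising size will
trigger backfill to create the extra copies):

  # current replica count and write threshold for the pool
  ceph osd pool get rbd size
  ceph osd pool get rbd min_size

  # three copies, and block I/O rather than accept writes on a single copy
  ceph osd pool set rbd size 3
  ceph osd pool set rbd min_size 2

With min_size=2 a PG that drops to one replica goes inactive instead of
quietly carrying on with a single copy.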
The question is not if you will lose data, but when you will lose data.
Within one year? Two? Three? Ten?
Wido
>
>
>
> From: "Anthony D'Atri" <anthony.datri(a)gmail.com>
> To: "ceph-users" <ceph-users(a)ceph.io>
> Sent: Thursday, February 4, 2021 12:47:27 PM
> Subject: [ceph-users] Re: NVMe and 2x Replica
>
>> I searched each to find the section where 2x was discussed. What I found was
>> interesting. First, there are really only 2 positions here: Micron's and Red
>> Hat's. Supermicro copies Micron's position paragraph word for word. Not surprising
>> considering that they are advertising a Supermicro / Micron solution.
>
> FWIW, at Cephalocon another vendor made a similar claim during a talk.
>
> * Failure rates are averages, not minima. Some drives will always fail sooner.
> * Firmware and other design flaws can result in much higher rates of failure, or
>   insidious UREs that can result in partial data unavailability or loss.
> * Latent soft failures may not be detected until a deep scrub succeeds, which could
>   be weeks later.
> * In a distributed system, there are up/down/failure scenarios where the location of
>   even one good / canonical / latest copy of data is unclear, especially when drive or
>   HBA cache is in play.
> * One of these is a power failure. Sure, PDU / PSU redundancy helps, but stuff
>   happens, like a DC underprovisioning amps, so that a spike in user traffic results
>   in the whole row going down :-x Various unpleasant things can happen.
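On the deep-scrub point above: it is easy to check how stale the scrubs are
and to kick one off by hand. A small sketch (the PG id 1.2f is made up):

  # per-PG scrub timestamps; look at the deep_scrub_stamp column
  ceph pg dump pgs | less

  # deep-scrub a suspect PG right away
  ceph pg deep-scrub 1.2f

  # how often deep scrubs are scheduled (seconds, default one week)
  ceph config get osd osd_deep_scrub_interval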
>
> I was championing R3 even pre-Ceph, when I was using ZFS or HBA RAID. As others have
> written, as drives get larger the time to fill them with replica data increases, as does
> the chance of overlapping failures. I’ve experienced R2 overlapping failures more than
> once, with and before Ceph.
>
> My sense has been that not many people run R2 for data they care about, and as has
> been written recently, 2,2 EC is safer with the same raw:usable ratio. I’ve figured that
> vendors make R2 statements like these as a selling point to assert lower TCO. My first
> response is often “How much would it cost you directly, and indirectly in terms of user /
> customer goodwill, to lose data?”
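For reference, the 2,2 EC layout mentioned above can be set up roughly like
this (the profile and pool names are just examples). With k=2, m=2 the pool
gets min_size=3 (k+1) by default on recent releases, so it keeps serving I/O
with one failure and can still reconstruct all data after two, which is
exactly where size=2 falls over:

  # 2 data chunks + 2 coding chunks, spread across hosts
  ceph osd erasure-code-profile set ec22 k=2 m=2 crush-failure-domain=host

  # create a pool that uses the profile
  ceph osd pool create ecpool 64 64 erasure ec22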
>
>> Personally, this looks like marketing BS to me. SSD shops want to sell SSDs, but
>> because of the cost difference they have to convince buyers that their products are
>> competitive.
>
> ^this. I’m watching the QLC arena with interest for the potential to narrow the CapEx
> gap. Durability has been one concern, though I’m seeing newer products claiming that e.g.
> ZNS improves that. It also seems that there are something like, what, *4* separate EDSFF /
> ruler form factors. I really want to embrace those, e.g. for object clusters, but I’m VERY
> wary of the longevity of competing standards and any single source for chassis or
> drives.
>
>> Our products cost twice as much, but LOOK, you only need 2/3 as many, and you get
>> all these other benefits (performance). Plus, if you replace everything in 2 or 3 years
>> anyway, then you won't have to worry about them failing.
>
> Refresh timelines. You’re funny ;) Every time, every single time, that I’ve worked in
> an organization that claims a 3 (or 5, or whatever) year hardware refresh cycle, it hasn’t
> happened. When you start getting close, the capex doesn’t materialize, or the opex cost of
> DC hands and operational oversight gets in the way. “How do you know that the drives will
> start failing or getting slower? Let’s revisit this in 6 months.” Etc.
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io