My impression is that cost / TB for a drive may be approaching parity, but the TB / drive
is still well below HDD (or, at densities approaching parity, cost / TB is still quite
high). I can get a Micron 15TB SSD for $2600, but why would I when I can get an 18TB
Seagate IronWolf for <$600, an 18TB Seagate Exos for <$500, or an 18TB WD Gold for
<$600? Personally I wouldn't use drives that big in our tiny little clusters, but
it exemplifies the issues around discussing cost parity.
As such, a cluster needs more drives for the same total capacity (and thus more nodes), which
drives up the cost / TB for the cluster.
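To put rough numbers on it, a quick back-of-the-envelope in Python using the list prices quoted above (street prices, so plug in whatever quotes you actually have in front of you):

# Rough $/TB from the prices quoted above; prices are examples, not formal quotes.
drives = {
    "Micron 15TB SSD":       (2600, 15),
    "Seagate IronWolf 18TB": (600, 18),
    "Seagate Exos 18TB":     (500, 18),
    "WD Gold 18TB":          (600, 18),
}
for name, (price_usd, capacity_tb) in drives.items():
    print(f"{name}: ~${price_usd / capacity_tb:.0f}/TB")

That works out to roughly $173/TB for the SSD versus $28-$33/TB for the spinners, before you even count the extra nodes.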
My 2 cents.
Dominic L. Hilsbos, MBA
Director – Information Technology
Perform Air International Inc.
DHilsbos(a)PerformAir.com
www.PerformAir.com
-----Original Message-----
From: Adam Boyhan [mailto:adamb@medent.com]
Sent: Thursday, February 4, 2021 10:58 AM
To: Anthony D'Atri
Cc: ceph-users
Subject: [ceph-users] Re: NVMe and 2x Replica
All great input and points guys.
Helps me lean towards 3 copies a bit more.
I mean, honestly, NVMe cost per TB isn't that much more than SATA SSD now. Somewhat
surprised the salesmen aren't pitching 3x replication, as it makes them more money.
From: "Anthony D'Atri" <anthony.datri(a)gmail.com>
To: "ceph-users" <ceph-users(a)ceph.io>
Sent: Thursday, February 4, 2021 12:47:27 PM
Subject: [ceph-users] Re: NVMe and 2x Replica
I searched each to find the section where 2x was
discussed. What I found was interesting. First, there are really only 2 positions here:
Micron's and Red Hat's. Supermicro copies Micron's position paragraph word for
word. Not surprising considering that they are advertising a Supermicro / Micron solution.
FWIW, at Cephalocon another vendor made a similar claim during a talk.
* Failure rates are averages, not minima. Some drives will always fail sooner
* Firmware and other design flaws can cause much higher failure rates, or insidious
UREs that result in partial data unavailability or loss
* Latent soft failures may not be detected until the next deep scrub runs, which could be
weeks later
* In a distributed system, there are up/down/failure scenarios where the location of even
one good / canonical / latest copy of data is unclear, especially when drive or HBA cache
is in play.
* One of these is a power failure. Sure, PDU / PSU redundancy helps, but stuff happens,
like a DC underprovisioning amps so that a spike in user traffic takes the whole row
down :-x Various unpleasant things can happen.
I was championing R3 even pre-Ceph when I was using ZFS or HBA RAID. As others have
written, as drives get larger the time to fill them with replica data increases, as does
the chance of overlapping failures. I’ve experienced R2 overlapping failures more than
once, with and before Ceph.
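A rough sketch of why that window matters, assuming a sustained recovery rate (the 200 MB/s figure is just an assumption; real backfill throughput depends on the cluster and its tuning):

# Hours to re-replicate one failed drive's worth of data at an assumed
# sustained recovery rate; purely illustrative.
def backfill_hours(drive_tb, recovery_mb_per_s):
    return drive_tb * 1_000_000 / recovery_mb_per_s / 3600

for size_tb in (4, 8, 18):
    print(f"{size_tb} TB drive @ 200 MB/s: {backfill_hours(size_tb, 200):.0f} h")

Roughly 6 hours for a 4 TB drive versus 25 hours for an 18 TB drive, so the window in which a second, overlapping failure can bite grows with drive size.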
My sense has been that not many people run R2 for data they care about, and as has been
written recently 2,2 EC is safer with the same raw:usable ratio. I’ve figured that vendors
make R2 statements like these as a selling point to assert lower TCO. My first response is
often “How much would it cost you directly, and indirectly in terms of user / customer
goodwill, to lose data?”.
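To make the raw:usable point concrete (simple arithmetic only, ignoring min_size and placement details):

# Raw-to-usable overhead and overlapping failures tolerated per profile.
profiles = {
    "replica 2":  ((2 / 1), 1),
    "replica 3":  ((3 / 1), 2),
    "EC k=2 m=2": (((2 + 2) / 2), 2),
}
for name, (raw_per_usable, failures) in profiles.items():
    print(f"{name}: {raw_per_usable:.1f}x raw per usable TB, "
          f"survives {failures} overlapping failure(s)")

Same 2x overhead for R2 and 2,2 EC, but the EC profile survives two overlapping failures where R2 survives only one.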
Personally, this looks like marketing BS to me. SSD
shops want to sell SSDs, but because of the cost difference they have to convince buyers
that their products are competitive.
^this. I’m watching the QLC arena with interest for the potential to narrow the CapEx gap.
Durability has been one concern, though I’m seeing newer products claiming that e.g. ZNS
improves that. There also seem to be something like, what, *4* separate EDSFF /
ruler form factors. I really want to embrace those, e.g. for object clusters, but I’m VERY
wary of the longevity of competing standards and of any single source for chassis or drives.
Our products cost twice as much, but LOOK you only
need 2/3 as many, and you get all these other benefits (performance). Plus, if you replace
everything in 2 or 3 years anyway, then you won't have to worry about them failing.
Refresh timelines. You’re funny ;) Every time, every single time, that I’ve worked in an
organization that claims a 3 (or 5, or whatever) year hardware refresh cycle, it hasn’t
happened. When you start getting close, the capex doesn’t materialize, or the opex cost of
DC hands and operational oversight gets in the way. “How do you know that the drives will
start failing or getting slower? Let’s revisit this in 6 months”. Etc.
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io