I think the answer is very simple: data loss. You are setting yourself up for data loss.
Having only +1 redundancy is a design flaw, and you will be fully responsible for losing
data on such a set-up. If that is not a problem, then it's an option. If it will
get you fired, it's not.
There is a big difference between traditional RAID1
and Ceph. Namely, with
Ceph, there are nodes where OSDs are running, and these nodes need
maintenance. You want to be able to perform maintenance even if you have
one broken OSD, that's why the recommendation is to have three copies with
Ceph. There is no such "maintenance" consideration with traditional RAID1,
so two copies are OK there.
Yes, this is exactly the point. The keyword is "redundancy under degraded
conditions". It's not just that you want to be able to maintain stuff while an OSD is
down, you want to be able to do maintenance without risking data loss every single time. A
simple example is OS updates that require reboots. Each of these operations opens a
window of opportunity for something else to fail.
The other thing is admin errors. Redundancy under degraded conditions allows admins to
commit 1 or more extra mistakes during maintenance. I learned this the hard way when
upgrading our MON data disks. We have 3 MONs and I needed to migrate each MON store to new
storage. Of course I managed to install the new disks in one and wipe the MON store on
another MON. Two hours of downtime. I will upgrade to 5 MONs as soon as possible.
More serious examples are Ceph upgrades. There are plenty of instances where these went
wrong for one reason or another and people needed to redeploy entire OSD hosts. That is a
long window of opportunity for data loss during a complete rebuild.
And never trust your boss when he says "we will replace everything long before
MTBF". This is BS as soon as budgets get cut.
I think, however, another really important aspect is data safety. In a small cluster
you might get away with thinking in typical RAID terms. However, a scale-out cluster is
defined by the property that multiple simultaneous disk failures will be observed
regularly. "Simultaneous" here means failures within the window of opportunity opened by
degraded objects being present.
The threshold for observing this is not as high as one might think. Pushing prices means
pushing hardware to the physical limits, and quality control will not catch everything. We
got a batch of 8 disks that do not seem to be great. I already had one fail (half a year
into production) and others regularly show up with slow ops. It's not bad enough to get
them replaced, so I have to deal with it. They are all in one host, so I can sleep at
night, but it is quite likely that a few of them will go while the cluster is rebuilding
redundancy.
For scale-out storage the distributed RAID of Ceph comes to the rescue; without it, it
would be impossible to run a scale-out system. If you do the stats on the probability of
losing sufficiently many OSDs that share a PG, you will find that this probability
goes down exponentially with the number of extra copies/shards, where +1 just leaves you
at ordinary RAID level - meaning it's dangerous.
Taking all of this together, maintainability and probability of data loss, I regret that I
didn't go for EC 8+3 (3 extra shards) instead of 8+2. The same holds for replication:
3 copies is the lowest number that is safe, and 4 is a lot better.
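To illustrate the exponential scaling, here is a toy back-of-the-envelope calculation (an
assumption for illustration, not a real reliability model): suppose that after one OSD of
a PG fails, each remaining OSD holding that PG fails independently with probability p
before the rebuild completes. The value p = 0.01 below is made up.

```python
from math import comb

def p_loss_replicated(p, size):
    # With one copy already lost, data is lost only if every one of the
    # remaining size-1 copies also fails during the rebuild window.
    return p ** (size - 1)

def p_loss_ec(p, k, m):
    # An EC k+m profile tolerates m lost shards. With one shard already
    # gone, data is lost if m or more of the remaining k+m-1 shards fail.
    n = k + m - 1
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(m, n + 1))

p = 0.01  # assumed per-OSD failure probability during one rebuild window
print(f"size=2: {p_loss_replicated(p, 2):.2e}")  # 1.00e-02
print(f"size=3: {p_loss_replicated(p, 3):.2e}")  # 1.00e-04
print(f"size=4: {p_loss_replicated(p, 4):.2e}")  # 1.00e-06
print(f"EC 8+2: {p_loss_ec(p, 8, 2):.2e}")
print(f"EC 8+3: {p_loss_ec(p, 8, 3):.2e}")
```

Each extra copy or shard buys roughly another factor of p in safety, which is why +1
redundancy (size=2, or EC k+1) sits at ordinary RAID level while +2 and +3 are orders of
magnitude safer.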
Bottom line: data loss and ruined weekends/holidays are not worth going cheap. If I get
a hardware alert at night, I want to be able to turn around and continue sleeping.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Alexander E. Patrakov <patrakov(a)gmail.com>
Sent: 04 February 2021 11:35:27
To: Mario Giammarco
Cc: ceph-users
Subject: [ceph-users] Re: Worst thing that can happen if I have size= 2
There is a big difference between traditional RAID1 and Ceph. Namely, with
Ceph, there are nodes where OSDs are running, and these nodes need
maintenance. You want to be able to perform maintenance even if you have
one broken OSD, that's why the recommendation is to have three copies with
Ceph. There is no such "maintenance" consideration with traditional RAID1,
so two copies are OK there.
Thu, 4 Feb 2021 at 00:49, Mario Giammarco <mgiammarco(a)gmail.com>:
Thanks Simon and thanks to other people that have
replied.
Sorry but I try to explain myself better.
It is evident to me that if I have two copies of data and one breaks, and while
Ceph is creating a new copy of the data the disk with the second
copy also breaks, then you lose the data.
It is obvious and a bit paranoid, because many servers at many customers run
on RAID1, and so you are saying: yes, you have two copies of the data, but
both can break. Consider that in Ceph recovery is automatic, while with RAID1
someone must manually go to the customer and change disks. So Ceph is
already an improvement in this case, even with size=2. With size 3 and min 2
it is a bigger improvement, I know.
What I am asking is this: what happens with min_size=1 and a split brain, a network
outage, or similar events? Does Ceph block writes because it has no quorum on the
monitors? Are there failure scenarios that I have not considered?
Thanks again!
Mario
Il giorno mer 3 feb 2021 alle ore 17:42 Simon Ironside <
sironside(a)caffetine.org> ha scritto:
On 03/02/2021 09:24, Mario Giammarco wrote:
Hello,
Imagine this situation:
- 3 servers with ceph
- a pool with size 2 min 1
I know perfectly the size 3 and min 2 is better.
I would like to know what is the worst thing that can happen:
Hi Mario,
This thread is worth a read, it's an oldie but a goodie:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.ht…
Especially this post, which helped me understand the importance of
min_size=2
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014892.ht…
Cheers,
Simon
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io
--
Alexander E. Patrakov
CV:
http://u.pc.cd/wT8otalK