Because you have 3 hosts, 2 osds each, and 3 replicas:
...
So unless your cluster was under 40-50% used, that osd is going to
become overfull.
Yes, I overlooked this. With 2 disks per host, statistics are not yet at play here; it's the
deterministic case. To run it safely, you need at least 2*3=6 times the storage
capacity compared with the data stored. Going to 2+2 EC will not really help, and size 2
min_size 1 will be a disaster in any case.
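For the arithmetic behind that 2*3=6 factor, here is a minimal sketch in Python, assuming a hypothetical 1 TB per OSD purely for illustration:

    # Hypothetical numbers, just to illustrate the "2*3=6" factor.
    osd_tb = 1.0                                     # assumed size of each OSD, in TB
    hosts, osds_per_host = 3, 2
    raw_capacity = hosts * osds_per_host * osd_tb    # 6.0 TB raw

    # With size=3 and one replica per host, every host carries a full copy.
    # If one of a host's two OSDs fails and is marked out, the surviving
    # OSD must absorb the whole copy, so the data set must fit on one OSD.
    max_safe_data = osd_tb                           # 1.0 TB of user data
    print(raw_capacity / max_safe_data)              # 6.0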
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Dan van der Ster <dan(a)vanderster.com>
Sent: 04 February 2021 11:57:38
To: Mario Giammarco
Cc: Ceph Users
Subject: [ceph-users] Re: Worst thing that can happen if I have size= 2
On Thu, Feb 4, 2021 at 11:30 AM Mario Giammarco <mgiammarco(a)gmail.com> wrote:
On Wed, Feb 3, 2021 at 9:22 PM Dan van der Ster <dan(a)vanderster.com> wrote:
Lastly, if you can't afford 3x replicas, then use 2+2 erasure coding if possible.
I will investigate; I heard that erasure coding is slow.
Anyway, I will write here the reason for this thread:
At my customers' sites I usually have Proxmox+Ceph with:
- three servers
- three monitors
- 6 OSDs (two per server)
- size=3 and min_size=2
I followed the recommendations to stay safe.
But one day one disk in one server broke; the OSDs were at 55%.
What happened then?
Ceph started filling the remaining OSD to maintain size=3.
That OSD reached 90% and Ceph stopped everything.
The customer's VMs froze, and the customer lost time and some data that had not been written to disk.
So I got angry... size=3 and the customer still loses time and data?
You should size the OSD fullness configs in such a way that the failures you
expect would still leave sufficient capacity.
In our case, we plan so that we could lose and re-replicate an entire
rack and still have enough space left. (In other words, with 5-6 racks, we
start to add capacity when the clusters reach ~70-75% full.)
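As a rough sketch of that headroom rule in Python (the 0.90 threshold below stands in for a backfillfull-style limit and is an assumption, not necessarily your configured ratio):

    # If one of N racks fails, its data is rebuilt across the other N-1,
    # so each survivor ends up holding N/(N-1) times its previous data.
    # Solve fill * N/(N-1) <= limit for the highest safe average fill.
    def max_safe_fill(racks, limit=0.90):
        return limit * (racks - 1) / racks

    print(round(max_safe_fill(6), 2))   # 0.75
    print(round(max_safe_fill(5), 2))   # 0.72  -> hence "add capacity at ~70-75%"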
In your case, the issue is more extreme:
Because you have 3 hosts, 2 OSDs each, and 3 replicas: when one OSD
fails and is marked out, you are telling Ceph that *all* of the
objects will need to be written to the last remaining disk on the
host with the failure.
So unless your cluster was under 40-50% used, that OSD is going to
become overfull. (But BTW, Ceph will mark the loaded OSD backfillfull
before stopping I/O -- this should not have blocked your user
unless they *also* filled the disk with new data at the same time.)
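To make that concrete, a small sketch in Python (the 55% comes from Mario's report, and 0.90 is only an illustrative backfill threshold):

    # The two OSDs on one host share that host's full replica. If one is
    # marked out, its PGs backfill onto the neighbour, roughly doubling
    # the neighbour's fill.
    def fill_after_out(fill_before):
        return 2 * fill_before

    print(fill_after_out(0.55))   # 1.10 -> the survivor cannot hold it: overfull
    print(fill_after_out(0.45))   # 0.90 -> right at the threshold, so stay <40-50%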
IMO with a cluster this size, you should not ever mark out any OSDs --
rather, you should leave the PGs degraded, replace the disk (keep the
same OSD ID), then recover those objects to the new disk.
Or, keep it <40% used (which sounds like a waste).
-- dan
>
> Cheers, Dan
> On Wed, Feb 3, 2021, 8:49 PM Mario Giammarco <mgiammarco(a)gmail.com> wrote:
>>
>> Thanks Simon and thanks to other people that have replied.
>> Sorry but I try to explain myself better.
>> It is evident to me that if I have two copies of the data, one breaks, and
>> while Ceph is creating a new copy of the data the disk with the second
>> copy also breaks, then you lose the data.
>> It is obvious and a bit paranoid, because many servers at many customers run
>> on RAID1, and so you are saying: yes, you have two copies of the data, but
>> you can break both. Consider that in Ceph recovery is automatic, while with RAID1
>> someone must manually go to the customer and change the disks. So Ceph is
>> already an improvement in this case, even with size=2. With size 3 and min 2
>> it is a bigger improvement, I know.
>>
>> What I am asking is this: what happens with min_size=1 and split brain, network
>> down, or similar things? Does Ceph block writes because it has no quorum on the
>> monitors? Are there failure scenarios that I have not considered?
>> Thanks again!
>> Mario
>>
>>
>>
>> On Wed, Feb 3, 2021 at 5:42 PM Simon Ironside <sironside(a)caffetine.org> wrote:
>>
>> > On 03/02/2021 09:24, Mario Giammarco wrote:
>> > > Hello,
>> > > Imagine this situation:
>> > > - 3 servers with ceph
>> > > - a pool with size 2 min 1
>> > >
>> > > I know perfectly well that size 3 and min 2 is better.
>> > > I would like to know what is the worst thing that can happen:
>> >
>> > Hi Mario,
>> >
>> > This thread is worth a read, it's an oldie but a goodie:
>> >
>> >
>> >
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.ht…
>> >
>> > Especially this post, which helped me understand the importance of
>> > min_size=2
>> >
>> >
>> >
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014892.ht…
>> >
>> > Cheers,
>> > Simon
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io