I think the answer is very simple: data loss. You are setting yourself up for data loss.
Having only +1 redundancy is a design flaw, and you will be fully responsible for losing
data on such a set-up. If that is not a problem, then it's an option. If it will
get you fired, it's not.
There is a big difference between traditional RAID1
and Ceph. Namely, with
Ceph, there are nodes where OSDs are running, and these nodes need
maintenance. You want to be able to perform maintenance even if you have
one broken OSD, that's why the recommendation is to have three copies with
Ceph. There is no such "maintenance" consideration with traditional RAID1,
so two copies are OK there.
Yes, this is exactly the point. The keyword is "redundancy under degraded
conditions". It's not just that you want to be able to maintain stuff while an OSD is
down, you want to be able to do maintenance without risking data loss every single time. A
simple example is OS updates that require reboots. Each of these operations opens a
window of opportunity for something else to fail.
The other thing is admin errors. Redundancy under degraded conditions allows admins to
commit 1 or more extra mistakes during maintenance. I learned this the hard way when
upgrading our MON data disks. We have 3 MONs and I needed to migrate each MON store to new
storage. Of course I managed to install the new disks in one and wipe the MON store on
another MON. Two hours of downtime. I will upgrade to 5 MONs as soon as possible.
More serious examples are Ceph upgrades. There are plenty of instances where these went
wrong for one reason or another and people needed to redeploy entire OSD hosts. That is a
long window of opportunity for data loss during a complete rebuild.
And never trust your boss when he says "we will replace everything long before
MTBF". This is BS as soon as budgets get cut.
I think, however, another really important aspect is data safety. In a small cluster
you might get away with thinking in typical RAID terms. However, a scale-out cluster is
defined by the property that multiple simultaneous disk failures will be observed
regularly. "Simultaneous" here means failures within the window of opportunity opened by
degraded objects being present.
The threshold for observing this is not as high as one might think. Pushing prices means
pushing hardware to the physical limits, and quality control will not catch everything. We
got a batch of 8 disks that do not seem to be great. I already had one fail (half a year
into production) and others regularly show up with slow ops. It's not bad enough to get
them replaced, so I have to deal with it. They are all in one host, so I can sleep at
night, but it is quite likely that a few of them will go while the cluster is rebuilding
redundancy.
For scale-out storage the distributed RAID of Ceph comes to the rescue; without it, it
would be impossible to run a scale-out system. If you do the stats on the probability of
losing sufficiently many OSDs that share a PG, you will find that this probability
goes down exponentially with the number of extra copies/shards, where +1 just leaves you
at ordinary RAID level - meaning it's dangerous.
Taking all of this together, maintainability and probability of data loss, I regret that I
didn't go for EC 8+3 (3 extra shards) instead of 8+2. The same holds for replication:
3 copies is the lowest number that is safe, and 4 is a lot better.
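To illustrate the exponential scaling, here is a toy back-of-the-envelope calculation (an
assumption for illustration, not a real reliability model): suppose that after one OSD of
a PG fails, each remaining OSD holding that PG fails independently with probability p
before the rebuild completes. The value p = 0.01 below is made up.

```python
from math import comb

def p_loss_replicated(p, size):
    # With one copy already lost, data is lost only if every one of the
    # remaining size-1 copies also fails during the rebuild window.
    return p ** (size - 1)

def p_loss_ec(p, k, m):
    # An EC k+m profile tolerates m lost shards. With one shard already
    # gone, data is lost if m or more of the remaining k+m-1 shards fail.
    n = k + m - 1
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(m, n + 1))

p = 0.01  # assumed per-OSD failure probability during one rebuild window
print(f"size=2: {p_loss_replicated(p, 2):.2e}")  # 1.00e-02
print(f"size=3: {p_loss_replicated(p, 3):.2e}")  # 1.00e-04
print(f"size=4: {p_loss_replicated(p, 4):.2e}")  # 1.00e-06
print(f"EC 8+2: {p_loss_ec(p, 8, 2):.2e}")
print(f"EC 8+3: {p_loss_ec(p, 8, 3):.2e}")
```

Each extra copy or shard buys roughly another factor of p in safety, which is why +1
redundancy (size=2, or EC k+1) sits at ordinary RAID level while +2 and +3 are orders of
magnitude safer.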
Bottom line: data loss and ruined weekends/holidays are not worth going cheap. If I get
a hardware alert at night, I want to be able to turn around and continue sleeping.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Alexander E. Patrakov <patrakov(a)gmail.com>
Sent: 04 February 2021 11:35:27
To: Mario Giammarco
Cc: ceph-users
Subject: [ceph-users] Re: Worst thing that can happen if I have size= 2
There is a big difference between traditional RAID1 and Ceph. Namely, with
Ceph, there are nodes where OSDs are running, and these nodes need
maintenance. You want to be able to perform maintenance even if you have
one broken OSD, that's why the recommendation is to have three copies with
Ceph. There is no such "maintenance" consideration with traditional RAID1,
so two copies are OK there.
Thu, 4 Feb 2021 at 00:49, Mario Giammarco <mgiammarco(a)gmail.com>:
Thanks Simon and thanks to other people that have
replied.
Sorry but I try to explain myself better.
It is evident to me that if I have two copies of data and one breaks, and while
Ceph is creating a new copy of the data the disk with the second
copy also breaks, then you lose the data.
It is obvious and a bit paranoid, because many servers at many customers run
on RAID1, and so you are saying: yes, you have two copies of the data, but
both can break. Consider that in Ceph recovery is automatic, while with RAID1
someone must manually go to the customer and change disks. So Ceph is
already an improvement in this case, even with size=2. With size 3 and min 2
it is a bigger improvement, I know.
What I am asking is this: what happens with min_size=1 and a split brain, a network
outage, or similar events? Does Ceph block writes because it has no quorum on the
monitors? Are there failure scenarios that I have not considered?
Thanks again!
Mario
Il giorno mer 3 feb 2021 alle ore 17:42 Simon Ironside <
sironside(a)caffetine.org> ha scritto:
On 03/02/2021 09:24, Mario Giammarco wrote:
Hello,
Imagine this situation:
- 3 servers with ceph
- a pool with size 2 min 1
I know perfectly the size 3 and min 2 is better.
I would like to know what is the worst thing that can happen:
Hi Mario,
This thread is worth a read, it's an oldie but a goodie:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.ht…
Especially this post, which helped me understand the importance of
min_size=2
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014892.ht…
Cheers,
Simon
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io
--
Alexander E. Patrakov
CV:
http://u.pc.cd/wT8otalK