Many thanks!!
Regards
Marcus
On Fri, Dec 22, 2023 at 19:12:19 -0500, Anthony D'Atri
<aad@dreamsnake.net> wrote:
>> You can do that for a PoC, but that's a bad idea for any
>> production workload. You'd want at least three nodes with OSDs
>> to use the default RF=3 replication. You can do RF=2, but at the
>> peril of your mortal data.
> I'm not sure I agree - I think size=2, min_size=2 is no worse than
> RAID1 for data security.
size=2, min_size=2 *is* RAID1. Except that you become unavailable
if a single drive is unavailable.
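
For anyone following along, these are the pool-level knobs in
question; "mypool" is just a placeholder name:

    ceph osd pool set mypool size 2       # replicas to keep
    ceph osd pool set mypool min_size 2   # replicas required before a PG accepts IO

With both at 2, one down OSD makes the PG inactive, which is exactly
the unavailability I mean.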
> That isn't even the main risk as I understand it. Of course a
> double failure is going to be a problem with size=2, or traditional
> RAID1, and I think anybody choosing this configuration accepts this
> risk.
We see people often enough who don't know that. I've seen
double failures. YMMV.
> As I understand it, the reason min_size=1 is a trap has nothing to
> do with double failures per se.
It's one of the concerns.
> The issue is that Ceph OSDs are somewhat prone to flapping during
> recovery (OOM, etc). So even if the disk is fine, an OSD can go
> down for a short time. If you have size=2, min_size=1 configured,
> then when this happens the PG will become degraded and will
> continue operating on the other OSD, and the flapping OSD becomes
> stale. Then when it comes back up it recovers. The problem is that
> if the other OSD has a permanent failure (disk crash/etc) while the
> first OSD is flapping, now you have no good OSDs, because when the
> flapping OSD comes back up it is stale, and its PGs have no peer.
Indeed, arguably that's an overlapping failure. I've seen this
too, and have a pg query to demonstrate it.
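
For the curious, this is how it looks from the outside; the PG ID
2.1f here is just an example:

    ceph health detail          # lists degraded / stale / inactive PGs
    ceph pg dump_stuck stale    # PGs whose OSDs stopped reporting
    ceph pg 2.1f query          # peering state and history for one PG

In the scenario above, the query output shows why peering is blocked:
the only surviving copy is stale.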
> I suspect there are ways to re-activate it, though this will result
> in potential data inconsistency, since writes were allowed to the
> cluster and will then get rolled back.
Yep.
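
None of them pretty. The usual last resort, assuming the permanently
dead OSD is osd.5 (an example ID), is to declare it lost:

    ceph osd lost 5 --yes-i-really-mean-it   # tell the cluster osd.5 is never coming back

Peering can then proceed from the stale copy, at the cost of silently
rolling back the writes it never saw.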
> With only two OSDs I'm guessing that would be the main impact
> (well, depending on journaling behavior/etc), but if you have more
> OSDs than that then you could have situations where one file is
> getting rolled back, and some other file isn't, and so on.
But you'd have a voting majority.
> With min_size=2 you're fairly safe from flapping, because there
> will always be two replicas that have the most recent version of
> every PG, and so you can still tolerate a permanent failure of one
> of them.
Exactly.
> size=2, min_size=2 doesn't suffer this failure mode, because
> anytime there is flapping the PG goes inactive and no writes can be
> made, so when the other OSD comes back up there is nothing to
> recover. Of course this results in IO blocks and downtime, which is
> obviously undesirable, but it is likely a more recoverable state
> than inconsistent writes.
Agreed, the difference between availability and durability.
Depends what's important to you.
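
When it bites, the blocked IO shows up as inactive PGs, which is easy
to watch for:

    ceph pg dump_stuck inactive   # PGs currently unable to serve IO

The cluster sits there safe but down, and whether that is acceptable
is exactly the availability-versus-durability call.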
> Apologies if I've gotten any of that wrong, but my understanding is
> that it is these sorts of failure modes that cause min_size=1 to be
> a trap. This isn't the sort of thing that typically happens in a
> RAID1 config, or at least that admins don't think about.
It's both.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io