Re: Ceph newbee questions

List overview All Threads
Download

newer

older

Unable to find Refresh Interval...

CLT Meeting Minutes 2024-01-03

Anthony D'Atri

23 Dec 2023 23 Dec '23

3:12 a.m.

...

You can do that for a PoC, but that's a bad idea for any production workload. You'd want at least three nodes with OSDs to use the default RF=3 replication. You can do RF=2, but at the peril of your mortal data.

I'm not sure I agree - I think size=2, min_size=2 is no worse than RAID1 for data security.

size=2, min_size=2 *is* RAID1. Except that you become unavailable if a single drive is unavailable.

...

That isn't even the main risk as I understand it. Of course a double

failure is going to be a problem with size=2, or traditional RAID1, and I think anybody choosing this configuration accepts this risk.

We see people often enough who don’t know that. I’ve seen double failures. ymmv.

...

As I understand it, the reason min_size=1 is a trap has nothing to do with double failures per se.

It’s one of the concerns.

...

The issue is that Ceph OSDs are somewhat prone to flapping during recovery (OOM, etc). So even if the disk is fine, an OSD can go down for a short time. If you have size=2, min=1 configured, then when this happens the PG will become degraded and will continue operating on the other OSD, and the flapping OSD becomes stale. Then when it comes back up it recovers. The problem is that if the other OSD has a permanent failure (disk crash/etc) while the first OSD is flapping, now you have no good OSDs, because when the flapping OSD comes back up it is stale, and its PGs have no peer.

Indeed, arguably that’s an overlapping failure. I’ve seen this too, and have a pg query to demonstrate it.

...

I suspect there are ways to re-activate it, though this will result in potential data inconsistency since writes were allowed to the cluster and will then get rolled back.

Yep.

...

With only two OSDs I'm guessing that would be the main impact (well, depending on journaling behavior/etc), but if you have more OSDs than that then you could have situations where one file is getting rolled back, and some other file isn't, and so on.

But you’d have a voting majority.

...

With min_size=2 you're fairly safe from flapping because there will always be two replicas that have the most recent version of every PG, and so you can still tolerate a permanent failure of one of them.

Exactly.

...

size=2, min=2 doesn't suffer this failure mode, because anytime there is flapping the PG goes inactive and no writes can be made, so when the other OSD comes back up there is nothing to recover. Of course this results in IO blocks and downtime, which is obviously undesirable, but it is likely a more recoverable state than inconsistent writes.

Agreed, the difference between availability and durability. Depends what’s important to you.

...

Apologies if I've gotten any of that wrong, but my understanding is that it is these sorts of failure modes that cause min_size=1 to be a trap. This isn't the sort of thing that typically happens in a RAID1 config, or at least that admins don't think about.

It’s both.

Show replies by date

Marcus

1 Jan 1 Jan

11:33 p.m.

New subject: Ceph newbee questions

Hi and thanks for your answers! So my understanding from this, make sure that the "admin" node have a fast CPU and the minimum data nodes for production is three and max = 3 and min = 2. You can expand this cluster by just adding one data node, there is no need to expand with another 3 nodes, right? With three data nodes and things starts to break. If one disk or a few disks break, you will still have two copies of your objects, cluster will be in degraded mode until disks are replaced? There is no healing or restructuring of objects while a couple of disks are broken? If one of the data nodes break, there are still 2 copies of the objects and the cluster will run in degraded mode until server is replaced and data is replicated? If two data nodes break, the cluster will fail but there is still one copy of the objects so if the two nodes are replaced the cluster and all objects will be there after replication. No lost data? Many thanks!! Regards Marcus On fre, dec 22 2023 at 19:12:19 -0500, Anthony D'Atri <aad(a)dreamsnake.net> wrote:

...

I'm not sure I agree - I think size=2, min_size=2 is no worse than RAID1 for data security.

size=2, min_size=2 *is* RAID1. Except that you become unavailable if a single drive is unavailable.

That isn't even the main risk as I understand it. Of course a double

failure is going to be a problem with size=2, or traditional RAID1, and I think anybody choosing this configuration accepts this risk.

We see people often enough who don’t know that. I’ve seen double failures. ymmv.

As I understand it, the reason min_size=1 is a trap has nothing to do with double failures per se.

It’s one of the concerns.

Indeed, arguably that’s an overlapping failure. I’ve seen this too, and have a pg query to demonstrate it.

I suspect there are ways to re-activate it, though this will result in potential data inconsistency since writes were allowed to the cluster and will then get rolled back.

Yep.

But you’d have a voting majority.

Exactly.

Agreed, the difference between availability and durability. Depends what’s important to you.

It’s both. _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io <mailto:ceph-users@ceph.io> To unsubscribe send an email to ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io>

Anthony D'Atri

2 Jan 2 Jan

1 a.m.

New subject: Ceph newbee questions

...

Hi and thanks for your answers! So my understanding from this, make sure that the "admin" node have a fast CPU

You don’t strictly need an admin node as such. Only worry about clock rate if you’re doing CephFS.

...

max = 3 and min = 2.

For your pools, size=3, min_size=2. These are defaults, don’t worry about them.

...

You can expand this cluster by just adding one data node, there is no need to expand with another 3 nodes, right?

With defaults, yes.

...

With three data nodes and things starts to break. If one disk

yes

...

or a few disks break

Depends on which disks.

...

you will still have two copies of your objects, cluster will be in degraded mode until disks are replaced?

Ceph will heal itself if it can to restore redundancy. With three nodes if you lose one node, it can’t heal and it’ll be in degraded mode but data will be available. If you lose just one drive, unless the cluster is very full, redundancy will be restored using the surviving drives on that node.

...

There is no healing or restructuring of objects while a couple of disks are broken?

Depends on which disks. If you lose one on each of 3 nodes at the same time, that’s a problem. If you lose one on each of 3 nodes a week apart, probably not.

...

If one of the data nodes break, there are still 2 copies of the objects and the cluster will run in degraded mode until server is replaced and data is replicated?

Yes.

...

If two data nodes break, the cluster will fail but there is still one copy of the objects so if the two nodes are replaced the cluster and all objects will be there after replication. No lost data?

Correct, *if* nothing happens to the survivors. But unless you take manual steps, data will be unavailable. Most of the time if a node fails you can replace a DIMM etc. and bring it back.

...

Many thanks!! Regards Marcus On fre, dec 22 2023 at 19:12:19 -0500, Anthony D'Atri <aad(a)dreamsnake.net> wrote:

I'm not sure I agree - I think size=2, min_size=2 is no worse than RAID1 for data security.

size=2, min_size=2 *is* RAID1. Except that you become unavailable if a single drive is unavailable.

That isn't even the main risk as I understand it. Of course a double

failure is going to be a problem with size=2, or traditional RAID1, and I think anybody choosing this configuration accepts this risk.

We see people often enough who don´t know that. I´ve seen double failures. ymmv.

As I understand it, the reason min_size=1 is a trap has nothing to do with double failures per se.

It´s one of the concerns.

Indeed, arguably that´s an overlapping failure. I´ve seen this too, and have a pg query to demonstrate it.

I suspect there are ways to re-activate it, though this will result in potential data inconsistency since writes were allowed to the cluster and will then get rolled back.

Yep.

But you´d have a voting majority.

Exactly.

Agreed, the difference between availability and durability. Depends what´s important to you.

It´s both. _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io <mailto:ceph-users@ceph.io> To unsubscribe send an email to ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io>

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Marcus

3 Jan 3 Jan

11:01 p.m.

New subject: Ceph newbee questions

Hi, thanks for your answers! On mån, jan 1 2024 at 17:00:59 -0500, Anthony D'Atri <aad(a)dreamsnake.net> wrote:

...

Hi and thanks for your answers! So my understanding from this, make sure that the "admin" node have a fast CPU

You don’t strictly need an admin node as such. Only worry about clock rate if you’re doing CephFS.

So an admin node is not required? I can run all the daemons on three "fileservers"? As we will use CephFS, we want good clock rate, but fileserver running OSD:s wants multiple cores, right? Or did I missunderstand?

...

max = 3 and min = 2.

For your pools, size=3, min_size=2. These are defaults, don’t worry about them.

You can expand this cluster by just adding one data node, there is no need to expand with another 3 nodes, right?

With defaults, yes.

With three data nodes and things starts to break. If one disk

yes

or a few disks break

Depends on which disks.

you will still have two copies of your objects, cluster will be in degraded mode until disks are replaced?

There is no healing or restructuring of objects while a couple of disks are broken?

Depends on which disks. If you lose one on each of 3 nodes at the same time, that’s a problem. If you lose one on each of 3 nodes a week apart, probably not.

If one of the data nodes break, there are still 2 copies of the objects and the cluster will run in degraded mode until server is replaced and data is replicated?

Yes.

If two data nodes break, the cluster will fail but there is still one copy of the objects so if the two nodes are replaced the cluster and all objects will be there after replication. No lost data?

Correct, *if* nothing happens to the survivors. But unless you take manual steps, data will be unavailable. Most of the time if a node fails you can replace a DIMM etc. and bring it back.

Many thanks!! Regards Marcus On fre, dec 22 2023 at 19:12:19 -0500, Anthony D'Atri <aad(a)dreamsnake.net <mailto:aad@dreamsnake.net>> wrote:

> You can do that for a PoC, but that's a bad idea for any > production workload. You'd want at least three nodes with OSDs > to use the default RF=3 replication. You can do RF=2, but at the > peril of your mortal data. I'm not sure I agree - I think size=2, min_size=2 is no worse than RAID1 for data security.

size=2, min_size=2 *is* RAID1. Except that you become unavailable if a single drive is unavailable.

> That isn't even the main risk as I understand it. Of course a > double failure is going to be a problem with size=2, or traditional RAID1, and I think anybody choosing this configuration accepts this risk.

We see people often enough who don´t know that. I´ve seen double failures. ymmv.

As I understand it, the reason min_size=1 is a trap has nothing to do with double failures per se.

It´s one of the concerns.

Indeed, arguably that´s an overlapping failure. I´ve seen this too, and have a pg query to demonstrate it.

I suspect there are ways to re-activate it, though this will result in potential data inconsistency since writes were allowed to the cluster and will then get rolled back.

Yep.

But you´d have a voting majority.

Exactly.

Agreed, the difference between availability and durability. Depends what´s important to you.

It´s both. _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io <mailto:ceph-users@ceph.io> <<mailto:ceph-users@ceph.io>> To unsubscribe send an email to ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io> <<mailto:ceph-users-leave@ceph.io>>

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io <mailto:ceph-users@ceph.io> To unsubscribe send an email to ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io>

Anthony D'Atri

4 Jan 4 Jan

1:16 a.m.

New subject: Ceph newbee questions

...

You don’t strictly need an admin node as such. Only worry about clock rate if you’re doing CephFS.

So an admin node is not required?

It isn’t. An admin node basically is any system with the admin keys installed. With production clusters there can be some advantages to having one, even if it’s a VM, but a dedicated system is not required.

...

I can run all the daemons on three "fileservers"?

Yes.

...

As we will use CephFS, we want good clock rate, but fileserver running OSD:s wants multiple cores, right?

There are a lot of variables that go into those questions. The CephFS MDS is AIUI single-threaded, so it benefits from a high clock rate. You can also run more than one, with the workload distributed in various ways. If you’re looking at a three node cluster, presumably as a PoC, probably don’t worry about that yet. Desirable core count is a function of OSD drive count, type and size of OSD drive, etc.

...

Or did I missunderstand?

max = 3 and min = 2.

For your pools, size=3, min_size=2. These are defaults, don’t worry about them.

You can expand this cluster by just adding one data node, there is no need to expand with another 3 nodes, right?

With defaults, yes.

With three data nodes and things starts to break. If one disk

yes

or a few disks break

Depends on which disks.

you will still have two copies of your objects, cluster will be in degraded mode until disks are replaced?

There is no healing or restructuring of objects while a couple of disks are broken?

Depends on which disks. If you lose one on each of 3 nodes at the same time, that’s a problem. If you lose one on each of 3 nodes a week apart, probably not.

If one of the data nodes break, there are still 2 copies of the objects and the cluster will run in degraded mode until server is replaced and data is replicated?

Yes.

If two data nodes break, the cluster will fail but there is still one copy of the objects so if the two nodes are replaced the cluster and all objects will be there after replication. No lost data?

Correct, *if* nothing happens to the survivors. But unless you take manual steps, data will be unavailable. Most of the time if a node fails you can replace a DIMM etc. and bring it back. > Many thanks!! > Regards > Marcus > On fre, dec 22 2023 at 19:12:19 -0500, Anthony D'Atri <aad(a)dreamsnake.net <mailto:aad@dreamsnake.net>> wrote: >>>> You can do that for a PoC, but that's a bad idea for any production workload. You'd want at least three nodes with OSDs to use the default RF=3 replication. You can do RF=2, but at the peril of your mortal data. >>> I'm not sure I agree - I think size=2, min_size=2 is no worse than >>> RAID1 for data security. >> size=2, min_size=2 *is* RAID1. Except that you become unavailable if a single drive is unavailable. >>>> That isn't even the main risk as I understand it. Of course a double >>> failure is going to be a problem with size=2, or traditional RAID1, >>> and I think anybody choosing this configuration accepts this risk. >> We see people often enough who don´t know that. I´ve seen double failures. ymmv. >>> As I understand it, the reason min_size=1 is a trap has nothing to do >>> with double failures per se. >> It´s one of the concerns. >>> The issue is that Ceph OSDs are somewhat prone to flapping during >>> recovery (OOM, etc). So even if the disk is fine, an OSD can go down >>> for a short time. If you have size=2, min=1 configured, then when >>> this happens the PG will become degraded and will continue operating >>> on the other OSD, and the flapping OSD becomes stale. Then when it >>> comes back up it recovers. The problem is that if the other OSD has a >>> permanent failure (disk crash/etc) while the first OSD is flapping, >>> now you have no good OSDs, because when the flapping OSD comes back up >>> it is stale, and its PGs have no peer. >> Indeed, arguably that´s an overlapping failure. I´ve seen this too, and have a pg query to demonstrate it. >>> I suspect there are ways to re-activate it, though this will result in potential data >>> inconsistency since writes were allowed to the cluster and will then >>> get rolled back. >> Yep. >>> With only two OSDs I'm guessing that would be the >>> main impact (well, depending on journaling behavior/etc), but if you >>> have more OSDs than that then you could have situations where one file >>> is getting rolled back, and some other file isn't, and so on. >> But you´d have a voting majority. >>> With min_size=2 you're fairly safe from flapping because there will >>> always be two replicas that have the most recent version of every PG, >>> and so you can still tolerate a permanent failure of one of them. >> Exactly. >>> size=2, min=2 doesn't suffer this failure mode, because anytime there >>> is flapping the PG goes inactive and no writes can be made, so when >>> the other OSD comes back up there is nothing to recover. Of course >>> this results in IO blocks and downtime, which is obviously >>> undesirable, but it is likely a more recoverable state than >>> inconsistent writes. >> Agreed, the difference between availability and durability. Depends what´s important to you. >>> Apologies if I've gotten any of that wrong, but my understanding is >>> that it is these sorts of failure modes that cause min_size=1 to be a >>> trap. This isn't the sort of thing that typically happens in a RAID1 >>> config, or at least that admins don't think about. >> It´s both. >> _______________________________________________ >> ceph-users mailing list -- ceph-users(a)ceph.io <mailto:ceph-users@ceph.io> <<mailto:ceph-users@ceph.io>> >> To unsubscribe send an email to ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io> <<mailto:ceph-users-leave@ceph.io>> > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io <mailto:ceph-users@ceph.io> > To unsubscribe send an email to ceph-users-leave(a)ceph.io <mailto:ceph-users-leave@ceph.io>

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

137

days inactive

148

days old

ceph-users@ceph.io

Manage subscription

4 comments

2 participants

tags (0)

participants (2)

Anthony D'Atri
Marcus