Suppose a client sends a write on object foo to osds 0, 1, 2. osds 1
and 2 are shut down, but not before osd 0 records the update to foo
locally. At this point, osd 0 is informed that osds 1 and 2 have gone
down. However, read_min_size=1, so it begins accepting reads (*). A
client reads foo, so osd 0 returns the new state. osd 0 goes down.
osd 1 and 2 come back up. The client reads foo again. osd 1 (now the
primary) returns its copy of foo, but that copy is out of date,
resulting in an inconsistent read.
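The sequence above can be replayed with a toy model. This is only a sketch under simplifying assumptions: the Osd class, the primary() rule (lowest-numbered up OSD) and the read() helper are hypothetical names made up for illustration, and none of it reflects how Ceph peering is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Osd:
    """Toy replica: a name, an up/down flag, and a local object store."""
    name: str
    up: bool = True
    store: dict = field(default_factory=dict)


def primary(osds):
    """Lowest-numbered up OSD acts as primary (a simplification)."""
    return next((o for o in osds if o.up), None)


def read(osds, key, read_min_size):
    """Serve a read from the primary once read_min_size replicas are up."""
    if sum(o.up for o in osds) < read_min_size:
        return None  # not enough replicas up: the read would block
    return primary(osds).store.get(key)


osds = [Osd("osd.0"), Osd("osd.1"), Osd("osd.2")]
for o in osds:
    o.store["foo"] = "v1"  # state every replica agrees on

# The write of v2 reaches only osd.0 before osd.1 and osd.2 shut down.
osds[0].store["foo"] = "v2"
osds[1].up = osds[2].up = False

first = read(osds, "foo", read_min_size=1)  # (*) osd.0 goes readable alone

# osd.0 goes down, osd.1 and osd.2 come back; osd.1 is now primary.
osds[0].up = False
osds[1].up = osds[2].up = True
second = read(osds, "foo", read_min_size=1)

print(first, second)  # v2 v1
```

The client sees v2 and then v1. Had the first read been issued with read_min_size=2, read() would have returned None at step (*) and the new state would never have been exposed.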
Without read_min_size, (*) can't happen unless min_size=1, in which
case (**) won't happen until osd.0 comes back up. On the flip side,
with min_size=2, (*) will result in osd.0 finding another osd to bring
up to date prior to accepting reads and writes (as there'd be enough
copies then). If it dies prior to that point, osd.1 and 2 can
safely assume that their state is new enough to serve IO.
This is not the only valid choice in the design space, but the
alternatives all involve tradeoffs. For instance, if we perform
non-destructive writes
and therefore have prior copies of foo, we could conceivably return a
prior version of foo known to have been committed to min_size
replicas, but that would in general require something like an
additional distributed commit prior to acking to the client or serving
reads (to ensure that all replicas have the new read bound) on the
object in addition to the overhead of non-destructive writes (and I'm
almost certainly missing things even so).
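As a rough illustration of the "read bound" idea, here is a minimal sketch assuming each object keeps every version plus the set of replicas known to have persisted it. VersionedObject, record() and read_bound() are hypothetical names, not Ceph code.

```python
from collections import defaultdict


class VersionedObject:
    """Toy model: every write creates a new version; old ones are kept."""

    def __init__(self, min_size):
        self.min_size = min_size
        self.values = {}              # version -> value
        self.acks = defaultdict(set)  # version -> replicas that persisted it

    def record(self, version, value, replica):
        """Note that `replica` has persisted `version`."""
        self.values[version] = value
        self.acks[version].add(replica)

    def read_bound(self):
        """Newest version persisted on at least min_size replicas, or None."""
        committed = [v for v, r in self.acks.items() if len(r) >= self.min_size]
        return max(committed, default=None)

    def read(self):
        """Return the value at the read bound, ignoring newer versions."""
        bound = self.read_bound()
        return None if bound is None else self.values[bound]


foo = VersionedObject(min_size=2)
for replica in ("osd.0", "osd.1", "osd.2"):
    foo.record(1, "old", replica)
foo.record(2, "new", "osd.0")   # only osd.0 has the new version so far
print(foo.read())               # old
foo.record(2, "new", "osd.1")   # a second replica persists it
print(foo.read())               # new
```

The sketch keeps the ack sets in one place, which hides the expensive part: in a real cluster every replica would have to learn the current bound before serving reads, and that is the extra distributed commit mentioned above.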
-Sam
On Tue, Jan 12, 2021 at 7:10 AM Prasad Krishnan
<prasad.krishnan(a)flipkart.com> wrote:
Hi Sam,
Thank you for responding, and apologies for missing your reply; I noticed it
only recently.
On Tue, Jan 5, 2021 at 6:16 AM Sam Just <sjust(a)redhat.com> wrote:
Part of the answer is that going "readable" with read_min_size
replicas has a side effect of committing any writes those replicas
happen to know about, whether or not they were actually committed to
write_min_size replicas, because once we've served a read
reflecting those writes, all future reads must also reflect those
writes.
I'm wondering why it isn't a problem now? If there's a mechanism that
prevents uncommitted write transactions from being read now, I can't see
how read/write min_size separation would break that.
Thanks,
Prasad Krishnan
-Sam
On Thu, Dec 24, 2020 at 6:14 AM Prasad Krishnan
<prasad.krishnan(a)flipkart.com> wrote:
Dear Ceph developers,
Presently Ceph has a single config option named min_size which decides the minimum number
of copies that must be available before any client I/O operation (read or write) can be
performed on a given RADOS pool.
Would it make sense to split it into two, i.e. read_min_size and write_min_size, to allow
better data availability?
For instance, in a pool with replication size of 3 (where 3 copies are stored), if two
OSDs go down, we would want to avoid client write operations (to reduce risk of data loss)
but allow client read operations from the single copy that is available. This can be done
by setting read_min_size to 1 while keeping write_min_size at 2.
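The intended availability rule can be written down as a small sketch. Note that read_min_size and write_min_size are the options proposed here, not existing Ceph settings, and allowed_ops() is a hypothetical helper used only for illustration.

```python
def allowed_ops(up_replicas, read_min_size, write_min_size):
    """Return the client operations the proposed split would permit."""
    ops = set()
    if up_replicas >= read_min_size:
        ops.add("read")
    if up_replicas >= write_min_size:
        ops.add("write")
    return ops


# Pool with size=3, read_min_size=1, write_min_size=2.
for up in (3, 2, 1, 0):
    print(up, sorted(allowed_ops(up, read_min_size=1, write_min_size=2)))
# 3 ['read', 'write']
# 2 ['read', 'write']
# 1 ['read']
# 0 []
```

This captures only when I/O would be admitted; it says nothing about whether the data served by a single surviving copy is consistent with what other replicas hold.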
Are there any technical reasons why this cannot work? Any pitfalls that I don't
foresee?
Thanks,
K.Prasad
-----------------------------------------------------------------------------------------
This email and any files transmitted with it are confidential and intended solely for the
use of the individual or entity to whom they are addressed. If you have received this
email in error, please notify the system manager. This message contains confidential
information and is intended only for the individual named. If you are not the named
addressee, you should not disseminate, distribute or copy this email. Please notify the
sender immediately by email if you have received this email by mistake and delete this
email from your system. If you are not the intended recipient, you are notified that
disclosing, copying, distributing or taking any action in reliance on the contents of this
information is strictly prohibited.
Any views or opinions presented in this email are solely those of the author and do not
necessarily represent those of the organization. Any information on shares, debentures or
similar instruments, recommended product pricing, valuations and the like are for
information purposes only. It is not meant to be an instruction or recommendation, as the
case may be, to buy or to sell securities, products, services nor an offer to buy or sell
securities, products or services unless specifically stated to be so on behalf of the
Flipkart group. Employees of the Flipkart group of companies are expressly required not to
make defamatory statements and not to infringe or authorise any infringement of copyright
or any other legal right by email communications. Any such communication is contrary to
organizational policy and outside the scope of the employment of the individual concerned.
The organization will not accept any liability in respect of such communication, and the
employee responsible will be personally liable for any damages or other liability
arising.
Our organization accepts no liability for the content of this email, or for the
consequences of any actions taken on the basis of the information provided, unless that
information is subsequently confirmed in writing. If you are not the intended recipient,
you are notified that disclosing, copying, distributing or taking any action in reliance
on the contents of this information is strictly prohibited.
-----------------------------------------------------------------------------------------
_______________________________________________
Dev mailing list -- dev(a)ceph.io
To unsubscribe send an email to dev-leave(a)ceph.io