On 2020-01-26 07:58, Gregory Farnum wrote:
I’ve often thought idly about what we could do if we gave up on the
overbearing strict consistency Ceph provides and built something more
specifically suited to serving up block devices, so it’s very cool to
see somebody’s gotten more organized about it and written their
thoughts down. I for one would be happy to discuss this topic at
Yes, would be great to meet and to discuss the topic in person.
I’ve read through this proposal with varying levels of
attention so I
may have missed something, but there are a few important points I
think you’ve missed in your writeup that need addressing, and a few
choices you’ve made that you’ll want to be ready to defend. :)
First of all thanks for your time. It is always hard to get
through someone else’s thoughts.
The first and most important one is this: while a
system like you
describe could be *Ceph*, it is fundamentally and absolutely not
*RADOS*. The transactions RADOS provides in its ordered log are not
solely about recovery; we expose transactions and that ordered log in
great detail to the user over the network protocol with operations
like object append, object classes, compare-and-swap, the entire omap
interface, and even by returning object versions on every write
Those operations you've mentioned are perfectly fit synchronous path
and can follow the log as it is right now (I attribute them to
management operations). I suppose the only concurrent access
which we deal with is access to a single object from different
clients, right? (otherwise, why client, which issues any two operations
which has to land on a disk in a specific order, does not care himself
about the order? Like any fs or dbms, etc do).
Append operation for example. I can imagine the situation when two
clients append a record to an object which acts as a journal (I'm
speculating, but the case seems valid). Those two want a chunk
in the object in full ownership, which accidentally can't be overwritten
by someone else. Those two append operations go through a synchronous
path and two client requests eventually will be somehow ordered on the
primary (the order is undefined, because access is concurrent, but
nobody cares). And this "somehow ordered" has a strong guarantee that
these clients eventually will have different offsets in ownership.
If another client breaks the convention and issues a write IO to a
offset of this journal object, then obviously this client is naughty
and introduces a data corruption. And seems from this nasty behavior
there is no any protection even with the synchronous path through a
primary (there is an undefined access order from different clients,
this access is serialized on the primary).
A good example is a simple variable in memory (move away from
transactions and storages). Two threads have a convention introduced
by a developer, that the variable is shared and can be accessed
concurrently. In order to protect the variable from the concurrent
access developer has to do special things (locking, atomic instructions,
etc). If one of the threads break the convention (e.g. does not use
an atomic instruction in order to do an increment), then you have
a corruption. Developer *has* to think about many things when memory
model is weak (read/write barriers, cpu caches, etc) and usually this
affects the performance. But when variable is not shared, then why
read/write access to that variable has to follow all the complicated
conventions on a fast path?
IMO this example from the programming world has a strong parallel
with the data access for the distributed storage. If you maintain
a journal or any object which is shared, then special API should
be provided (as it is right now) to guarantee some sorts of specific
data mutation effects.
You also mentioned the following operations: "compare-and-swap, the
entire omap interface, and even by returning object versions on every
write operation!". Let me try to picture how they could be implemented.
Compare-and-swap: access serialized through the primary, thus primary
does compare and if compare succeeds does log update (log is
then write is issued to the disk and same write operation is issued
to other replicas. Other replicas do data update, without compare.
If there is an inflight write IO, which bypasses compare-and-swap
convention, then block will be corrupted. Corruption does not mean,
that replicas will be out of sync, but corruption means that data
chunk will be inconsistent, i.e. client gets what it does not expect.
But this is fine! We can always return to the example with threads
and variable: two threads invoke compare-and-swap operations,
the third one does not. Obviously the third one corrupts the data.
omap interface + xattr: is not an IO, can go fully through a log,
like it is right now.
object versions on every write operation: if each write has a
monotonic timestamp, then why it can't be used as particular
version of an object? Here this thingy can be different,
to current use-cases, but frankly I do not see here the whole
picture, so do not want to speculate (need some particular
These operations aren’t just novelties; RGW, RBD, and
CephFS metadata schemes all rely on them implicitly and explicitly —
and even for the “simple” RBD data objects you’re targeting, the
changes you describe break how RADOS does snapshots.
rados snapshots are based on clone, right? Clone operation should
follow the sync path (through the single primary) but still can be
a bit tricky (requires object to be identical on all replicas
to the moment, when clone is done) and can be implemented by
communicating between replicas, where primary is the main
coordinator. Here is what I imagine (some sort of lazy data
1. hey, replicas, do object cloning, right now! here is the
timestamp when we start!
[after a while, replicas notify they are done, a new object
is presumably is different on replicas, but since each
replica knows when clone op is started then it will not be
difficult for replicas to catch-up the final object state,
when clone op is finished]
2. hey, replicas, are you done? Good, here is the end
timestamp, when clone is finished, let's exchange what
Again, here the whole logic does not care about other IO which
are coming meanwhile. What is important for cloned object
is to be identical on all replicas. If IO has not been stopped
while object is cloning, then, obviously, clients know what
they are doing, thus cloned object can be at any state
(what is guaranteed, that the state will be equal on all replicas).
(I think the RBD
data journal also relies on a number of these, but I’m not certain.)
RADOS actually *is* a primitive database, not just a disk, and I don’t
think we can usefully separate operations into the “management” and
“data” division you suggest.
Everything which is not plain read/write is treated as management
or metadata requests.
(Among other things, any write op has the
potential to change the object’s size and that needs to be ordered
with truncate operations.)
Why these two operations have to be ordered? No, I will ask another
way. Why distributed storage should care about the order of these
two operations? That what I do not understand. Why client can't be
responsible for proper waiting of IO and only then issuing a truncate?
(direct analogy: you issue an IO to your file and do not wait for a
completion, then you truncate your file, what is the result?)
But if we are talking about concurrent clients case, when one of the
clients issues write, meanwhile another one issues truncate, than
I do not understand how does the sync log help, because the primary
replica can receive these two requests at any order (we assume no
proper locking is used, right?)
Now, making these changes isn’t necessarily bad if we
want to develop
a faster but less ridiculously-consistent storage system to better
serve the needs of the interfaces that actually get deployed — I have
long found it a little weird that RADOS, a strictly-consistent
transactional object store, is one of the premier providers of virtual
block-device IO and S3 storage. But if that’s the goal, we should
embrace being not-RADOS and be willing to explicitly take much larger
departures from it than just “in the happy path we drop ordering and
fall back to backfill if there’s a problem”.
Current constraints are blockers for the IO performance. It does not
matter how much we squeeze from the CPU (crimson project), unless we
can't relax IO ordering or reduce journaling effects, the overall
CPU cycles improvements can be not so impressive.
so I hope Ceph can make a step forward and be less conservative,
especially when we have a hardware, which breaks all the possible
The second big point is that if you want to have a
happy path and a
fallback ordered path, you’ll need to map out in a lot more detail how
those interact and how the clients and OSDs switch between them. Ideas
like this have come up before but almost every one (or literally every
one?) has had a fatal flaw that prevented it actually being safe.
Here I rely on a fact, that replicas know the PG state (as it is right
now). If PG is active and clean then replica accepts IO. If not -
IO is rejected with the proper error: "dear client, go to the primary,
I'm not in the condition to serve your request, but primary can".
Here several scenarios are possible. Client was the first one who
observes a replica in not a healthy state. We can expect all other
replicas will observe the same not healthy state sooner, but client
can propagate this information to other replicas in PG (need to be
discussed in detail).
Another scenario is that client was slower and the whole PG on all
replicas is observed as not healthy. Then client takes "the primary"
path and talks to the primary while peering, syncing etc.
Thirdly, you talk a lot about guaranteeing READ
RECENCY and rejecting
IOs from clients when they arrive at replicas in different orders —
but I don’t see how you detect that case is occurring. If every client
op has a timestamp associated with it, you need to remember all the
client ops that might conflict with each other and then reject them as
needed, which sounds an awful lot like an ordered log to me!
There is only one client operation which we care on fast IO path: write
operation. We need to remember a timestamp of write operation to a
certain offset of an object in order to delay reads to a not persistent
This structure is not a log, rather an rb-tree or hash table in-memory.
probably do something with in-memory op logs and having the object
primary tell replicas when an op is fully acked, but I haven’t thought
This is a in-memory structure, which is not supposed to be persistent.
Imagine any simple rb-tree or hash table, which belongs to an object and
has an offset of a block as a key and a timestamp as a value. When
crashes the whole peering and syncing procedure is performed, where
everything is serialized by a primary, thus this structure is not needed
while PG is not in a clean and healthy state.
In order not to bloat this structure infinitely by tracking all updates
to all blocks of all objects replicas can have a protocol, where primary
is an initiator and sends to all replicas in a heartbeat (or each
10 minutes, does not matter) highest timestamp, and replicas evict all
key-value pairs for the object, where values are lower than this
timestamp. This will guarantee, that replicas won't accept requests
the timestamp lower than this value. If any comes, then replica has to
reject and to ask client to repeat the request, obviously with a newer
In terms of design choices you’ve suggested, I’m not really sure why
you want to do client-side replication.
*Hybrid* client-side replication :) When client is responsible for
out write requests only in case of healthy pg.
It is frequently undesirable
since OSDs tend to have lower latency and more bandwidth to their
peers than the clients do to the OSDs;
Latency is the answer. I want to squeeze everything from RDMA. For
Ceph RDMA is dead. Basically for current implementation any per-client
improvements on transport side bring nothing. (I spent some time poking
the protocol v1 and had a good speed up on transport side, which is
unnoticed for the whole per-client IO performance. sigh)
this system still has to do
acks to a new per-object primary *anyway*;
Yes, but that affects *only* reads following writes to the same
offset of an object. Acks are needed *only* to delay reads
to not fully persistent blocks. (READ RECENT rule in action).
Acks do not affect writes. Writes come at any order.
Acks do not affect reads to other blocks of that object.
1. All replicas accept write request from a client (do write
to the disk, etc).
2. All replicas send ack to the client and at the same
time send acks to the primary.
So primary and client can receive acks at the same time (or in
different order). If client is not lucky enough and then immediately
(after the write request) issues a read request to the primary replica
to the same object offset, then this read *can* be delayed (rejected).
and the usual design reason
to for client-side replication is because it lets the data store
daemons be much simpler, but these OSDs are actually getting *more*
complicated than our existing ones! Client-side replication op latency
is going to be
MAX(client network hop, client data transmission*3) + op permission
checking + op processing time + disk commit time + client network hop
whereas for primary-replication it will be
client data transmission + op permission checking + max(OSD network
hop, OSD data transmission*2) + op processing time + disk commit time
+ OSD network hop + network hop
How those formulas compare in practice will vary but is often better
for the OSD replication — which is certainly much simpler to work with
and reason about.
Your equation is a bit too much for me. My logic is simple: client
writes to several replicas without any constraints, directly to disks,
gets all acks and write request is considered as completed. That is
fast IO path.
Again, acks from replicas to the primary are not involved, this is a
In a similar vein, I’m not sure what the point is of distributing
reads out by having per-object read primaries. We already distribute
reads by PG, and I guess the larger number of objects might fix
balance problems if we have them, but that seems like it’s pushing the
argument. Moreover it complicates the design by forcing replicas to
keep track of more state and reduces cache efficiency of the cluster
since we need at least some PG metadata in memory in more places.
Frankly, I liked the idea that reads can be served by different
in PG. I read the paper (PROAR something something) and thought that
this can be taken and applied to what I have in my mind.
And I think this is more of a nit, but we really,
really don’t want
client clocks to be involved in any decision-making at all, even if we
hope it’s just going to be for uncommon conflict resolutions. Sure, it
is theoretically possible to get good clock sync if you control the
environment, but it’s incredibly finicky. For instance, here’s a
ticket where a monotonic clock is somehow going backwards in time:
Luckily I don’t think the conflict resolution you’re using timestamps
for actually needs them; we can probably design a logical clock of
some description that will be safer and reasonably simple.
We do not need clocks. Physical clocks. All we need is something
in one direction. Resolution is also not very much important (hundreds
ms should be pretty enough, I think). This time-based concurrency
is needed only to guarantee certain order on replicas in case of
writes to the same offset of an object.
If no concurrent writes happen (locks are used, etc) - then timestamp
mechanism can be even stayed unnoticed.