Re: Attempt to rethink log-based replication in Ceph on fast IO path

3 Feb 2020

On Mon, Jan 27, 2020 at 2:58 PM Roman Penyaev <rpenyaev(a)suse.de> wrote:
...
  rados snapshots are based on clone, right? The clone operation should
 follow the sync path (through the single primary) but can still be
 a bit tricky (it requires the object to be identical on all replicas
 at the moment the clone is done) and can be implemented by
 communication between replicas, with the primary as the main
 coordinator.  Here is what I imagine (some sort of lazy data
 migration)

Uh, they are clones, but not in the way you're thinking/hoping. Client
IO *can* (but does not need to) include a SnapSet, which contains data
about the snapshots that logically exist (but may or may not have been
seen by any particular object yet). When the client does a write with
a snapid the object doesn't already contain, the OSD does a clone
locally. And then any subsequent write applies to the new head, not
the old snapid.
So unfortunately neither the client nor the OSD has the slightest idea
whether a particular write operation requires any snapshot work until
it arrives at the OSD.

<snip>

...

 Everything which is not plain read/write is treated as management
 or metadata requests.

  (Among other things, any write op has the
 potential to change the object’s size and that needs to be ordered
 with truncate operations.) 
 Why do these two operations have to be ordered? No, I will ask another
 way.  Why should distributed storage care about the order of these
 two operations? That is what I do not understand.  Why can't the client
 be responsible for waiting for IO to complete and only then issuing a
 truncate? (direct analogy: you issue an IO to your file and do not wait
 for a completion, then you truncate your file, what is the result?)

 But if we are talking about the concurrent clients case, when one of the
 clients issues a write while another one issues a truncate, then
 I do not understand how the sync log helps, because the primary
 replica can receive these two requests in any order (we assume no
 proper locking is used, right?)

If a client dispatches multiple operations on a single object, we
guarantee they are ordered in the same order they were dispatched by
the client. So he can do async truncate, async write, async read,
async append, whatever from a single thread and we promise to process
them in that order.
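
As a small illustration of why that dispatch order matters, the sketch below applies the same three operations to a toy in-memory object in two different orders. `apply_ops` and the tuple encoding of operations are hypothetical, chosen only for this example, and are not a RADOS API.

```python
def apply_ops(ops, initial=b""):
    """Apply toy object operations in the given order and return the contents."""
    data = bytearray(initial)
    for op in ops:
        kind = op[0]
        if kind == "write":
            _, offset, payload = op
            if len(data) < offset:
                # Zero-fill any hole between the current end and the offset.
                data.extend(b"\0" * (offset - len(data)))
            data[offset:offset + len(payload)] = payload
        elif kind == "append":
            data.extend(op[1])
        elif kind == "truncate":
            size = op[1]
            if size < len(data):
                del data[size:]
            else:
                data.extend(b"\0" * (size - len(data)))
        else:
            raise ValueError(f"unknown op: {kind}")
    return bytes(data)


# Order in which a single client dispatched the ops.
dispatched = [("truncate", 0), ("write", 0, b"hello"), ("append", b"!")]
# The same ops, as they might be applied without an ordering guarantee.
reordered = [dispatched[1], dispatched[2], dispatched[0]]

in_order = apply_ops(dispatched, initial=b"old contents")
out_of_order = apply_ops(reordered, initial=b"old contents")
```

Here `in_order` ends up as the new contents while `out_of_order` leaves an empty object, because the truncate lands last.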

Some of that could probably be maintained in the client library rather
than on the OSDs, but not all of it given the timestamp-based retries
you describe and the problem of snapshots I mentioned above.

Basically what I'm getting at is that given the way the RADOS protocol
works, we really don't have any ops which are plain read/write.
Especially including some of the ones you care about most -- for
instance, RBD volumes without an object map include an IO hint on
every write, to make sure that any newly-created objects get set to
the proper 4MB size. These IO hints mean they're all compound
operations, not single-operation writes!

Of course, all these issues are not the result of changing our
durability guarantees, but of trying to provide client-side
replication...

...

  Now, making these changes isn’t necessarily bad if we want to develop
 a faster but less ridiculously-consistent storage system to better
 serve the needs of the interfaces that actually get deployed — I have
 long found it a little weird that RADOS, a strictly-consistent
 transactional object store, is one of the premier providers of virtual
 block-device IO and S3 storage. But if that’s the goal, we should
 embrace being not-RADOS and be willing to explicitly take much larger
 departures from it than just “in the happy path we drop ordering and
 fall back to backfill if there’s a problem”. 
 Current constraints are blockers for IO performance.  It does not
 matter how much we squeeze from the CPU (crimson project): unless we
 can relax IO ordering or reduce journaling effects, the overall
 gain from the CPU cycle improvements may not be that impressive.

 So I hope Ceph can make a step forward and be less conservative,
 especially when we have hardware that breaks all the possible
 rules.

  The second big point is that if you want to have a happy path and a
 fallback ordered path, you’ll need to map out in a lot more detail how
 those interact and how the clients and OSDs switch between them. Ideas
 like this have come up before but almost every one (or literally every
 one?) has had a fatal flaw that prevented it actually being safe. 
 Here I rely on the fact that replicas know the PG state (as it is right
 now).  If the PG is active and clean then the replica accepts IO.  If
 not, IO is rejected with the proper error: "dear client, go to the
 primary, I'm not in the condition to serve your request, but primary can".

 Here several scenarios are possible. The client may be the first one to
 observe a replica in not a healthy state.  We can expect all other
 replicas will observe the same not healthy state soon, but the client
 can propagate this information to the other replicas in the PG (needs
 to be discussed in detail).

"Not a healthy state" isn't really meaningful in Ceph — we make
decisions based on the settings of the newest OSDMap we have access
to, but peer OSDs might or might not match that state. When the
primary processes an OSDMap marking one of the peers down and he sets
it to degraded, there's a window where the peers haven't seen that
update. The clients will probably take even longer. And merely not
getting an op reply as fast as the client wants isn't indicative of
anything that RADOS cares about. Those states and the transitions
between them and the recovery logic for ops-in-flight are all very
hard to get right, and having the solutions mapped out in detail is a
requirement for merging any kind of change in RADOS.
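
A toy sketch of that window, assuming a drastically simplified map: `ToyOSDMap` and `pg_view` are hypothetical names, and real PG state depends on much more than the set of up OSDs. The point is only that each party derives its view from whichever epoch it currently holds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyOSDMap:
    """Hypothetical, heavily simplified stand-in for an OSDMap."""
    epoch: int
    up_osds: frozenset


def pg_view(osdmap, acting):
    """Return how a PG with this acting set looks under the given map."""
    missing = [osd for osd in acting if osd not in osdmap.up_osds]
    return "degraded" if missing else "active+clean"


acting = (0, 1, 2)
epoch_10 = ToyOSDMap(epoch=10, up_osds=frozenset({0, 1, 2}))
epoch_11 = ToyOSDMap(epoch=11, up_osds=frozenset({0, 1}))  # osd.2 marked down

# The primary has processed epoch 11; the replica and client still hold 10.
views = {
    "primary osd.0": pg_view(epoch_11, acting),
    "replica osd.1": pg_view(epoch_10, acting),
    "client": pg_view(epoch_10, acting),
}
```

During that window the replica in this sketch would still consider the PG clean and accept IO that the primary already treats as going to a degraded PG.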

<snip>

...
  *Hybrid* client-side replication :) When client is responsible for
 fanning out write requests only in case of healthy pg.

  It is frequently undesirable
 since OSDs tend to have lower latency and more bandwidth to their
 peers than the clients do to the OSDs; 
 Latency is the answer.  I want to squeeze everything from RDMA.  For
 current Ceph, RDMA is dead.  Basically, for the current implementation
 any per-client improvements on the transport side bring nothing. (I
 spent some time poking the protocol v1 and had a good speed up on the
 transport side, which goes unnoticed in the overall per-client IO
 performance. sigh)

Can you explain why client-side replication over RDMA is a better idea
than over ethernet IP? Like I said with math, I think in most cases it
is actually slower, and it DEFINITELY makes all the other kinds of
changes you want to make harder. I think you will be a lot happier if
you drop that.

(Also: we are doing a lot of work where read-from-replica will become
desirable for things like rack-local reads and not being able to do
that would be sad.)
-Greg
