On Mon, Jan 27, 2020 at 2:58 PM Roman Penyaev <rpenyaev(a)suse.de> wrote:
RADOS snapshots are based on clones, right? The clone operation should
follow the sync path (through the single primary), but it can still be
a bit tricky (it requires the object to be identical on all replicas
at the moment the clone is done). It can be implemented by
communication between replicas, with the primary as the main
coordinator. Here is what I imagine (some sort of lazy data
migration)
Uh, they are clones, but not in the way you're thinking/hoping. Client
IO *can* (but does not need to) include a SnapSet, which contains data
about the snapshots that logically exist (but may or may not have been
seen by any particular object yet). When the client does a write with
a snapid the object doesn't already contain, the OSD does a clone
locally. And then any subsequent write applies to the new head, not
the old snapid.
So unfortunately neither the client nor the OSD has the slightest idea
whether a particular write operation requires any snapshot work until
it arrives at the OSD.
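To make the clone-on-write behavior described above concrete, here is a minimal sketch in Python. It is not Ceph code: `SnapInfo`, `ObjectState` and `apply_write` are hypothetical names, and the snapshot information is reduced to a single sequence number. The only point is that the decision to clone is made when the write reaches the object, by comparing the snapid carried by the write against what the object has already seen.

```python
from dataclasses import dataclass, field


@dataclass
class SnapInfo:
    """Hypothetical stand-in for the snapshot data a client may attach to a write."""
    seq: int = 0  # newest snapid the client knows about


@dataclass
class ObjectState:
    """Hypothetical per-object state held by one OSD."""
    head: bytes = b""
    seq: int = 0  # newest snapid this object has already seen
    clones: dict = field(default_factory=dict)  # snapid -> preserved contents


def apply_write(obj, snap, offset, data):
    """Apply a write to the head; clone first if the write carries a newer snapid.

    Returns True if a clone was made. Neither the caller nor the object
    knows the answer until the write is actually applied.
    """
    cloned = False
    if snap is not None and snap.seq > obj.seq:
        obj.clones[snap.seq] = obj.head  # preserve the pre-write contents
        obj.seq = snap.seq
        cloned = True
    buf = bytearray(obj.head)
    end = offset + len(data)
    if len(buf) < end:
        buf.extend(b"\0" * (end - len(buf)))  # a write may grow the object
    buf[offset:end] = data
    obj.head = bytes(buf)
    return cloned
```

In this sketch, a first write carrying `SnapInfo(seq=5)` against an object at `seq=0` triggers a clone, and a second write with the same `seq` does not; it only modifies the new head.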
<snip>
Everything which is not plain read/write is treated as management
or metadata requests.
(Among other things, any write op has the
potential to change the object’s size and that needs to be ordered
with truncate operations.)
Why do these two operations have to be ordered? No, let me ask it another
way. Why should distributed storage care about the order of these
two operations? That is what I do not understand. Why can't the client be
responsible for properly waiting for IO and only then issuing a truncate?
(A direct analogy: you issue an IO to your file and do not wait for
completion, then you truncate your file. What is the result?)
But if we are talking about the concurrent-clients case, where one
client issues a write while another issues a truncate, then
I do not understand how the sync log helps, because the primary
replica can receive these two requests in any order (we assume no
proper locking is used, right?)
If a client dispatches multiple operations on a single object, we
guarantee they are ordered in the same order they were dispatched by
the client. So he can do async truncate, async write, async read,
async append, whatever from a single thread and we promise to process
them in that order.
Some of that could probably be maintained in the client library rather
than on the OSDs, but not all of it given the timestamp-based retries
you describe and the problem of snapshots I mentioned above.
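The per-client, per-object ordering guarantee can be illustrated with a small sketch. This is not how the OSD or the messenger implements it; `OrderedObjectQueue` and the client-assigned sequence numbers are assumptions made for illustration. It only shows the contract: ops from one client on one object are released in dispatch order, even if they reach the processing point out of order (as they could when fanned out over independent paths).

```python
import heapq


class OrderedObjectQueue:
    """Release ops for one (client, object) pair strictly in dispatch order.

    Hypothetical helper: `seq` is a per-client dispatch counter starting at 0.
    """

    def __init__(self):
        self.next_seq = 0
        self.pending = []  # min-heap of (seq, op)

    def submit(self, seq, op):
        """Accept an op that may arrive out of order; return the ops now runnable."""
        heapq.heappush(self.pending, (seq, op))
        ready = []
        # Drain every op whose predecessors have all been released.
        while self.pending and self.pending[0][0] == self.next_seq:
            ready.append(heapq.heappop(self.pending)[1])
            self.next_seq += 1
        return ready
```

For example, if the write (seq 1) arrives before the truncate (seq 0), `submit(1, "write")` returns an empty list, and the later `submit(0, "truncate")` returns `["truncate", "write"]`.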
Basically what I'm getting at is that given the way the RADOS protocol
works, we really don't have any ops which are plain read/write.
Especially including some of the ones you care about most -- for
instance, RBD volumes without an object map include an IO hint on
every write, to make sure that any newly-created objects get set to
the proper 4MB size. These IO hints mean they're all compound
operations, not single-operation writes!
Of course, all these issues are not the result of changing our
durability guarantees, but of trying to provide client-side
replication...
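As a rough illustration of the "no plain writes" point, the sketch below models a request as a list of sub-ops. The names `Op`, `rbd_style_write` and `is_plain_write` are hypothetical and do not correspond to librados or librbd calls, and "4MB" is taken here to mean 4 * 1024 * 1024 bytes, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """One sub-operation inside a request (hypothetical op kinds)."""
    kind: str  # "alloc_hint" or "write"
    offset: int = 0
    data: bytes = b""
    expected_size: int = 0


def rbd_style_write(offset, data, object_size=4 * 1024 * 1024):
    """Build a write the way the mail describes for RBD without an object map:
    a size hint followed by the write itself, sent as one compound request."""
    return [
        Op("alloc_hint", expected_size=object_size),
        Op("write", offset=offset, data=data),
    ]


def is_plain_write(ops):
    """True only for a request consisting of exactly one write sub-op."""
    return len(ops) == 1 and ops[0].kind == "write"
```

Under this model, `is_plain_write(rbd_style_write(0, b"x"))` is false, so a fast path reserved for single-op writes would never see such a request.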
Now, making these changes isn’t necessarily bad if we want to develop
a faster but less ridiculously-consistent storage system to better
serve the needs of the interfaces that actually get deployed — I have
long found it a little weird that RADOS, a strictly-consistent
transactional object store, is one of the premier providers of virtual
block-device IO and S3 storage. But if that’s the goal, we should
embrace being not-RADOS and be willing to explicitly take much larger
departures from it than just “in the happy path we drop ordering and
fall back to backfill if there’s a problem”.
The current constraints are blockers for IO performance. It does not
matter how much we squeeze from the CPU (the crimson project): unless we
can relax IO ordering or reduce journaling effects, the overall
gains from saving CPU cycles may not be that impressive.
So I hope Ceph can take a step forward and be less conservative,
especially now that we have hardware that breaks all the possible
rules.
The second big point is that if you want to have a happy path and a
fallback ordered path, you’ll need to map out in a lot more detail how
those interact and how the clients and OSDs switch between them. Ideas
like this have come up before but almost every one (or literally every
one?) has had a fatal flaw that prevented it actually being safe.
Here I rely on the fact that replicas know the PG state (as they do right
now). If the PG is active and clean, the replica accepts IO. If not,
IO is rejected with a proper error: "dear client, go to the primary,
I'm not in the condition to serve your request, but primary can".
Several scenarios are possible here. The client may be the first one to
observe a replica in not a healthy state. We can expect that all other
replicas will observe the same unhealthy state soon, but the client
can also propagate this information to the other replicas in the PG
(needs to be discussed in detail).
"Not a healthy state" isn't really meaningful in Ceph — we make
decisions based on the settings of the newest OSDMap we have access
to, but peer OSDs might or might not match that state. When the
primary processes an OSDMap marking one of the peers down and he sets
it to degraded, there's a window where the peers haven't seen that
update. The clients will probably take even longer. And merely not
getting an op reply as fast as the client wants isn't indicative of
anything that RADOS cares about. Those states and the transitions
between them and the recovery logic for ops-in-flight are all very
hard to get right, and having the solutions mapped out in detail is a
requirement for merging any kind of change in RADOS.
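The window described above can be sketched as follows. This is a toy model, not Ceph's peering logic: `MapView`, `accepts_direct_io` and `stale_acceptors` are hypothetical names, and "healthy" is reduced to "every acting OSD is up in the map this OSD currently holds". It shows why a replica-side "accept IO if the PG looks clean" rule is unsafe by itself: a replica holding an older map still says yes after the primary has already seen the PG go degraded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapView:
    """The OSDMap as seen by one OSD (hypothetical, heavily reduced)."""
    epoch: int
    up_osds: frozenset


def accepts_direct_io(view, acting):
    """The proposed replica rule: accept client IO if all acting OSDs look up."""
    return set(acting) <= view.up_osds


def stale_acceptors(views, acting):
    """Return OSD ids that would still accept direct IO although the newest
    map held by any OSD already shows the PG as degraded."""
    newest = max(views.values(), key=lambda v: v.epoch)
    if accepts_direct_io(newest, acting):
        return []  # nobody has seen a failure yet
    return sorted(
        osd for osd, v in views.items()
        if v.epoch < newest.epoch and accepts_direct_io(v, acting)
    )
```

With an acting set of `[0, 1, 2]`, a primary at epoch 11 that sees OSD 2 down, and replicas still at epoch 10, `stale_acceptors` returns `[1, 2]`: both replicas would accept writes the primary no longer considers safe to fan out.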
<snip>
*Hybrid* client-side replication :) The client is responsible for
fanning out write requests only when the PG is healthy.
It is frequently undesirable since OSDs tend to have lower latency and
more bandwidth to their peers than the clients do to the OSDs;
Latency is the answer. I want to squeeze everything from RDMA. For
current Ceph, RDMA is dead. Basically, for the current implementation any
per-client improvements on the transport side bring nothing. (I spent
some time poking at protocol v1 and got a good speedup on the transport
side, which goes unnoticed in overall per-client IO performance. sigh)
Can you explain why client-side replication over RDMA is a better idea
than over ethernet IP? Like I said with math, I think in most cases it
is actually slower, and it DEFINITELY makes all the other kinds of
changes you want to make harder. I think you will be a lot happier if
you drop that.
(Also: we are doing a lot of work where read-from-replica will become
desirable for things like rack-local reads and not being able to do
that would be sad.)
-Greg