Attempt to rethink log-based replication in Ceph on fast IO path - Dev

23 Jan 2020

Hi Sage and all,

I would like to share some of my scattered thoughts regarding IO
performance improvements in Ceph. I touch on topics such as replicated
log model, atomicity and full journaling on the objectstore side and how
these constraints can be relaxed on IO path in order to squeeze maximum
from a hard drive, at the same time still respecting the strong data
consistency between replicas.

Would be great to have a chance to discuss this in Seoul.

Proposal in a few lines (for time saving)
-----------------------------------------

1. IO and management requests follow different paths:

    1. Slow path for management requests (object creation/deletion,
       locks, etc) involves replicated log and primary-copy model, as
       it is right now.

    2. Fast IO path (read, write requests) bypasses replicated log and
       makes clients responsible for sending data to each replica in a
       group (PG), I reference it as hybrid client-driven replication
       further in the text.

2. For the sake of simplicity client takes fast IO path only when
    group of replicas (PG) is in active and healthy state
    (active+client in terms of Ceph), otherwise client falls back to
    the slow path. Thus, client is not involved in the disaster
    recovery process (that's why hybrid).

3. Fast IO path (is taken only if group of replicas (PG) is
    in active and healthy state):

    1. Reads of objects happen from different primaries in a group
       (PG), in order to spread the read load.

    2. Writes are distributed by clients themselves to all replicas in
       a group (PG), avoiding one extra network hop.

4. Objectstore becomes journal-less and behaves similar to RAID 1,
    sacrificing atomicity and erasure coding (due to the "write hole"
    problem) for write performance, which becomes close to the write
    performance of a bare hard drive, but still keeping strong data
    consistency between replicas.

!!!!! CAUTION: many letters !!!!!

Introduction
------------

The most impressive IO performance numbers among products which
provide data redundancy is certainly delivered by RAID and that is not
a surprise: you just write a block to several disks in parallel,
nothing more. Perfect.

Then why Ceph can't achieve the same performance as RAID 1 on IO path,
doing plain old write-in-parallel thing, but supplemented with
scalability, CRUSH object placement and all other beloved things which
make Ceph so powerful?

The answer to this question lies in the replication design used by
Ceph, namely log-based replication (PG log in Ceph terms), which is
commonly used in storage and database systems and well described in
many publications [1, 2, 3, 4]). Sequential replicated log indeed
resolves issues related to atomicity, consistency, isolation and
durability (this famous ACID buzzword acronym), especially in DBMS
world where transactions are involved. But Ceph is a storage, not a
database.

Modern hard drives do not provide atomicity, strict writes ordering or
durability. Hardware breaks all the rules and relax the constraints
for the sake of performance, and software has to deal with it. Every
modern filesystem can survive crashes (journals, COW, soft-updates,
etc approaches are used), each application which cares about data
consistency and can't rely on weak guarantees of POSIX file API (any
DBMS for example) uses same methods (journals, COW, etc) to avoid
inconsistency. Why then a storage solution, which works below a
filesystem and on top of a hardware, should provide such strong ACID
properties for IO?

So what is a log-based replication which is a core of Ceph? Very
simple: duplication (mirroring) of a log which consist of records
describing storage operations. In order to duplicate something we have
to be sure that an object which we duplicate has not been changed on
the way, i.e. actually copy-paste does what it is supposed to do
without nasty surprises. When we have a log of storage operations we
need not only to deliver a recent record to a replica without
corruption (and wrong order of operations in a log is a corruption)
but also invoke this operation, which has been described in a log
record. Not only that: new record in a log and operation, which is
described by that record, should happen in one transaction,
atomically, thus either we see a record and operation was performed,
either no updates in a log and no operation consequences is visible.

Having such a wonderful log (which obviously can survive a crash or
even a nuclear strike) we can very easily answer The Most Important
question for a high scalable, self-healing storage product: what has
happened on other replicas while one of them took a nap (crashed). In
other words we need such a complicated mechanism only for one thing:
just to tell the exact data difference (e.g. in blocks of data and
operations), what has been changed since a crash.

Short quiz: can Сeph stay Ceph, but without any log-based replication?
Yes, throw it away, no need to know the difference, just copy the
whole bunch of data between replicas on any disaster recovery. Very
slow on resyncing but fast on doing operations and simple. I'm, of
course, exaggerating, but that is not so far from the truth.

Talking about log-based replication I've never mentioned why it
actually impacts the IO performance. Here is why:

1. Each update operation on a group of replicas (PG in terms of Ceph)
has to be strictly ordered (log should stay in the same order on all
replicas) i.e. no any other operation is allowed till an operation,
which has been just started, reaches a hard drive with all cache
flushes involved (who poked the code knows the notorious pglock, which
protects the replication log from corruption and makes it be equal on
all other replicas).

2. Severe restrictions on how actually the objectstore should be
implemented: data and metadata should be fully journaled, because a
log record of a replicated operation is stored along with the data in
atomic manner. And what is wrong with journal in terms of performance?
Nothing is wrong, it is just incredibly slow. (As a hint for the
curious: enable "data=journal" option for ext4 filesystem, or
"allow_cow" for xfs or just run any load which performs random writes
on BTRFS (cow is always enabled), and yes, I promise, you feel the
difference immediately).

3. Communication with a group of replicas (PG) happens only through a
primary one, where a primary is responsible for keeping log in sync
with other replicas. This primary-copy model [7] increases IO latency
adding one extra network hop.

All of the above about log-based replication in Ceph is about strong
consistency, but not about performance. But can we have both? And what
is actually a strong consistency? According to formal description "The
protocol is said to support strong consistency if all accesses are
seen by all parallel processes (or nodes, processors, etc.) in the
same order (sequentially)" [6]. In simple words: one thread produces
data to the same offset of a file, other  threads read from the same
offset, thus if strong consistency is respected readers never observe
writes in a wrong order, e.g.:

   Writer             Reader 1          Reader 2               Reader 3
   --------------     --------------    --------------    ---------------
   write(file, "A");
                      b0 = read(file);  b0 = read(file);
                                        b1 = read(file);
   write(file, "B");
                                                          b0 = 
read(file);
                      b1 = read(file);                    b1 = 
read(file);

In this example, readers are able to see a block of a file in the
following states:

   Reader 1     Reader 2      Reader 3
   ---------    ----------    ----------
   b0 == "A"    b0 == "A"     b0 == "B"
   b1 == "B"    b1 == "A"     b1 == "B"

But what should never be observed and can be treated as a corruption
is the following:

   b0 == "B"
   b1 == "A"

Having that we can develop a simple rule: writes are not blocked and
can come at any order, once a reader observes a change in a block,
further reads to this block see the change. Most likely I describe
here a wheel, which has some vivid name and was deeply buried in some
publication in early 70th, but I do not care and call the rule as READ
RECENT. The READ RECENT rule contains one splendid property: if read
has never happened and order of writes is unknown, writes can be
reordered regardless of their actual order. Practical example: there
is a distributed storage where writes are unordered by nature; a full
outage has happened and now replicas contain different blocks of data
on the same offsets; read of the block has never happened after it was
updated, thus we are able to bring replicas in sync choosing block
from *any* replica. Marvelous. I will return to this highly important
property when I start describing a new replication model.

It turns out that strong consistency is not about strict requests
ordering in a whole group of replicas (PG) it is only about order of
reads against writes to the same block of data. That's it. So the
question is left unanswered: can we have both strong consistency and
high performance of IO?

In order to answer this question, it is necessary to once again state
the main requirements for the IO replication model, which remove the
strict restrictions imposed by the log-based replication, positively
affect performance and do not contradict strong consistency:

1. Get rid of extra hop on fast IO path by doing client-driven
replication, at least when group of replicas (PG) is in active state
and not doing any sort of disaster recovery.

2. Write requests are not strictly ordered to different offsets of an
object and can be submitted to a hard drive in parallel.

3. Get rid of any journaling on fast IO path on objectstore side if
possible.

These are major requirements of constraint relaxation which can bring
storage IO performance close to a bare hard drive bandwidths keeping
strong consistency. In the next chapter I will cover all the
requirements and will show how it can be implemented in practice.

Separation of management requests from IO requests
--------------------------------------------------

Requests in a storage cluster can be divided into two parts:
management requests, which are object creation/deletion/listing,
attributes setting/getting, lock management, etc; and IO requests,
which do read or modify content of an object. The most prevailing
requests are IO read and write requests, whose latency affect the
overall performance of the entire cluster. The proposed requests
separation has also a clear separation on metadata and data, which
should be treated differently on the objectstore side, namely no
journal should be involved in data modification (this will be
discussed in detail below).

Separating IO from management requests makes it possible to have a
different primary replicas in a group of replicas (PG). For example in
order to distribute a read load, primary can be chosen by an object id
using persistent hashing (e.g. hash ring as proposed here [8]) instead
of having a single primary. Write requests require replication
involved and as was mentioned earlier client-driven replication model
can be chosen in order to minimize latencies of an extra hop.

Having different paths for several types of requests can be summarised
as the following:

1. Management requests go through a persistent primary replica and
log-based replication keeps strict ordering of operations in the
entire group (PG). There are no changes to original Ceph architecture
are proposed.

2. Read object requests are sent to different primaries according to
some persistent hashing rule [8].

3. Write requests can follow two IO paths, where on one of the paths
client is fully responsible for request replication and delivers
requests to all replicas in a group (PG). This path is always taken
when group of replicas (PG) is in active and healthy state. The second
IO path always lies through a primary, thus all requests are equally
ordered (like it is right now implemented in Ceph). This path is taken
in order to simplify all corner cases when a group of replicas (PG) is
not in an active state and performs resynchronization process.

Hybrid client-driven replication
--------------------------------

Client-driven replication is quite self-descriptive: client is
responsible for data delivery to replicas in a group (PG) avoiding one
extra network hop, which exists in a primary-copy model. If client
communicates with replicas in a group (PG) directly there is no longer
the central point of synchronization, thus the one possible issue that
needs to be considered is the possibility of write requests from
several clients to come to replicas in different order:

   Client 1                Client 2
   --------                --------
   sends A to replica1     sends B to replica1
   sends A to replica2     sends B to replica2

   Replica 1        Replica 2
   ---------        ---------
   A                B
   B                A

   ** replica1 writes B to the disk and then overwrites B with A
      replica2 writes A to the disk and then overwrites A with B

In this example both clients access the same block of the same object,
but write requests become reordered differently because of the
network. If this happened, then replicas can contain different data
and they are out of sync. This issue has very bad consequences: future
reads are undefined, especially if one of replicas crashed and clients
start reading from another. This is definitely a data corruption. In
order to solve the issue the order of writes to the same offset should
be synchronized between replicas. Synchronization can be performed by
marking each request with a timestamp, taking into consideration, that
time flows equally on clients. Sub-microsecond time synchronization is
not a problem [9], especially when there is the central point for all
members - distributed state machine cluster, namely cluster of
monitors in terms of Ceph.

Having time synchronized, client marks write requests with a
timestamp. According to the Thomas Write Rule [10] outdated requests
are discarded by timestamp-based concurrency control keeping the
correct order. However, in the case of the client-driven replication
it is impossible to discard a change on all replicas
simultaneously. Instead outdated request should be repeated for all
replicas in a group (PG), here is an example:

   Client 1                  Client 2
   --------                  --------
   marks A with stamp 1      marks B with stamp 2
   sends A to replica1       sends B to replica1
   sends A to replica2       sends B to replica2

   Replica 1        Replica 2
   ---------        --------------
                    -- on disk --
   A                B
   -- on disk --
   B                A

   ** replica1 writes B to the disk and rejects A with RETRY
      replica2 writes A to the disk and then overwrites A with B

Request A is not written by replica1, because it is older then B
request, which has been applied recently. Instead replica1 replied on
A request with the RETRY error, which forces the client to mark
request A with a new timestamp and resend it again to the whole group
of replicas (PG):

   Client 1
   --------
   marks A with stamp 3
   sends A to replica1
   sends A to replica2

   Replica 1        Replica 2
   ---------        ---------
   A                A
                    -- on disk --
                    B
   -- on disk --
   B                A

   ** replica1 overwrites B with A
      replica2 overwrites B with A

The next question rises: why do we need to repeat write operation on
all replicas, but not only on those who have just replied with the
error? Let's consider overlapping data (here I do not consider the
reason why clients send overlapping write requests concurrently,
obviously this is a bug on a client side, but nevertheless we keep
replicas in sync):

   Client 1 marks AAAA with stamp 1
   Client 2 marks BBB with stamp 2
   Client 3 marks C with stamp 3

   Replica 1        Replica 2
   ---------        ---------
     C     3
    BBB    2          C     3
   AAAAA   1        AAAAA   1
   ---------        ---------   below the line is the state on disk
   ABCBA            AACAA

   ** replica1 writes AAAA, then overwrites the middle with BBB and
      then with C

      replica2 writes AAAA and overwrites the middle with C;
      BBB request is rejected

Client2 receives the RETRY error and has to repeat BBB:

   Client 2 marks BBB with stamp 4

   Replica 1        Replica 2
   ---------        ---------
    BBB   4          BBB     4
   -- on disk --
     C    3
                    -- on disk --
    BBB   2           C      3
   AAAAA  1         AAAAA    1
   --------------   --------------   below the line is the state on disk
   ABBBA            ABBBA

On a second retry blocks on replicas become synchronized and write
request is considered as completed and persistent on a drive. Since
concurrent writes should never happen on well written client software
(I expect distributed filesystems use lock primitives in order to
prevent concurrent access) I do not expect any performance degradation
because of frequent write retries, so current algorithm acts as a
protection which keeps replicas in sync.

I would like to emphasize several points obtained from the description
of the algorithm described above:

1. Time synchronization on clients does not need to be very accurate,
even hundreds of millisecond accuracy is enough. Described time-based
algorithm does not depend on actual physical time, also it does not
depend on real (absolute) order of writes sent by concurrent
clients. Instead algorithm forces distributed cluster members to have
a single view on requests order, which is enough to decide should
request be rejected or executed.

2. Hybrid client-driven replication does not involve clients into data
recovery process, thus leaving this job to replicas (exactly as it is
right now). Here I describe the fast IO path only when group of
replicas (PG) is in active and fully synchronized state. When group of
replicas (PG) is in resynchronization state or client sends management
type of requests (not IO), then for the sake of simplicity all
communication goes through the main primary replica, as it is right
now implemented in Ceph.

Crash of a client in the middle of a replication
------------------------------------------------

Client-driven replication in contrast with a primary-copy model has
another major issue which has to be considered in detail: crash of a
client in the middle of a replication, that is, when client sends
write request to one of replicas in a group (PG) and then crashes,
leaving other replicas out of sync. Since there is no central point
which controls the whole replication process, group of replicas have
no information, was the write request confirmed by a whole group or
something has happened to a client and special action should be
performed.

The issue can be solved by immediate confirmation of a successfully
completed write request sent by non-primary replicas to a primary
one. As was mentioned earlier, each object has its own primary which
serves reads, thus write confirmations of a modified object are sent
to a particular primary, which accounts number of confirmations for
the modified object. If the confirmation did not come from all
replicas in a group, then after the timeout elapsed a primary can
start synchronizing an object on its own.

In order to be sure that confirmations are accounted correctly, each
of them is marked with a timestamp taken from a source write request,
thus reordered and outdated confirmations can be discarded.

I would like to stress the point, that accounting of confirmations
happens in RAM only at runtime and does not involve any disk
operations. If one of replicas crashes after a client crash then the
whole group of replicas (PG) changes its state, further write requests
take another path and forwarded to the main primary in a group and
objects synchronization would be started anyway.

The READ RECENT rule in action
------------------------------

In client-driven replication the order in which replicas receive
requests and update blocks is out of control, thus readers can receive
inconsistent data (also called weak consistency model [8]). To ensure
that a once read block will never return to its previous value (READ
RECENT rule mentioned earlier) all attempts to read non consistent
(not confirmed from all replicas) block has to be delayed or rejected
with an explicit error. Consider this scenario:

1. Client1 fans out write requests to all replicas.

2. Write request reaches only the primary one, client1 crashes.

3. Concurrent client2 reads the same block, since primary replica has
applied write request from client1, client2 reads the requested block
successfully and observes the change made by client1.

4. Primary replica crashes and new primary is selected, resync process
in a group (PG) is started.

5. Concurrent client2 reads again the same block from a new primary,
but since no one else observes the change written to previous primary
made by client1, client2 receives old data.

This scenario breaks the READ RECENT rule and introduces data
corruption. In order to guarantee that block is consistent, read from
line 3. should be delayed or rejected till all replicas confirm, that
block is persistent. This can be achieved by the same confirmation
mechanism described earlier: block is treated as inconsistent if it
was modified and not all  confirmations are received.

I do not expect any performance degradation for generic read loads and
can quote Sage here: “Reads in shared read/write workloads are usually
unaffected by write operations. However, reads of uncommitted data can
be delayed until the update commits. This increases read latency in
certain cases, but maintains a fully consistent behavior for
concurrent read and write operations in the event that all OSDs in the
placement group simultaneously fail and the write ack is not
delivered” [7].

Another issue, which has to be investigated in detail, is a full
outage of a group of replicas (PG). When replicas are come back some
of the objects can be out of sync, and without versioned log there is
no any possibility to distinguish what blocks were updated recently
and thus have a higher timestamp. As was mentioned earlier READ RECENT
rule has one property: if block has not been read since the
modification, then it is irrelevant what version of a block can be
taken as a master copy. Thus, if read of an inconsistent block is
delayed till it is persistent on all replicas, any replica in a group
may be a “donor” of this particular block.

 From the statement above it follows that non-atomic or partial writes
can happen. Indeed, from a filesystem or an application which acts as
a cluster client, this requires special handling of a write failure,
such as log replay, for dealing with inconsistencies left behind. But
it will assume that each block on all replicas in a group (PG) has
just one value, so in this way group remains synchronized and future
reads won’t report different content in case of a change of the
primary replica which serves reads.

Journal-less objectstore for IO path
------------------------------------

Write atomicity requires certain support from an underlying hardware
or software. Different techniques like journals, COW or soft-updates
can be used in order to guarantee atomicity, but for no doubts this
feature comes with a high price of the IO performance
degradation. Since each modern filesystem or a DBMS application itself
can take care of data inconsistency, atomicity constraint for the
objectstore can be relaxed.

In order to provide performance close to a bare hard drive, each
replica has to submit writes to an underlying hardware immediately and
without any requirements for request ordering or data atomicity. As
was mentioned earlier, the reading delay mechanism of non-persistent
blocks makes synchronization of objects after a complete shutdown
possible without knowing where is the recent update. However, full
synchronization can take days if an object is big enough. The problem
of long resync can be solved by bookkeeping a bitmap of a
possibly-out-of-sync blocks of an object. The algorithm can be
described in just one sentence: “Bits are set before a write
commences, and are cleared when there have been no writes for a
while.” [11].

In highly distributed storage solution where a group of replicas (PG)
can consist of different replicas during the whole life of a cluster
and members in a group constantly changing, it is important not only
to track out-of-sync blocks of an object, but also to keep versions of
such changes, so that to answer the following question with minimal
computational cost: what has changed in the object between versions N
and M?

Having versions of block changes in mind, one important modification
to bitmap algorithm can be proposed: bitmap of out-of-sync blocks is
stored in a file with a timestamp of a first write request in the
name; after a certain number of changed blocks each replica does a
rotation of the bitmap file, so that new bitmap file is created with a
timestamp of a first write request which initiated the bitmap rotation
process. The major difference to what RAID 1 does is, that bits are
never cleared, but new bitmap file is created instead. Bits of
modified but not yet persistent blocks (not all confirmations are
received from replicas) migrate to a new bitmap file, in order to
guarantee that block synchronization will take place in the future in
case of possible replica failure.

As timestamps of client write requests always change forward and the
time on clients is synchronized (major requirement of proposed hybrid
client-driven replication model), each replica will have similar view
of block changes, so the question about data difference between the
two versions can be easily answered.

Proposed rotation of bitmap files can occupy extra disk space on a
replica, especially when new files are created and never
deleted. Deletion of bitmap files of old versions is not a problem: N
files can be kept, rotation process spawns new bitmap file with a
recent timestamp, meanwhile a file with the oldest timestamp is
removed. If for some reason there will be a need to restore changes of
an object version, which bitmap file was removed, then full object
resynchronization should be performed.

Summary
-------

Proposed changes should relax a lot of constraints on fast IO path
in current Ceph implementation keeping strong data consistency
between replicas and bring performance of a single client to almost
bare hard drive bandwidths.

Even if there are logical holes in the model described above (there
are), I still believe that eliminating them will not be impossible.

PS. And yes, because of the “write hole” problem [12] erasure coding
replication won’t survive full group outage without atomicity
guarantees on the objectstore side. Sorry for that.

--
Roman

[1] R. Golding, Weak-consistency group communication and membership,
PhD thesis, University of California, Santa Cruz, 1992.
[2] Petersen et al., “Flexible Update Propagation for weakly
consistent replication”, Proc. of the 16th ACM Symposium on Operating
Systems Principles (SOSP), 1997.
[3] M. Rabinovich. N. Gehani. A. Kononov, “Scalable Update Propagation
in Epidemic Replicated Databases”,Advances in Database Technology -
EDBT'96, Lecture Notes in Computer Science Vol. 1057, Springer,
pp. 207-222.
[4] G. Wuu, A Bernstein. „Efficient Solutions to the Replicated Log
and Dictionary Problems”, Proceedings of the third ACM Symposium on
Principles of Distributed Computing, August 1984, pp. 233-242.
[6] https://en.wikipedia.org/wiki/Strong_consistency
[7] Sage A. Weil. Ceph: Reliable, Scalable, And High-Performance
Distributed Storage, PhD thesis, University of California, Santa Cruz,
2007.
[8] Jiayuan Zhang, Yongwei Wu†, Yeh-Ching Chung. PROAR: A Weak
Consistency Model For Ceph
[9] Precision Time Protocol (PTP/IEEE-1588)
[10] R. H. Thomas, “A majority consensus approach to concurrency control
for multiple copy databases,” ACM Trans. Database Syst., vol. 4, no. 2,
pp. 180–209, 1979.
[11] Cluster support for MD/RAID 1, https://lwn.net/Articles/674085/
[12] https://en.wikipedia.org/wiki/RAID