For developers submitting jobs using teuthology, we now have
recommendations on what priority level to use:
https://docs.ceph.com/docs/master/dev/developer_guide/#testing-priority
--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
Hi,
We're happy to announce that a couple of weeks ago we submitted a few GitHub pull requests[1][2][3] adding initial Windows support. A big thank you to the people who have already reviewed the patches.
To bring some context about the scope and current status of our work: we're mostly targeting the client side, allowing Windows hosts to consume rados, rbd and cephfs resources.
We have Windows binaries capable of writing to rados pools[4]. We're using mingw to build the Ceph components, mainly because it requires the fewest changes to cross-compile Ceph for Windows. However, we're soon going to switch to MSVC/Clang due to mingw limitations and long-standing bugs[5][6]. Porting the unit tests is also something that we're currently working on.
The next step will be implementing a virtual miniport driver so that RBD volumes can be exposed to Windows hosts and Hyper-V guests. We're hoping to leverage librbd as much as possible as part of a daemon that will communicate with the driver. We're also aiming at cephfs and considering using Dokan, which is FUSE compatible.
Merging the open PRs would allow us to move forward, focusing on the drivers and avoiding rebase issues. Any help on that is greatly appreciated.
Last but not least, I'd like to thank SUSE, which is sponsoring this effort!
Lucian Petrut
Cloudbase Solutions
[1] https://github.com/ceph/ceph/pull/31981
[2] https://github.com/ceph/ceph/pull/32027
[3] https://github.com/ceph/rocksdb/pull/42
[4] http://paste.openstack.org/raw/787534/
[5] https://sourceforge.net/p/mingw-w64/bugs/816/
[6] https://sourceforge.net/p/mingw-w64/bugs/527/
Hi all. The OPA integration in Ceph currently has no support for bucket
policy.
When a user sets a bucket policy on their bucket, the OPA server is not
told who has been granted access to that bucket, so a later request from
a user (one who was granted access via the bucket policy) to access that
bucket (PUT, GET, ...) is rejected by OPA because it has no data about
the policy in its database.
I have created a pull request for this problem: when a user creates a
bucket policy for their bucket, the policy data is sent to the OPA
server so that its database can be updated.
I think the main idea of having OPA is to keep all authorization in OPA,
with Ceph not authorizing any request by itself.
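For illustration only (hypothetical names and data path, not necessarily what the PR does), the idea is roughly to push the parsed policy to OPA's data API whenever a bucket policy is set, so that later authorization queries can consult it:

    # Hypothetical sketch: push a bucket policy document to OPA's data API.
    # The data path, URL and payload shape are illustrative, not the ones
    # used in the PR.
    import json
    import requests

    OPA_URL = "http://opa.example:8181/v1/data/ceph/rgw/bucket_policies"

    def push_bucket_policy(bucket, policy_document):
        # OPA stores documents under /v1/data/<path>; here we keep one
        # document per bucket.
        resp = requests.put("%s/%s" % (OPA_URL, bucket),
                            data=json.dumps(policy_document),
                            headers={"Content-Type": "application/json"})
        resp.raise_for_status()

    push_bucket_policy("mybucket", {
        "Statement": [{"Effect": "Allow",
                       "Principal": {"AWS": ["arn:aws:iam:::user/alice"]},
                       "Action": ["s3:GetObject", "s3:PutObject"],
                       "Resource": ["arn:aws:s3:::mybucket/*"]}]
    })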
Here is the pull request; I would be thankful to hear your comments.
https://github.com/ceph/ceph/pull/32294
Thanks.
Hello all,
I don't know how many of you folks are aware, but early last year,
Datto (full disclosure, my current employer, though I'm sending this
email pretty much on my own) released a tool called "zfs2ceph" as an
open source project[1]. This project was the result of a week-long
internal hackathon (SUSE folks may be familiar with this concept from
their own "HackWeek" program[2]) that Datto held internally in
December 2018. I was a member of that team, helping with research,
setting up infra, and making demos for it.
Anyway, I'm bringing it up here because I'd had some conversations
with folks individually who suggested that I raise it on the mailing
list and talk about some of the motivations and what I'd like to see
from Ceph on this in the future.
The main motivation here was to provide a seamless mechanism to
transfer ZFS based datasets with the full chain of historical
snapshots onto Ceph storage with as much fidelity as possible to allow
a storage migration without requiring 2x-4x system resources. Datto is
in the disaster recovery business, so working backups with full
history are extremely valuable to Datto, its partners, and their
customers. That's why the traditional path of just syncing the current
state and letting the old stuff die off is not workable. At the scale
of having literally thousands of servers with each server having
hundreds of terabytes of ZFS storage (making up in aggregate to
hundreds of petabytes of data), there's no feasible way to consider
alternative storage options without having a way to transfer datasets
from ZFS to Ceph so that we can cut over servers to being Ceph nodes
with minimal downtime and near zero new server purchasing requirements
(there's obviously a little bit of extra hardware needed to "seed" a
Ceph cluster, but that's fine).
The current zfs2ceph implementation handles zvol sends and transforms
them into rbd v1 import streams. I no longer recall exactly why we
didn't use v2, but I think there were some gaps that made it unusable
for our case back then (we were using Ceph Luminous). I'm unsure
whether this has improved since, though it wouldn't surprise me if it
has. However, zvols aren't enough for us. Most of our ZFS datasets are
in the ZFS filesystem form, not the ZVol block device form.
Unfortunately, there is no import equivalent for CephFS, which blocked
an implementation of this capability[3]. I had filed a request about
it on the issue tracker, but it was rejected on the basis that
something was already being worked on[4]. However, I haven't seen
anything exactly like what I need land in CephFS yet.
The code is pretty simple, and I think it would be easy enough for it
to be incorporated into Ceph itself. However, there's a greater
question here. Is there interest from the Ceph developer community in
developing and supporting strategies to migrate from legacy data
stores to Ceph with as much fidelity as reasonably possible?
Personally, I hope so. My hope is that this post generates some
interesting conversation about how to make this a better supported
capability within Ceph for block and filesystem data. :)
Best regards,
Neal
[1]: https://github.com/datto/zfs2ceph
[2]: https://hackweek.suse.com/
[3]: https://github.com/datto/zfs2ceph/issues/1
[4]: https://tracker.ceph.com/issues/40390
--
真実はいつも一つ!/ Always, there's only one truth!
Hi Sage and all,
I would like to share some of my scattered thoughts regarding IO
performance improvements in Ceph. I touch on topics such as the
replicated log model, atomicity and full journaling on the objectstore
side, and how these constraints can be relaxed on the IO path in order
to squeeze the maximum out of a hard drive while still respecting
strong data consistency between replicas.
It would be great to have a chance to discuss this in Seoul.
Proposal in a few lines (for time saving)
-----------------------------------------
1. IO and management requests follow different paths:
   1. Slow path for management requests (object creation/deletion,
      locks, etc) involves the replicated log and the primary-copy
      model, as it is right now.
   2. Fast IO path (read, write requests) bypasses the replicated log
      and makes clients responsible for sending data to each replica
      in a group (PG); I refer to this as hybrid client-driven
      replication further in the text.
2. For the sake of simplicity the client takes the fast IO path only
   when the group of replicas (PG) is in an active and healthy state
   (active+clean in terms of Ceph), otherwise the client falls back to
   the slow path. Thus, the client is not involved in the disaster
   recovery process (that's why "hybrid").
3. Fast IO path (taken only if the group of replicas (PG) is in an
   active and healthy state):
   1. Reads of objects happen from different primaries in a group
      (PG), in order to spread the read load.
   2. Writes are distributed by the clients themselves to all replicas
      in a group (PG), avoiding one extra network hop.
4. The objectstore becomes journal-less and behaves similarly to
   RAID 1, sacrificing atomicity and erasure coding (due to the "write
   hole" problem) for write performance, which becomes close to the
   write performance of a bare hard drive, while still keeping strong
   data consistency between replicas.
!!!!! CAUTION: many letters !!!!!
Introduction
------------
The most impressive IO performance numbers among products which
provide data redundancy are certainly delivered by RAID, and that is
not a surprise: you just write a block to several disks in parallel,
nothing more. Perfect.
Then why can't Ceph achieve the same performance as RAID 1 on the IO
path, doing the plain old write-in-parallel thing, but supplemented
with scalability, CRUSH object placement and all the other beloved
things which make Ceph so powerful?
The answer to this question lies in the replication design used by
Ceph, namely log-based replication (the PG log in Ceph terms), which
is commonly used in storage and database systems and is well described
in many publications [1, 2, 3, 4]. A sequential replicated log indeed
resolves issues related to atomicity, consistency, isolation and
durability (the famous ACID acronym), especially in the DBMS world
where transactions are involved. But Ceph is a storage system, not a
database.
Modern hard drives do not provide atomicity, strict write ordering or
durability. Hardware breaks all the rules and relaxes the constraints
for the sake of performance, and software has to deal with it. Every
modern filesystem can survive crashes (journals, COW, soft-updates and
other approaches are used), and each application which cares about
data consistency and can't rely on the weak guarantees of the POSIX
file API (any DBMS, for example) uses the same methods (journals, COW,
etc) to avoid inconsistency. Why then should a storage solution, which
works below a filesystem and on top of the hardware, provide such
strong ACID properties for IO?
So what is log-based replication, which is at the core of Ceph? Very
simple: duplication (mirroring) of a log which consists of records
describing storage operations. In order to duplicate something we have
to be sure that the object we duplicate has not been changed on the
way, i.e. that copy-paste actually does what it is supposed to do
without nasty surprises. When we have a log of storage operations we
need not only to deliver a recent record to a replica without
corruption (and a wrong order of operations in a log is a corruption)
but also to invoke the operation which that log record describes. Not
only that: a new record in the log and the operation described by that
record should happen in one transaction, atomically: either we see the
record and the operation was performed, or there are no updates in the
log and no consequences of the operation are visible.
Having such a wonderful log (which obviously can survive a crash or
even a nuclear strike) we can very easily answer The Most Important
question for a highly scalable, self-healing storage product: what has
happened on the other replicas while one of them took a nap (crashed).
In other words we need such a complicated mechanism for only one
thing: to tell the exact data difference (e.g. in blocks of data and
operations), i.e. what has changed since the crash.
Short quiz: can Ceph stay Ceph, but without any log-based replication?
Yes, throw it away, no need to know the difference, just copy the
whole bunch of data between replicas on any disaster recovery. Very
slow at resyncing but fast at doing operations, and simple. I'm, of
course, exaggerating, but that is not so far from the truth.
Talking about log-based replication I've never mentioned why it
actually impacts the IO performance. Here is why:
1. Each update operation on a group of replicas (PG in terms of Ceph)
has to be strictly ordered (the log should stay in the same order on
all replicas), i.e. no other operation is allowed until the operation
which has just been started reaches a hard drive with all the cache
flushes involved (anyone who has poked around the code knows the
notorious pglock, which protects the replication log from corruption
and keeps it equal on all the other replicas).
2. Severe restrictions on how the objectstore should actually be
implemented: data and metadata should be fully journaled, because a
log record of a replicated operation is stored along with the data in
an atomic manner. And what is wrong with a journal in terms of
performance? Nothing is wrong, it is just incredibly slow. (As a hint
for the curious: enable the "data=journal" option for an ext4
filesystem, or "allow_cow" for xfs, or just run any load which
performs random writes on BTRFS (COW is always enabled), and, I
promise, you will feel the difference immediately.)
3. Communication with a group of replicas (PG) happens only through
the primary one, where the primary is responsible for keeping the log
in sync with the other replicas. This primary-copy model [7] increases
IO latency by adding one extra network hop.
All of the above about log-based replication in Ceph is about strong
consistency, but not about performance. But can we have both? And what
actually is strong consistency? According to the formal description,
"The protocol is said to support strong consistency if all accesses
are seen by all parallel processes (or nodes, processors, etc.) in the
same order (sequentially)" [6]. In simple words: one thread writes
data to some offset of a file and other threads read from the same
offset; if strong consistency is respected, readers never observe the
writes in the wrong order, e.g.:
  Writer              Reader 1            Reader 2            Reader 3
  ------------------  ------------------  ------------------  ------------------
  write(file, "A");
                      b0 = read(file);    b0 = read(file);
                                          b1 = read(file);
  write(file, "B");
                                                              b0 = read(file);
                      b1 = read(file);                        b1 = read(file);
In this example, readers are able to see a block of a file in the
following states:
  Reader 1     Reader 2     Reader 3
  ---------    ---------    ---------
  b0 == "A"    b0 == "A"    b0 == "B"
  b1 == "B"    b1 == "A"    b1 == "B"
But what should never be observed and can be treated as a corruption
is the following:
b0 == "B"
b1 == "A"
Having that we can develop a simple rule: writes are not blocked and
can come in any order; once a reader observes a change in a block,
further reads of this block see the change. Most likely I am
describing a wheel here which has some vivid name and was deeply
buried in some publication in the early '70s, but I do not care and
call the rule READ RECENT. The READ RECENT rule has one splendid
property: if a read has never happened and the order of writes is
unknown, the writes can be reordered regardless of their actual order.
A practical example: there is a distributed storage where writes are
unordered by nature; a full outage has happened and now the replicas
contain different blocks of data at the same offsets; a read of the
block has never happened after it was updated, thus we are able to
bring the replicas in sync choosing the block from *any* replica.
Marvelous. I will return to this highly important property when I
start describing the new replication model.
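To make the READ RECENT rule a bit more tangible, here is a tiny
illustrative check (hypothetical helpers, not Ceph code): once any
reader has observed version N of a block, no later read of that block
may return an older version.

    # Illustrative only: the READ RECENT invariant as a runtime check.
    # 'version' is whatever monotonically increasing stamp the store
    # attaches to a block update.
    class ReadRecentChecker:
        def __init__(self):
            self.highest_seen = {}   # block offset -> highest version observed

        def on_read(self, offset, version):
            prev = self.highest_seen.get(offset, -1)
            if version < prev:
                raise AssertionError(
                    "READ RECENT violated at offset %d: saw version %d after %d"
                    % (offset, version, prev))
            self.highest_seen[offset] = version

    checker = ReadRecentChecker()
    checker.on_read(0, 1)    # reader observes version 1
    checker.on_read(0, 2)    # newer version: fine
    # checker.on_read(0, 1)  # would raise: the block "went back in time"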
It turns out that strong consistency is not about strict request
ordering in the whole group of replicas (PG); it is only about the
order of reads against writes to the same block of data. That's it. So
the question is left unanswered: can we have both strong consistency
and high IO performance?
In order to answer this question, it is necessary to once again state
the main requirements for the IO replication model, which remove the
strict restrictions imposed by the log-based replication, positively
affect performance and do not contradict strong consistency:
1. Get rid of extra hop on fast IO path by doing client-driven
replication, at least when group of replicas (PG) is in active state
and not doing any sort of disaster recovery.
2. Write requests are not strictly ordered to different offsets of an
object and can be submitted to a hard drive in parallel.
3. Get rid of any journaling on fast IO path on objectstore side if
possible.
These are the major constraint relaxations which can bring storage IO
performance close to bare hard drive bandwidth while keeping strong
consistency. In the next chapter I will cover all the requirements and
show how they can be implemented in practice.
Separation of management requests from IO requests
--------------------------------------------------
Requests in a storage cluster can be divided into two parts:
management requests (object creation/deletion/listing, attribute
setting/getting, lock management, etc) and IO requests, which read or
modify the content of an object. The most prevalent requests are IO
read and write requests, whose latency affects the overall performance
of the entire cluster. The proposed request separation also implies a
clear separation of metadata and data, which should be treated
differently on the objectstore side; in particular, no journal should
be involved in data modification (this will be discussed in detail
below).
Separating IO from management requests makes it possible to have
different primary replicas in a group of replicas (PG). For example,
in order to distribute the read load, the primary can be chosen by
object id using consistent hashing (e.g. the hash ring proposed in
[8]) instead of having a single primary. Write requests require
replication, and as was mentioned earlier the client-driven
replication model can be chosen in order to avoid the latency of the
extra hop.
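As a trivial illustration of spreading the read load (not the actual
hashing scheme from [8]), the read primary for an object could be
derived from the object id, so different objects in the same PG are
served by different replicas:

    # Sketch: pick a per-object read primary from the PG's acting set.
    # A real implementation would use a proper hash ring; this just
    # shows the idea of deriving the primary from the object id.
    import hashlib

    def read_primary(object_id, acting_set):
        h = int(hashlib.md5(object_id.encode()).hexdigest(), 16)
        return acting_set[h % len(acting_set)]

    acting_set = ["osd.3", "osd.7", "osd.12"]
    print(read_primary("rbd_data.1234.0000000000000001", acting_set))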
Having different paths for the different types of requests can be
summarised as follows:
1. Management requests go through a persistent primary replica, and
log-based replication keeps strict ordering of operations in the
entire group (PG). No changes to the original Ceph architecture are
proposed here.
2. Read object requests are sent to different primaries according to
some consistent hashing rule [8].
3. Write requests can follow two IO paths. On one path the client is
fully responsible for request replication and delivers requests to all
replicas in a group (PG); this path is always taken when the group of
replicas (PG) is in an active and healthy state. The second IO path
always goes through a primary, so all requests are equally ordered (as
is currently implemented in Ceph); this path is taken in order to
simplify all the corner cases when a group of replicas (PG) is not in
an active state and performs resynchronization.
Hybrid client-driven replication
--------------------------------
Client-driven replication is quite self-descriptive: the client is
responsible for data delivery to the replicas in a group (PG),
avoiding the extra network hop which exists in the primary-copy model.
If the client communicates with the replicas in a group (PG) directly
there is no longer a central point of synchronization, so the one
possible issue that needs to be considered is that write requests from
several clients may arrive at the replicas in different orders:
  Client 1               Client 2
  --------               --------
  sends A to replica1    sends B to replica1
  sends A to replica2    sends B to replica2

  Replica 1              Replica 2
  ---------              ---------
  A                      B
  B                      A

  ** replica1 writes B to the disk and then overwrites B with A
     replica2 writes A to the disk and then overwrites A with B
In this example both clients access the same block of the same object,
but the write requests are reordered differently by the network. If
this happens, the replicas can contain different data and are out of
sync. This issue has very bad consequences: future reads are
undefined, especially if one of the replicas crashes and clients start
reading from another. This is definitely data corruption. In order to
solve the issue the order of writes to the same offset should be
synchronized between replicas. Synchronization can be performed by
marking each request with a timestamp, assuming that time flows
equally on all clients. Sub-microsecond time synchronization is not a
problem [9], especially when there is a central point for all members
- a distributed state machine cluster, namely the cluster of monitors
in terms of Ceph.
Having time synchronized, the client marks write requests with a
timestamp. According to the Thomas Write Rule [10], outdated requests
are discarded by timestamp-based concurrency control, keeping the
correct order. However, in the case of client-driven replication it is
impossible to discard a change on all replicas simultaneously. Instead
the outdated request has to be repeated for all replicas in a group
(PG); here is an example:
Client 1 Client 2
-------- --------
marks A with stamp 1 marks B with stamp 2
sends A to replica1 sends B to replica1
sends A to replica2 sends B to replica2
Replica 1 Replica 2
--------- --------------
-- on disk --
A B
-- on disk --
B A
** replica1 writes B to the disk and rejects A with RETRY
replica2 writes A to the disk and then overwrites A with B
Request A is not written by replica1, because it is older than the B
request, which has been applied recently. Instead replica1 replies to
the A request with the RETRY error, which forces the client to mark
request A with a new timestamp and resend it to the whole group of
replicas (PG):
Client 1
--------
marks A with stamp 3
sends A to replica1
sends A to replica2
Replica 1 Replica 2
--------- ---------
A A
-- on disk --
B
-- on disk --
B A
** replica1 overwrites B with A
replica2 overwrites B with A
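A rough sketch of the replica-side rule and the client retry loop
(illustrative Python, not Ceph code): each replica remembers the
newest timestamp applied to a block and rejects older writes with
RETRY; the client then restamps the request and resends it to the
whole group.

    # Sketch of timestamp-based write acceptance on a replica (Thomas
    # Write Rule with an explicit RETRY instead of silently dropping
    # the older write).
    import itertools

    class Replica:
        def __init__(self):
            self.block_stamp = {}   # offset -> newest timestamp applied
            self.data = {}          # offset -> block contents

        def write(self, offset, payload, stamp):
            if stamp < self.block_stamp.get(offset, 0):
                return "RETRY"      # an older write lost the race
            self.block_stamp[offset] = stamp
            self.data[offset] = payload
            return "OK"

    clock = itertools.count(1).__next__   # stand-in for synchronized client clocks

    def client_write(replicas, offset, payload):
        # Client-driven replication: fan out to every replica; if any
        # one rejects, restamp and resend to the *whole* group so that
        # all replicas converge on the same final value.
        while True:
            stamp = clock()
            results = [r.write(offset, payload, stamp) for r in replicas]
            if all(res == "OK" for res in results):
                return

    pg = [Replica(), Replica()]
    client_write(pg, 0, b"A")
    client_write(pg, 0, b"B")
    assert pg[0].data[0] == pg[1].data[0]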
The next question arises: why do we need to repeat the write operation
on all replicas, and not only on those that have just replied with the
error? Let's consider overlapping data (here I do not consider the
reason why clients send overlapping write requests concurrently;
obviously this is a bug on the client side, but nevertheless we keep
the replicas in sync):
Client 1 marks AAAAA with stamp 1
Client 2 marks BBB with stamp 2
Client 3 marks C with stamp 3

  Replica 1      Replica 2
  ---------      ---------
  C     3
  BBB   2        C     3
  AAAAA 1        AAAAA 1
  ---------      ---------   below the line is the state on disk
  ABCBA          AACAA

  ** replica1 writes AAAAA, then overwrites the middle with BBB and
     then with C
     replica2 writes AAAAA and overwrites the middle with C;
     the BBB request is rejected
Client2 receives the RETRY error and has to repeat BBB:

Client 2 marks BBB with stamp 4

  Replica 1      Replica 2
  ---------      ---------
  BBB   4        BBB   4
  -- on disk --  -- on disk --
  C     3        C     3
  BBB   2        AAAAA 1
  AAAAA 1
  ---------      ---------   below the line is the state on disk
  ABBBA          ABBBA
After the retry the blocks on the replicas become synchronized and the
write request is considered completed and persistent on the drive.
Since concurrent writes should never happen with well-written client
software (I expect distributed filesystems to use lock primitives in
order to prevent concurrent access) I do not expect any performance
degradation because of frequent write retries, so the algorithm acts
as a protection which keeps replicas in sync.
I would like to emphasize several points that follow from the
algorithm described above:
1. Time synchronization on clients does not need to be very accurate;
even accuracy of hundreds of milliseconds is enough. The described
time-based algorithm does not depend on actual physical time, nor does
it depend on the real (absolute) order of writes sent by concurrent
clients. Instead the algorithm forces the distributed cluster members
to have a single view of the request order, which is enough to decide
whether a request should be rejected or executed.
2. Hybrid client-driven replication does not involve clients in the
data recovery process, leaving this job to the replicas (exactly as it
is right now). Here I describe the fast IO path only when the group of
replicas (PG) is in an active and fully synchronized state. When the
group of replicas (PG) is in a resynchronization state, or the client
sends management-type requests (not IO), then for the sake of
simplicity all communication goes through the main primary replica, as
is currently implemented in Ceph.
Crash of a client in the middle of a replication
------------------------------------------------
Client-driven replication, in contrast with the primary-copy model,
has another major issue which has to be considered in detail: a crash
of a client in the middle of a replication, that is, when a client
sends a write request to one of the replicas in a group (PG) and then
crashes, leaving the other replicas out of sync. Since there is no
central point which controls the whole replication process, the group
of replicas has no way to know whether the write request was confirmed
by the whole group or whether something happened to the client and a
special action should be performed.
The issue can be solved by an immediate confirmation of a successfully
completed write request, sent by the non-primary replicas to the
primary one. As was mentioned earlier, each object has its own primary
which serves reads, so write confirmations for a modified object are
sent to that particular primary, which counts the number of
confirmations for the modified object. If confirmations do not come
from all replicas in the group, then after a timeout elapses the
primary can start synchronizing the object on its own.
In order to be sure that confirmations are accounted correctly, each
of them is marked with the timestamp taken from the source write
request, so reordered and outdated confirmations can be discarded.
I would like to stress the point that the accounting of confirmations
happens in RAM only, at runtime, and does not involve any disk
operations. If one of the replicas crashes after a client crash, then
the whole group of replicas (PG) changes its state, further write
requests take the other path and are forwarded to the main primary in
the group, and object synchronization is started anyway.
The READ RECENT rule in action
------------------------------
In client-driven replication the order in which replicas receive
requests and update blocks is out of our control, so readers can
receive inconsistent data (also called the weak consistency model
[8]). To ensure that a block, once read, will never return to its
previous value (the READ RECENT rule mentioned earlier), all attempts
to read an inconsistent block (one not confirmed by all replicas) have
to be delayed or rejected with an explicit error. Consider this
scenario:
1. Client1 fans out write requests to all replicas.
2. Write request reaches only the primary one, client1 crashes.
3. Concurrent client2 reads the same block, since primary replica has
applied write request from client1, client2 reads the requested block
successfully and observes the change made by client1.
4. Primary replica crashes and new primary is selected, resync process
in a group (PG) is started.
5. Concurrent client2 reads again the same block from a new primary,
but since no one else observes the change written to previous primary
made by client1, client2 receives old data.
This scenario breaks the READ RECENT rule and introduces data
corruption. In order to guarantee that the block is consistent, the
read in step 3 should be delayed or rejected until all replicas
confirm that the block is persistent. This can be achieved by the same
confirmation mechanism described earlier: a block is treated as
inconsistent if it was modified and not all confirmations have been
received.
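For illustration, here is roughly the in-memory accounting an object's
read primary could keep (hypothetical names, not Ceph code): a block
stays inconsistent until confirmations from all replicas arrive, and
reads of such a block are deferred.

    # Sketch: in-memory confirmation accounting on an object's read
    # primary. A block stays "inconsistent" until every replica has
    # confirmed the write with the matching timestamp; reads of such
    # blocks are deferred.
    class ObjectPrimary:
        def __init__(self, replica_count):
            self.replica_count = replica_count
            self.pending = {}   # offset -> {"stamp": ts, "acks": set of replica ids}

        def on_write_seen(self, offset, stamp):
            self.pending[offset] = {"stamp": stamp, "acks": set()}

        def on_confirmation(self, offset, stamp, replica_id):
            entry = self.pending.get(offset)
            if not entry or stamp < entry["stamp"]:
                return                      # outdated confirmation, discard
            entry["acks"].add(replica_id)
            if len(entry["acks"]) == self.replica_count:
                del self.pending[offset]    # block is consistent again

        def may_read(self, offset):
            # READ RECENT: never serve a block that only some replicas hold.
            return offset not in self.pending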
I do not expect any performance degradation for generic read loads and
can quote Sage here: “Reads in shared read/write workloads are usually
unaffected by write operations. However, reads of uncommitted data can
be delayed until the update commits. This increases read latency in
certain cases, but maintains a fully consistent behavior for
concurrent read and write operations in the event that all OSDs in the
placement group simultaneously fail and the write ack is not
delivered” [7].
Another issue, which has to be investigated in detail, is a full
outage of a group of replicas (PG). When the replicas come back, some
of the objects can be out of sync, and without a versioned log there
is no way to distinguish which blocks were updated most recently and
thus have a higher timestamp. As was mentioned earlier, the READ
RECENT rule has one property: if a block has not been read since the
modification, then it is irrelevant which version of the block is
taken as the master copy. Thus, if the read of an inconsistent block
is delayed until it is persistent on all replicas, any replica in the
group may be a “donor” of this particular block.
From the statement above it follows that non-atomic or partial writes
can happen. Indeed, a filesystem or an application which acts as a
cluster client then requires special handling of a write failure, such
as log replay, to deal with the inconsistencies left behind. But it
can assume that each block on all replicas in a group (PG) has just
one value, so the group remains synchronized and future reads won’t
report different content in case of a change of the primary replica
which serves reads.
Journal-less objectstore for IO path
------------------------------------
Write atomicity requires certain support from the underlying hardware
or software. Different techniques like journals, COW or soft-updates
can be used in order to guarantee atomicity, but without a doubt this
feature comes at a high price in IO performance. Since each modern
filesystem, or a DBMS application itself, can take care of data
inconsistency, the atomicity constraint for the objectstore can be
relaxed.
In order to provide performance close to that of a bare hard drive,
each replica has to submit writes to the underlying hardware
immediately and without any requirements for request ordering or data
atomicity. As was mentioned earlier, the mechanism of delaying reads
of non-persistent blocks makes synchronization of objects after a
complete shutdown possible without knowing where the most recent
update is. However, a full synchronization can take days if an object
is big enough. The problem of a long resync can be solved by
bookkeeping a bitmap of possibly-out-of-sync blocks of an object. The
algorithm can be described in just one sentence: “Bits are set before
a write commences, and are cleared when there have been no writes for
a while.” [11].
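A minimal sketch of that bookkeeping (block size and quiet period are
made-up values):

    # Sketch of the write-intent bitmap: mark a block possibly-out-of-sync
    # before writing it, clear the mark once it has been quiet for a while.
    import time

    BLOCK_SIZE = 64 * 1024      # illustrative granularity
    QUIET_PERIOD = 5.0          # seconds without writes before clearing a bit

    class WriteIntentBitmap:
        def __init__(self):
            self.last_write = {}    # block index -> time of last write

        def before_write(self, offset, length):
            first = offset // BLOCK_SIZE
            last = (offset + length - 1) // BLOCK_SIZE
            for block in range(first, last + 1):
                self.last_write[block] = time.monotonic()   # set the bit

        def dirty_blocks(self):
            # Bits are cleared lazily: anything untouched for QUIET_PERIOD
            # is considered in sync again.
            now = time.monotonic()
            return [b for b, t in self.last_write.items()
                    if now - t < QUIET_PERIOD]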
In a highly distributed storage solution, where a group of replicas
(PG) can consist of different replicas during the whole life of a
cluster and the members of a group are constantly changing, it is
important not only to track out-of-sync blocks of an object, but also
to keep versions of such changes, so as to answer the following
question with minimal computational cost: what has changed in the
object between versions N and M?
Having versions of block changes in mind, one important modification
to the bitmap algorithm can be proposed: the bitmap of out-of-sync
blocks is stored in a file with the timestamp of the first write
request in its name; after a certain number of changed blocks each
replica rotates the bitmap file, so that a new bitmap file is created
with the timestamp of the first write request which initiated the
rotation. The major difference from what RAID 1 does is that bits are
never cleared; a new bitmap file is created instead. Bits of modified
but not yet persistent blocks (for which not all confirmations have
been received from replicas) migrate to the new bitmap file, in order
to guarantee that block synchronization will take place in the future
in case of a possible replica failure.
As the timestamps of client write requests always move forward and the
time on clients is synchronized (a major requirement of the proposed
hybrid client-driven replication model), each replica will have a
similar view of the block changes, so the question about the data
difference between two versions can be easily answered.
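To sketch the rotation idea (the file naming and the rotation
threshold are invented for illustration): each bitmap is opened with
the timestamp of the first write it covers, a new one is started after
enough blocks have changed, unconfirmed bits are carried over, and the
difference between two versions is the union of the bitmaps opened
since the older one.

    ROTATE_AFTER = 1024   # illustrative: open a new bitmap after this many dirty blocks

    class VersionedBitmaps:
        def __init__(self):
            self.bitmaps = []   # list of (first_write_stamp, set of dirty block indexes)

        def mark(self, block, stamp, unconfirmed_blocks):
            if not self.bitmaps or len(self.bitmaps[-1][1]) >= ROTATE_AFTER:
                # Open a new bitmap "file" named after this write's timestamp
                # and carry over blocks that are still unconfirmed.
                self.bitmaps.append((stamp, set(unconfirmed_blocks)))
            self.bitmaps[-1][1].add(block)

        def changed_since(self, stamp):
            # "What changed since version N?": the union of all bitmaps
            # opened at or after that timestamp.
            changed = set()
            for first_stamp, blocks in self.bitmaps:
                if first_stamp >= stamp:
                    changed |= blocks
            return changed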
The proposed rotation of bitmap files can occupy extra disk space on a
replica, especially since new files are created and never deleted.
Deleting bitmap files of old versions is not a problem: N files can be
kept; the rotation process spawns a new bitmap file with a recent
timestamp, and meanwhile the file with the oldest timestamp is
removed. If there is ever a need to restore the changes of an object
version whose bitmap file was removed, then a full object
resynchronization should be performed.
Summary
-------
The proposed changes should relax a lot of constraints on the fast IO
path in the current Ceph implementation, keeping strong data
consistency between replicas, and bring the performance of a single
client close to bare hard drive bandwidth.
Even if there are logical holes in the model described above (there
are), I still believe that eliminating them will not be impossible.
PS. And yes, because of the “write hole” problem [12], erasure-coded
replication won’t survive a full group outage without atomicity
guarantees on the objectstore side. Sorry for that.
--
Roman
[1] R. Golding, Weak-consistency group communication and membership,
PhD thesis, University of California, Santa Cruz, 1992.
[2] Petersen et al., “Flexible Update Propagation for weakly
consistent replication”, Proc. of the 16th ACM Symposium on Operating
Systems Principles (SOSP), 1997.
[3] M. Rabinovich, N. Gehani, A. Kononov, “Scalable Update Propagation
in Epidemic Replicated Databases”, Advances in Database Technology -
EDBT'96, Lecture Notes in Computer Science Vol. 1057, Springer,
pp. 207-222.
[4] G. Wuu, A. Bernstein, “Efficient Solutions to the Replicated Log
and Dictionary Problems”, Proceedings of the Third ACM Symposium on
Principles of Distributed Computing, August 1984, pp. 233-242.
[6] https://en.wikipedia.org/wiki/Strong_consistency
[7] Sage A. Weil. Ceph: Reliable, Scalable, And High-Performance
Distributed Storage, PhD thesis, University of California, Santa Cruz,
2007.
[8] Jiayuan Zhang, Yongwei Wu, Yeh-Ching Chung. PROAR: A Weak
Consistency Model For Ceph
[9] Precision Time Protocol (PTP/IEEE-1588)
[10] R. H. Thomas, “A majority consensus approach to concurrency control
for multiple copy databases,” ACM Trans. Database Syst., vol. 4, no. 2,
pp. 180–209, 1979.
[11] Cluster support for MD/RAID 1, https://lwn.net/Articles/674085/
[12] https://en.wikipedia.org/wiki/RAID
This is the seventh update to the Ceph Nautilus release series. This is
a hotfix release primarily fixing a couple of security issues. We
recommend that all users upgrade to this release.
Notable Changes
---------------
* CVE-2020-1699: Fixed a path traversal flaw in Ceph dashboard that could
  allow for potential information disclosure (Ernesto Puerta)
* CVE-2020-1700: Fixed a flaw in RGW beast frontend that could lead to
  denial of service from an unauthenticated client (Or Friedmann)
--
David Galloway
Systems Administrator, RDU
Ceph Engineering
IRC: dgalloway
I'm trying to get the SSH orchestrator running for testing the MDS
Autoscaler <https://github.com/ceph/ceph/pull/32731>.
I start the vstart cluster with:
$ MDS=3 ../src/vstart.sh -d -b -l -n --without-dashboard --cephadm
Processes seem to launch without errors until:
...
/home/mchangir/work/mchangir-ceph.git/build/bin/ceph -c
/home/mchangir/work/mchangir-ceph.git/build/ceph.conf -k
/home/mchangir/work/mchangir-ceph.git/build/keyring fs volume create a
Error EINVAL: Remote method threw exception: TypeError: %d format: a number
is required, not NoneType
Still digging at it, but no leads so far.
Let me know where I should be looking for this one.
The mgr plugin and related utilities Python code looks okay to me, but
I'm not 100% sure.
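For what it's worth, the exception itself is just the generic Python
failure you get when a None ends up in a %d format, so presumably some
count or placement value is never filled in:

    >>> "%d" % None
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: %d format: a number is required, not NoneType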
-----
Will this failure get in my way for testing the MDS Autoscaler ?
Also, `ceph status` shows the system in HEALTH_WARN state like so:
$ ./bin/ceph status
2020-01-28T18:03:48.542+0530 7f4ddb872700 -1 WARNING: all dangerous and
experimental features are enabled.
2020-01-28T18:03:48.558+0530 7f4dda610700 -1 WARNING: all dangerous and
experimental features are enabled.
  cluster:
    id:     e717ee71-d1e3-4be4-b771-1fface003e13
    health: HEALTH_WARN
            1 stray host(s) with 10 service(s) not managed by cephadm
            10 stray service(s) not managed by cephadm

  services:
    mon: 3 daemons, quorum a,b,c (age 2h)
    mgr: x(active, since 2h)
    mds: a:1 {0=a=up:active} 2 up:standby
    osd: 3 osds: 3 up (since 2h), 3 in (since 2h)

  data:
    pools:   2 pools, 64 pgs
    objects: 22 objects, 2.2 KiB
    usage:   6.0 GiB used, 297 GiB / 303 GiB avail
    pgs:     64 active+clean
-----
Also,
Is this the correct way to remove an mds via the orchestrator:
$ ./bin/ceph orchestrator mds rm c
Error ENOENT: Unable to find mds.c[-*] daemon(s)
but the daemons are indeed running:
$ pgrep -a ceph-
1310763 /usr/libexec/platform-python -s /usr/bin/ceph-crash -n
client.crash.localhost.localdomain
1624818 /home/mchangir/work/mchangir-ceph.git/build/bin/ceph-mon -i a -c
/home/mchangir/work/mchangir-ceph.git/build/ceph.conf
1624861 /home/mchangir/work/mchangir-ceph.git/build/bin/ceph-mon -i b -c
/home/mchangir/work/mchangir-ceph.git/build/ceph.conf
1624904 /home/mchangir/work/mchangir-ceph.git/build/bin/ceph-mon -i c -c
/home/mchangir/work/mchangir-ceph.git/build/ceph.conf
1626022 /home/mchangir/work/mchangir-ceph.git/build/bin/ceph-osd -i 0 -c
/home/mchangir/work/mchangir-ceph.git/build/ceph.conf
1626376 /home/mchangir/work/mchangir-ceph.git/build/bin/ceph-osd -i 1 -c
/home/mchangir/work/mchangir-ceph.git/build/ceph.conf
1626707 /home/mchangir/work/mchangir-ceph.git/build/bin/ceph-osd -i 2 -c
/home/mchangir/work/mchangir-ceph.git/build/ceph.conf
1626895 /home/mchangir/work/mchangir-ceph.git/build/bin/ceph-mds -i a -c
/home/mchangir/work/mchangir-ceph.git/build/ceph.conf
1626950 /home/mchangir/work/mchangir-ceph.git/build/bin/ceph-mds -i b -c
/home/mchangir/work/mchangir-ceph.git/build/ceph.conf
1627005 /home/mchangir/work/mchangir-ceph.git/build/bin/ceph-mds -i c -c
/home/mchangir/work/mchangir-ceph.git/build/ceph.conf
1676260 /home/mchangir/work/mchangir-ceph.git/build/bin/ceph-mgr -i x -c
/home/mchangir/work/mchangir-ceph.git/build/ceph.conf
-----
Is the exception above caused by a missing placement spec?
Ideally I'd like to launch a standby MDS without any affinity to a
filesystem or an OSD.
Any help to get this correctly implemented/tested will be appreciated.
--
Milind
Details of this release are summarized here:
https://tracker.ceph.com/issues/42377#note-2
(This is a work in progress and some tests are still in the queue; I
am on PTO starting next Thursday and am trying to start the
review/approval/fix process early.)
rados - Neha approve?
rgw - Casey approve ?
rbd - Jason approve?
krbd - Ilya approve?
fs - Patrick approve?
kcephfs - Patrick approve?
multimds - still running
ceph-deploy - Sage approve?
ceph-disk - Nathan was looking?
upgrade/client-upgrade-hammer (luminous) - Sage approve?
upgrade/client-upgrade-jewel (luminous) - Sage approve?
upgrade/luminous-p2p - Sage approve?
upgrade/jewel-x (luminous) - Sage approve?
upgrade/kraken-x (luminous) - REMOVED/ N/A
powercycle - still running
ceph-ansible - Bred FYI and ?
upgrade/luminous-x (mimic) PASSED
upgrade/luminous-x (nautilus) - one job rerunning, otherwise PASSED
ceph-volume - PASSED, thx Jan for the fix
(please speak up if something is missing)
PS: We'd like to release it this week.
Thx
YuriW
The following PR implements bucket granularity sync. We are aiming for
this feature to land in time for Octopus.
https://github.com/ceph/ceph/pull/31686
Bucket granularity sync provides fine-grained control of data movement
between buckets in different zones. It extends the existing zone sync
mechanism. At its core the feature modifies the way the rgw sync
process treats buckets. Previously buckets were treated symmetrically,
that is, each (data) zone holds a mirror of the bucket that should be
the same as in all the other zones. Now it is possible for buckets to
diverge, and a bucket can pull data from other buckets (ones that
don't share its name or its ID) in different zones.
The sync process previously assumed that the bucket sync source and
the bucket sync destination always referred to the same bucket; that
is no longer the case.
A new sync policy that can supersede the old zonegroup coarse
configuration (sync_from*) was implemented. The sync policy can be
configured at the zonegroup level (and if it is configured it replaces
the old style config), but it can also be configured at the bucket
level.
In the new sync policy we can define multiple groups that contain
lists of data-flow configurations and lists of pipe configurations.
The data-flow entries define the flow of data between the different
zones: they can define a symmetrical data flow, in which multiple
zones sync data from each other, or a directional data flow, in which
data moves one way from one zone to another.
A pipe defines the actual buckets that can use these data flows, and
the properties that are associated with it (for example: a source
object prefix).
A sync policy group can be in 3 states:
enabled: sync is allowed and enabled
allowed: sync is allowed
forbidden: sync (as defined by this group) is not allowed and can
override other groups.
A policy can be defined at the bucket level. A bucket level sync
policy inherits the data flow of the zonegroup policy, and can only
define a subset of what the zonegroup allows.
A wildcard zone and a wildcard bucket parameter in the policy define
all relevant zones or all relevant buckets; in the context of a bucket
policy this means the current bucket instance.
A disaster recovery configuration where entire zones are mirrored
doesn't require configuring anything on the buckets. However, for
fine-grained bucket sync it would be better to configure the pipes to
be synced by allowing them at the zonegroup level (status=allowed,
e.g. using wildcards), and to only enable the specific sync at the
bucket level (status=enabled). If needed, the policy at the bucket
level can limit the data movement to specific relevant zones.
Any changes to the zonegroup policy need to be applied on the
zonegroup master zone and require a period update and commit. Changes
to a bucket policy also need to be applied on the zonegroup master
zone; these changes are handled dynamically by rgw.
New radosgw-admin commands to control this feature were added:
sync policy get
sync group <create | modify | get | remove>
sync group flow <create | remove>
sync group pipe <create | remove>
sync info
Most are self-explanatory. The notable one is sync info, which
provides info about the expected sources and targets of the sync
process at the current zone (or at another, effective zone), either at
the zone level or at the bucket level.
Since a bucket can now define a policy that moves data from it towards
a different bucket at a different zone, when the policy is created we
also generate a list of bucket dependencies that are used as hints
when a sync of any particular bucket happens. The fact that a bucket
references another bucket doesn't mean it actually syncs to/from it,
as the data flow might not permit it.
Bucket sync can also be limited to specific source object prefixes.
The S3 bucket replication API has also been implemented, and it allows
users to create replication rules between different buckets. Note
though that while the AWS replication feature allows bucket
replication within the same zone, rgw does not allow it at the moment.
However, the rgw API also adds a new 'Zone' array that allows users to
select to which zones the specific bucket will be synced.
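For example, a rough sketch of what a client call could look like with
boto3 against the rgw endpoint (the endpoint, credentials and the
rgw-specific zone selection shown here are illustrative and may not
match the final API):

    # Sketch: creating a replication rule on a bucket via the S3 API.
    # The standard AWS fields are shown; the rgw-specific way of
    # listing destination zones is hypothetical.
    import boto3

    s3 = boto3.client("s3",
                      endpoint_url="http://rgw.us-east.example:8000",
                      aws_access_key_id="ACCESS",
                      aws_secret_access_key="SECRET")

    s3.put_bucket_replication(
        Bucket="buck",
        ReplicationConfiguration={
            "Role": "",            # required by the S3 schema; what rgw expects here is not covered above
            "Rules": [{
                "ID": "mirror-to-west",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Destination": {
                    "Bucket": "arn:aws:s3:::buck",
                    # Hypothetical rgw extension: restrict which zones
                    # the bucket is synced to.
                    # "Zone": ["us-west"],
                },
            }],
        },
    )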
Following are some usage examples:
The system in these examples includes 3 zones: us-east (the master
zone), us-west, us-west-2.
* Example 1: Two zones, complete mirror:
This is similar to current sync capabilities, but being done via the
new sync policy engine. Note that changes to the zonegroup sync policy
require a period update and commit.
[us-east] $ radosgw-admin sync group create --group-id=group1 --status=allowed
[us-east] $ radosgw-admin sync group flow create --group-id=group1 \
--flow-id=flow-mirror --flow-type=symmetrical \
--zones=us-east,us-west
[us-east] $ radosgw-admin sync group pipe create --group-id=group1 \
--pipe-id=pipe1 --source-zones='*' \
--source-bucket='*' --dest-zones='*' \
--dest-bucket='*'
[us-east] $ radosgw-admin sync group modify --group-id=group1 --status=enabled
[us-east] $ radosgw-admin period update --commit
$ radosgw-admin sync info --bucket=buck
{
"sources": [
{
"id": "pipe1",
"source": {
"zone": "us-west",
"bucket": "buck:115b12b3-....4409.1"
},
"dest": {
"zone": "us-east",
"bucket": "buck:115b12b3-....4409.1"
},
"params": {
...
}
}
],
"dests": [
{
"id": "pipe1",
"source": {
"zone": "us-east",
"bucket": "buck:115b12b3-....4409.1"
},
"dest": {
"zone": "us-west",
"bucket": "buck:115b12b3-....4409.1"
},
...
}
],
...
}
Note that the "id" field in the output above reflects the pipe rule
that generated that entry, a single rule can generate multiple sync
entries as can be seen in the example.
[us-west] $ radosgw-admin sync info --bucket=buck
{
"sources": [
{
"id": "pipe1",
"source": {
"zone": "us-east",
"bucket": "buck:115b12b3-....4409.1"
},
"dest": {
"zone": "us-west",
"bucket": "buck:115b12b3-....4409.1"
},
...
}
],
"dests": [
{
"id": "pipe1",
"source": {
"zone": "us-west",
"bucket": "buck:115b12b3-....4409.1"
},
"dest": {
"zone": "us-east",
"bucket": "buck:115b12b3-....4409.1"
},
...
}
],
...
}
* Example 2: Directional entire zone backup
Also similar to current sync capabilities. Here we add a third zone,
us-west-2, that will be a replica of us-west, but data will not be
replicated back from it.
[us-east] $ radosgw-admin sync group flow create --group-id=group1 \
--flow-id=us-west-backup --flow-type=directional \
--source-zone=us-west --dest-zone=us-west-2
[us-east] $ radosgw-admin period update --commit
Note that us-west has two dests:
[us-west] $ radosgw-admin sync info --bucket=buck
{
"sources": [
{
"id": "pipe1",
"source": {
"zone": "us-east",
"bucket": "buck:115b12b3-....4409.1"
},
"dest": {
"zone": "us-west",
"bucket": "buck:115b12b3-....4409.1"
},
...
}
],
"dests": [
{
"id": "pipe1",
"source": {
"zone": "us-west",
"bucket": "buck:115b12b3-....4409.1"
},
"dest": {
"zone": "us-east",
"bucket": "buck:115b12b3-....4409.1"
},
...
},
{
"id": "pipe1",
"source": {
"zone": "us-west",
"bucket": "buck:115b12b3-....4409.1"
},
"dest": {
"zone": "us-west-2",
"bucket": "buck:115b12b3-....4409.1"
},
...
}
],
...
}
Whereas us-west-2 has only source and no destinations:
[us-west-2] $ radosgw-admin sync info --bucket=buck
{
"sources": [
{
"id": "pipe1",
"source": {
"zone": "us-west",
"bucket": "buck:115b12b3-....4409.1"
},
"dest": {
"zone": "us-west-2",
"bucket": "buck:115b12b3-....4409.1"
},
...
}
],
"dests": [],
...
}
* Example 3: Mirror a specific bucket
Using the same group configuration, but this time switching it to
'allowed' state, which means that sync is allowed but not enabled.
[us-east] $ radosgw-admin sync group modify --group-id=group1 --status=allowed
[us-east] $ radosgw-admin period update --commit
And we will create a bucket-level policy rule for the existing bucket
buck2. Note that the bucket needs to exist before the policy can be
set, and that admin commands that modify bucket policies need to run
on the master zone; however, they do not require a period update.
There is no need to change the data flow, as it is inherited from the
zonegroup policy. A bucket policy flow can only be a subset of the
flow defined in the zonegroup policy. The same goes for pipes,
although a bucket policy can enable pipes that are not enabled (albeit
not forbidden) in the zonegroup policy.
[us-east] $ radosgw-admin sync group create --bucket=buck2 \
--group-id=buck2-default --status=enabled
[us-east] $ radosgw-admin sync group pipe create --bucket=buck2 \
--group-id=buck2-default --pipe-id=pipe1 \
--source-zones='*' --dest-zones='*'
* Example 4: Limit bucket sync to specific zones:
This will only sync buck3 to us-east (from any zone that flow allows
to sync into us-east).
[us-east] $ radosgw-admin sync group create --bucket=buck3 \
--group-id=buck3-default --status=enabled
[us-east] $ radosgw-admin sync group pipe create --bucket=buck3 \
--group-id=buck3-default --pipe-id=pipe1 \
--source-zones='*' --dest-zones=us-east
* Example 5: sync from a different bucket
Note that bucket sync only works (currently) across zones and not
within the same zone.
Set buck4 to pull data from buck5
[us-east] $ radosgw-admin sync group create --bucket=buck4 \
--group-id=buck4-default --status=enabled
[us-east] $ radosgw-admin sync group pipe create --bucket=buck4 \
--group-id=buck4-default --pipe-id=pipe1 \
--source-zones='*' --source-bucket=buck5 \
--dest-zones='*'
We can also limit it to specific zones; for example the following will
only sync data originating in us-west:
[us-east] $ radosgw-admin sync group pipe modify --bucket=buck4 \
--group-id=buck4-default --pipe-id=pipe1 \
--source-zones=us-west --source-bucket=buck5 \
--dest-zones='*'
Checking the sync info for buck5 on us-west is interesting:
[us-west] $ radosgw-admin sync info --bucket=buck5
{
"sources": [],
"dests": [],
"hints": {
"sources": [],
"dests": [
"buck4:115b12b3-....14433.2"
]
},
"resolved-hints-1": {
"sources": [],
"dests": [
{
"id": "pipe1",
"source": {
"zone": "us-west",
"bucket": "buck5"
},
"dest": {
"zone": "us-east",
"bucket": "buck4:115b12b3-....14433.2"
},
...
},
{
"id": "pipe1",
"source": {
"zone": "us-west",
"bucket": "buck5"
},
"dest": {
"zone": "us-west-2",
"bucket": "buck4:115b12b3-....14433.2"
},
...
}
]
},
"resolved-hints": {
"sources": [],
"dests": []
}
}
Note that there are resolved hints, which means that the bucket buck5
found out about buck4 syncing from it indirectly, and not from its own
policy (the policy for buck5 itself is empty).
* Example 6: Sync to different bucket
The same mechanism can work for configuring data to be synced to (vs.
synced from, as in the previous example). Note that internally data is
still pulled from the source at the destination zone.
Set buck6 to "push" data to buck5:
[us-east] $ radosgw-admin sync group create --bucket=buck6 \
--group-id=buck6-default --status=enabled
[us-east] $ radosgw-admin sync group pipe create --bucket=buck6 \
--group-id=buck6-default --pipe-id=pipe1 \
--source-zones='*' --source-bucket='*' \
--dest-zones='*' --dest-bucket=buck5
A wildcard bucket name means the current bucket in the context of
bucket sync policy.
Combined with the configuration in Example 5, we can now write data to
buck6 on us-east, data will sync to buck5 on us-west, and from there
it will be distributed to buck4 on us-east, and on us-west-2.
* Example 7: source filters
Sync from buck8 to buck9, but only objects that start with 'foo/':
[us-east] $ radosgw-admin sync group create --bucket=buck8 \
--group-id=buck8-default --status=enabled
[us-east] $ radosgw-admin sync group pipe create --bucket=buck8 \
--group-id=buck8-default --pipe-id=pipe-prefix \
--prefix=foo/ --source-zones='*' --dest-zones='*' \
--dest-bucket=buck9
Also sync from buck8 to buck9 any object that has the tags color=blue
or color=red
[us-east] $ radosgw-admin sync group pipe create --bucket=buck8 \
--group-id=buck8-default --pipe-id=pipe-tags \
--tags-add=color=blue,color=red --source-zones='*' \
--dest-zones='*' --dest-bucket=buck9
And we can check the expected sync in us-east (for example):
[us-east] $ radosgw-admin sync info --bucket=buck8
{
"sources": [],
"dests": [
{
"id": "pipe-prefix",
"source": {
"zone": "us-east",
"bucket": "buck8:115b12b3-....14433.5"
},
"dest": {
"zone": "us-west",
"bucket": "buck9"
},
"params": {
"source": {
"filter": {
"prefix": "foo/",
"tags": []
}
},
...
}
},
{
"id": "pipe-tags",
"source": {
"zone": "us-east",
"bucket": "buck8:115b12b3-....14433.5"
},
"dest": {
"zone": "us-west",
"bucket": "buck9"
},
"params": {
"source": {
"filter": {
"tags": [
{
"key": "color",
"value": "blue"
},
{
"key": "color",
"value": "red"
}
]
}
},
...
}
}
],
...
}
Note that there aren't any sources, only two different destinations
(one for each configuration). When the sync process runs it will
select the relevant rule for each object it syncs.
Prefixes and tags can be combined, in which case an object will need
to match both in order to be synced. A priority param can also be
passed; when multiple different rules match (with the same source and
destination), it determines which of the rules is used.
* Example 8: destination params: storage class
Storage class of the destination objects can be configured:
[us-east] $ radosgw-admin sync group create --bucket=buck10 \
--group-id=buck10-default --status=enabled
[us-east] $ radosgw-admin sync group pipe create --bucket=buck10 \
--group-id=buck10-default \
--pipe-id=pipe-storage-class \
--source-zones='*' --dest-zones=us-west-2 \
--storage-class=CHEAP_AND_SLOW
* Example 9: destination params: destination owner translation
Set the destination objects owner as the destination bucket owner.
This requires specifying the uid of the destination bucket:
[us-east] $ radosgw-admin sync group create --bucket=buck11 \
--group-id=buck11-default --status=enabled
[us-east] $ radosgw-admin sync group pipe create --bucket=buck11 \
--group-id=buck11-default --pipe-id=pipe-dest-owner \
--source-zones='*' --dest-zones='*' \
--dest-bucket=buck12 --dest-owner=joe
* Example 10: destination params: user mode
User mode makes sure that the user has permissions both to read the
objects and to write to the destination bucket. This requires that the
uid of the user (in whose context the operation executes) is
specified.
[us-east] $ radosgw-admin sync group pipe modify --bucket=buck11 \
--group-id=buck11-default --pipe-id=pipe-dest-owner \
--mode=user --uid=jenny
Please let me know if you have any questions. This might be tweaked a
little bit, and there are a couple of additions that I would like to
make, but at the moment that's where things stand.
Yehuda
---------- Forwarded message ---------
From: Brad Hubbard <bhubbard(a)redhat.com>
Date: Tue, Jan 28, 2020 at 1:30 AM
Subject: Re: Luminous 12.2.13 QE Validation status
To: Yuri Weinstein <yweinste(a)redhat.com>
On Tue, Jan 28, 2020 at 12:49 AM Yuri Weinstein <yweinste(a)redhat.com> wrote:
>
> UPDATE:
>
> rados - Neha approve?
> rgw - Casey approved
> rbd - Jason approve?
> krbd - Ilya approved
> fs - Patrick approve?
> kcephfs - Patrick approve?
> multimds - Patrick approve?
> ceph-deploy - Sage approve?
> ceph-disk - Nathan was looking?
> upgrade/client-upgrade-hammer (luminous) - Sage approve?
> upgrade/client-upgrade-jewel (luminous) - Sage approve?
> upgrade/luminous-p2p - Sage approve?
> upgrade/jewel-x (luminous) - Sage approve?
> upgrade/kraken-x (luminous) - REMOVED/ N/A
> powercycle - still running
> ceph-ansible - Bred FYI and ?
This will require the following teuthology trackers to be resolved
(I've submitted PRs for all of them).
https://tracker.ceph.com/issues/43798
https://tracker.ceph.com/issues/43799
https://tracker.ceph.com/issues/43843
Note that this will allow the tests to run on Mira or OVH. Luminous CA
test will never run on Smithi but then I don't imagine it would need
to run too many more times against luminous.
> upgrade/luminous-x (mimic) PASSED
> upgrade/luminous-x (nautilus) - PASSED
> ceph-volume - PASSED
>
> Dev leads, pls review/approve runs so we can release this.
>
> Thx
> YuriW
>
--
Cheers,
Brad