Re: Seastore storage structure design consideration based on HLC and R-tree

10 Jan 2020

On Fri, Jan 10, 2020 at 2:43 AM Xuehan Xu <xxhdx1985126(a)gmail.com> wrote:
...

 On Fri, 10 Jan 2020 at 02:00, Sam Just <sjust(a)redhat.com> wrote:

 On Wed, Jan 8, 2020 at 7:35 PM Xuehan Xu <xxhdx1985126(a)gmail.com> wrote:

  I added a few comments. My high level perspective is that it looks
 like an approach for dealing with multiversioned extents which might
 be a component of rados pool level point-in-time globally consistent
 snapshots for purposes like rados pool level cross-cluster
 replication.  However, that sort of thing would require a great deal
 of higher level support, so I'd consider the disk layout portion to be
 out of scope for now.  Is there another use case you are hoping to
 address with this? 
 Hi, sam. Thanks for reviewing the doc:-)

 The main focus of this initiative is about doing efficient
 replication/backup. Specifically, we intended to provide higher level
 modules, especially to rbd and cephfs, the ability to do very high
 rate snapshots (like one snapshot every 5 seconds or even multiple
 snapshots within a second) and efficient snapshot diff and
 export-diff. 
 I didn't mention it explicitly, but the refcounts in the seastore doc
 lba tree are intended to permit extent sharing to support the existing
 snapshot machinery via clone().

 We thought, with this ability, upper level applications can achieve
 near real-time replication that can be compared to the common op-by-op
 replication, but with less overhead, because it doesn't involve any
 extra replication-dedicated journal operations. And as the target
 extents of multiple write operations may overlap with each other, even
 if op-by-op replication could avoid the extra journal operations, it
 would inevitably replicate overlapped extents multiple times, while in
 a snapshot diff export only the latest version of the overlapped
 extents needs to be replicated. 
 It's not clear to me how this versioning scheme changes journaling or
 pg logging.  For recovery, we already track overlapping extents
 between versions and use cloning appropriately.  Can you expand on
 this portion? 
 Oh, sorry that I caused this misunderstanding. By op-by-op
 replication, I meant mechanisms which, to do cross cluster
 replication, do an extra journaling operation for every rbd image
 write operation, like rbd mirroring. The upper layer applications did
 this because RADOS doesn't provide cross cluster replication, which
 leaves them no other choice. And even if we implement op-by-op cross
 cluster replication within RADOS, there still seems to be some
 drawbacks, because multiple write operations' target extents could
 overlap with each other and op-by-op replication will replicate each
 one of those write ops, many of which may already be outdated, while
 on the contrary, lightweight snapshots based cross cluster replication
 won't have these overheads. 
The op-by-op scheme has a really specific advantage: interrupting
replication at any point results in a point-in-time consistent
state which I believe is why rbd uses it for maintaining near-realtime
replication.  RBD does have a different replication mode based on
looking at deltas between snapshots which does simply ship only the
new version of the changed extents, right?  Something like this might
well be a useful primitive for doing cross-cluster rados level
replication, but I'd want to start with a sketch of how it fits into a
larger replication scheme and work down to a disk format rather than
going in the other direction.
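
To make the overlap argument concrete, here is a minimal sketch comparing
the bytes shipped by op-by-op replication with the bytes shipped by a
snapshot-delta export over the same writes. The function names and the
sample writes are hypothetical and are not Ceph code; writes are modeled
as plain (offset, length) pairs.

```python
def op_by_op_bytes(writes):
    """Op-by-op replication ships every write, overlapping or not."""
    return sum(length for _, length in writes)


def snapshot_diff_extents(writes):
    """Merge overlapping or adjacent writes into disjoint extents.

    A delta between two snapshots only needs the newest contents of each
    changed byte range, so overlapping writes collapse into one extent.
    """
    merged = []
    for off, length in sorted(writes):
        end = off + length
        if merged and off <= merged[-1][1]:
            # Overlaps (or touches) the previous extent: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([off, end])
    return [(start, end - start) for start, end in merged]


# Hypothetical workload: three writes hit the same region, one is separate.
writes = [(0, 4096), (2048, 4096), (0, 4096), (8192, 4096)]
diff = snapshot_diff_extents(writes)
print(op_by_op_bytes(writes))           # 16384
print(diff)                             # [(0, 6144), (8192, 4096)]
print(sum(length for _, length in diff))  # 10240
```

The sketch says nothing about consistency on interruption, which is the
advantage of the op-by-op scheme noted above.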

...

 We thought maybe we can let upper layer applications choose whether
 to replicate their data instead of doing the replication forcibly at
 the whole rados pool scale. 
 The existing self-managed snapshot scheme already gives rbd image
 granularity snapshots and cephfs recursive, subtree granularity
 snapshots. The difference is that the versioning lives in the
 hobject_t tuple -- each version is a different object with shared
 extents. 
 Yes, but, if I'm understanding correctly, in the current snapshot
 mechanism, we have to create clone objects, copy attrs and omaps of
 the cloned ones, store them on the disk and calculate and record
 clone_overlaps for every write when doing writes on snapshotted
 objects. And in the upper layer applications, snapshot-creating
 clients need to request a snapshot id from MONs and cooperate with
 other clients who are writing to the same set of objects to finish the
 snapshot creation. All of these may produce non-negligible impact on
 the performance of normal read/write operations when doing high rate
 snapshots. 
The "create clone objects" etc is just osd record keeping.  You'd have
similar overheads once you attempt to build cross-cluster replication
on top of this scheme since I think you need to remember a lot of the
same stuff.  At the least, all osds need to agree on which versions
they are choosing not to gc and need an efficient index for reclaiming
obsolete extents.  Moreover, needing to keep around prior versions
until async gc clears them constrains online segment cleaning and
would require additional space to be kept free.  You also need to
maintain the R tree, which increases metadata overhead in both space
and write amp.  Recovery would need a way to query and duplicate the
versioned state of an object across osds during backfill and log-based
recovery just as it currently needs the clone_overlap metadata to
preserve sharing.
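
As a rough illustration of the gc bookkeeping described above, the sketch
below decides which versions of one extent may be reclaimed given the set
of snapshot timestamps that all osds have agreed to retain. The function
name, the integer timestamps, and the rule that the newest version is
always kept as the live head are assumptions made for the example.

```python
from bisect import bisect_right


def reclaimable_versions(version_stamps, retained_snapshots):
    """Return the version stamps that no retained snapshot can read.

    A snapshot taken at time s reads the newest version with stamp <= s,
    so only that version has to survive for each retained snapshot.
    """
    stamps = sorted(version_stamps)
    if not stamps:
        return []
    keep = {stamps[-1]}  # assumption: the newest version is the live head
    for snap in retained_snapshots:
        i = bisect_right(stamps, snap)
        if i:  # i == 0 means the snapshot predates every version
            keep.add(stamps[i - 1])
    return [s for s in stamps if s not in keep]


# Versions written at 10..50, snapshots retained at 25 and 45.
print(reclaimable_versions([10, 20, 30, 40, 50], [25, 45]))  # [10, 30]
```

Every osd would have to evaluate this against the same retained set,
which is the agreement problem mentioned above.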

...

 On the other hand, HLC gives systems a method to query any system
 state in the past by physical clock as long as those states are
 remembered and it promises point-in-time consistency within the scope
 of the system just as what Lamport clock promises. So if write ops are
 tagged with a HLC timestamp and recorded somewhere (in the journal,
 for example), they are already snapshots, and all we need to do when
 creating a snapshot is just remember which snapshot is needed and
 tell OSDs not to clear that snapshot, right? So, for the upper layer
 applications to do snapshots, they just need to record the time at
 which they want to do the snapshot and tell OSDs not to clear the
 corresponding write ops. There's no need to cooperate with other
 clients through some kind of mutual locking or request snapshot ids.
 With R-tree, we can do a range search to easily read any data at any time
 out of the journal or calculate a snapshot diff, so there's no need to
 do that object cloning work. So, generally speaking, with HLC and
 R-tree, the system doesn't need to do any work that affects normal
 R/W performance when doing snapshot related work, which makes the
 snapshots really lightweight. And as a side-effect, since we can read
 data out of journal with the help of r-tree, there's no need for OSDs
 to flush dirty data blocks to the underlying disk because those data
 are already recorded in the journal, which I think could simplify the
 design of OSDs. 
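
For readers unfamiliar with HLC, a minimal sketch of a hybrid logical
clock follows. It uses the usual update rules (a physical component l
plus a logical counter c); the class name, method names, and the
injected physical clock are assumptions for illustration and are not
taken from the design doc or from Ceph.

```python
class HLC:
    """Hybrid logical clock: timestamps are (l, c) tuples that compare
    in causal order while l stays close to physical time."""

    def __init__(self, physical_clock):
        self.now = physical_clock  # callable returning an integer time
        self.l = 0
        self.c = 0

    def tick(self):
        """Stamp a local event or an outgoing message."""
        prev = self.l
        self.l = max(prev, self.now())
        self.c = self.c + 1 if self.l == prev else 0
        return (self.l, self.c)

    def recv(self, remote):
        """Merge the timestamp carried by an incoming message."""
        rl, rc = remote
        prev = self.l
        self.l = max(prev, rl, self.now())
        if self.l == prev and self.l == rl:
            self.c = max(self.c, rc) + 1
        elif self.l == prev:
            self.c += 1
        elif self.l == rl:
            self.c = rc + 1
        else:
            self.c = 0
        return (self.l, self.c)


clock = HLC(lambda: 100)        # frozen physical clock for the example
print(clock.tick())             # (100, 0)
print(clock.tick())             # (100, 1)
print(clock.recv((105, 3)))     # (105, 4)
print(clock.tick())             # (105, 5)
```

Under this model a snapshot is just a chosen (l, c) value: a write
belongs to the snapshot if its stamp compares less than or equal to it.
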
What I'm not really sure of is why we need to track arbitrary prior
versions.  Normally, schemes like this are handy because if you track
N minutes of prior updates, you can consistently respond to arbitrary
read queries with time stamps up to N minutes in the past permitting
efficient read snapshot isolation across partitions without explicit
snapshots.
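
A minimal sketch of such a timestamped read, assuming every write is
kept with its stamp. A linear scan stands in for the R-tree range
search over (offset, time) here, and plain integers stand in for HLC
stamps; the class and method names are hypothetical.

```python
class VersionedExtents:
    """Keeps every write of one object so reads can be served as of a
    past timestamp."""

    def __init__(self):
        self.records = []  # (offset, stamp, data)

    def write(self, offset, data, stamp):
        self.records.append((offset, stamp, data))

    def read(self, offset, length, as_of):
        """Return the bytes in [offset, offset + length) as of `as_of`.
        Ranges never written before that time read as zeros."""
        buf = bytearray(length)
        # Apply writes oldest first so newer data overwrites older data.
        for off, stamp, data in sorted(self.records, key=lambda r: r[1]):
            if stamp > as_of:
                continue
            lo = max(off, offset)
            hi = min(off + len(data), offset + length)
            if lo < hi:
                buf[lo - offset:hi - offset] = data[lo - off:hi - off]
        return bytes(buf)


store = VersionedExtents()
store.write(0, b"aaaa", stamp=10)
store.write(2, b"bb", stamp=20)
print(store.read(0, 4, as_of=15))  # b'aaaa'
print(store.read(0, 4, as_of=25))  # b'aabb'
```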

However, if we're going to tell osds up front what snapshots to
remember and we aren't serving transactional reads, they need only
take note of when they are crossing such a snapshot point and
explicitly remember a snapshot of any such objects -- as we already do
with the self-managed snapshot mechanism.  We could likely do that
with the current snapshot machinery if we change the way snapids are
generated.
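
The "take note when crossing a snapshot point" idea can be sketched as
follows. This is only an illustration with hypothetical names: the
object preserves its previous contents the first time a write lands
after a snapshot point it has not yet crossed, and does nothing extra
otherwise. Whole-object contents stand in for extents.

```python
class SnapObject:
    """Object that clones its head only when a write crosses a
    snapshot point announced up front."""

    def __init__(self, data=b""):
        self.head = data
        self.last_write = 0
        self.clones = {}  # snapshot stamp -> preserved contents

    def write(self, data, stamp, snap_points):
        # Snapshot points that fall between the previous write and this one.
        crossed = [s for s in snap_points if self.last_write <= s < stamp]
        if crossed:
            # One clone serves every crossed point; key it by the newest.
            self.clones[max(crossed)] = self.head
        self.head = data
        self.last_write = stamp


snaps = [15, 35]
obj = SnapObject(b"v0")
for stamp, data in [(10, b"v1"), (20, b"v2"), (30, b"v3"), (40, b"v4")]:
    obj.write(data, stamp, snaps)
print(obj.clones)  # {15: b'v1', 35: b'v3'}
print(obj.head)    # b'v4'
```

Only two of the four writes pay the cloning cost, one per snapshot
point crossed.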

...

 Whether this approach can really achieve that goal and whether to do
 it is to be discussed, as we also realised that it may not be
 cost-effective with respect to the amount of development work:-) 
 I guess I'm not sure what this approach gets us that the existing
 cloning scheme does not.  The main problem with high snapshot rates
 currently isn't that the ondisk structure doesn't support it, but
 rather that snapshot stamps are mediated through the monitor.  There
 are reasons for doing it that way -- an rbd client need only issue a
 single monitor command to get rid of a snapshot and all involved osds
 will remove the now unnecessary clones asynchronously without
 requiring the client to track them down.  Similarly, the mds needn't
 find every clone within a subtree -- a potentially expensive
 operation.

 I think what I'm missing is how this structure fits into some higher
 level snapshot scheme you are proposing. 
 Um...Actually I didn't think much about how the higher level
 snapshot mechanism should be to keep the advantages of both the
 current snapshot mechanism and the OSD structure I'm proposing.
 Because I thought the higher level snapshot mechanism would be
 relatively easy once we have a clear in-OSD snapshot mechanism, which
 has obviously been proved wrong now....

 I think I can take some time to try to figure out one such higher
 level snapshot mechanism:-) 
I think something like this does have the potential to both simplify
the existing snapshot machinery within rados and permit interesting
cross-cluster replication capabilities.  I think the next step would
be to sketch out a high-level view of a scheme up through the osd
and clients which can accomplish that and continue discussing from
there.
-Sam

...

 Thanks.

  -Sam

 Thanks.
