Re: Seastore storage structure design consideration based on HLC and R-tree

9 Jan 2020

On Wed, Jan 8, 2020 at 7:35 PM Xuehan Xu <xxhdx1985126(a)gmail.com> wrote:
...

 I added a few comments, my high level perspective is that it looks
 like an approach for dealing with multiversioned extents which might
 be a component of rados pool level point-in-time globally consistent
 snapshots for purposes like rados pool level cross-cluster
 replication.  However, that sort of thing would require a great deal
 of higher level support, so I'd consider the disk layout portion to be
 out of scope for now.  Is there another use case you are hoping to
 address with this?

 Hi, Sam. Thanks for reviewing the doc :-)

 The main focus of this initiative is efficient replication/backup.
 Specifically, we intended to give higher level modules, especially
 rbd and cephfs, the ability to take snapshots at a very high rate
 (like one snapshot every 5 seconds or even multiple snapshots within
 a second) and to do efficient snapshot diff and export-diff.
I didn't mention it explicitly, but the refcounts in the seastore doc
lba tree are intended to permit extent sharing to support the existing
snapshot machinery via clone().

...

 We thought that, with this ability, upper level applications could
 achieve near real-time replication comparable to the common op-by-op
 replication, but with less overhead, because it doesn't involve any
 extra replication-dedicated journal operations. Also, since the
 extents targeted by multiple write operations may overlap, even if
 op-by-op replication could avoid the extra journal operations, it
 would inevitably replicate overlapping extents multiple times,
 whereas in a snapshot diff export only the latest version of the
 overlapping extents needs to be replicated.
It's not clear to me how this versioning scheme changes journaling or
pg logging.  For recovery, we already track overlapping extents
between versions and use cloning appropriately.  Can you expand on
this portion?
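
For reference, the overlap argument quoted above can be illustrated
with a small sketch. This is not code from Ceph or from the design
doc; the function names and the (offset, length) writes are made up
for illustration. It only compares the bytes shipped when every write
is replicated individually with the bytes shipped when just the union
of the dirtied ranges is exported once.

```python
def op_by_op_bytes(writes):
    """Bytes shipped when every write is replicated individually."""
    return sum(length for _offset, length in writes)


def snapshot_diff_bytes(writes):
    """Bytes shipped when only the union of dirtied ranges is replicated."""
    total = 0
    cur_start = cur_end = None
    for offset, length in sorted(writes):
        end = offset + length
        if cur_end is None or offset > cur_end:
            # Disjoint from the current merged range: close it, start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = offset, end
        else:
            # Overlapping or adjacent: extend the current merged range.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Hypothetical (offset, length) writes landing between two snapshots.
writes = [(0, 4096), (2048, 4096), (0, 4096), (8192, 4096)]
print(op_by_op_bytes(writes))       # 16384
print(snapshot_diff_bytes(writes))  # 10240
```

The sketch says nothing about journaling or pg logging cost; it only
shows the data volume difference the quoted paragraph refers to.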

...

 We thought maybe we can let upper layer applications choose whether
 to replicate their data instead of doing the replication forcibly at
 the whole rados pool scale. 
The existing self-managed snapshot scheme already gives rbd image
granularity snapshots and cephfs recursive, subtree granularity
snapshots. The difference is that the versioning lives in the
hobject_t tuple -- each version is a different object with shared
extents.

...

 Whether this approach can really achieve that goal and whether to do
 it is to be discussed, as we also realised that it may not be
 cost-effective with respect to the amount of development work:-) 
I guess I'm not sure what this approach gets us that the existing
cloning scheme does not.  The main problem with high snapshot rates
currently isn't that the ondisk structure doesn't support it, but
rather that snapshot stamps are mediated through the monitor.  There
are reasons for doing it that way -- an rbd client need only issue a
single monitor command to get rid of a snapshot and all involved osds
will remove the now unnecessary clones asynchronously without
requiring the client to track them down.  Similarly, the mds needn't
find every clone within a subtree -- a potentially expensive
operation.

I think what I'm missing is how this structure fits into some higher
level snapshot scheme you are proposing.
-Sam

...

 Thanks.
