Seastore storage structure design consideration based on HLC and R-tree

Xuehan Xu

8 Jan 2020

7:41 a.m.

Hi, everyone. The following link is a thorough description of what we think may be an option for implementing the on-disk structure of seastore. Please take a look, thanks:-) https://docs.google.com/document/d/1hs5jJ1F8rVz7m2PZbBONlsiJsjRAh_DjVTrHVY3…

Xuehan Xu

8 Jan

9:47 a.m.

...

The following link is a thorough description of what we think may be

Sorry, it's a rough description, not a thorough one.

Sam Just

7:12 p.m.

I added a few comments. My high-level perspective is that this looks like an approach for dealing with multiversioned extents, which might be a component of rados pool level, point-in-time, globally consistent snapshots for purposes like rados pool level cross-cluster replication. However, that sort of thing would require a great deal of higher level support, so I'd consider the disk layout portion to be out of scope for now. Is there another use case you are hoping to address with this?

-Sam

On Wed, Jan 8, 2020 at 12:47 AM Xuehan Xu <xxhdx1985126(a)gmail.com> wrote:

Matt Benjamin

7:44 p.m.

Large intrinsic value, though. Historically, the discussion has been about the perf. and complexity trade-offs, I think.

Matt

On Wed, Jan 8, 2020 at 1:13 PM Sam Just <sjust(a)redhat.com> wrote:

_______________________________________________ Dev mailing list -- dev(a)ceph.io To unsubscribe send an email to dev-leave(a)ceph.io

-- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-821-5101 fax. 734-769-8938 cel. 734-216-5309

Xuehan Xu

9 Jan

4:42 a.m.

On Thu, 9 Jan 2020 at 02:45, Matt Benjamin <mbenjami(a)redhat.com> wrote:

...

Historically, and the discussion has been about the perf. and complexity trade-offs, I think.

Thanks for reviewing, Matt, I totally agree:-)

Brett Niver

4:08 p.m.

I played with vectorial clocks 10-15 years ago. I think in 4) you have a typo: "casual" should be "causal", right? I really like Lamport clocks conceptually. I personally found that implementations based on that concept had more corner cases than I liked, but that could have been just me.

What are your time sync requirements with respect to NTP? Aren't there better algorithms available today?

On Wed, Jan 8, 2020 at 10:43 PM Xuehan Xu <xxhdx1985126(a)gmail.com> wrote:

...

Ronen Friedman

6:36 p.m.

Just a note re time sync standards: in my previous job, we had been playing with the TSN standards, trying to achieve synchronization in the tens-of-microseconds realm. The specific TSN standard (802.1AS) is a specific PTP profile (more or less 1588 v3, if I remember correctly). To achieve sub-millisecond accuracy one would ideally have hardware timestamping, but we did achieve very good results with just software timestamping.

Ronen

On Thu, Jan 9, 2020 at 5:09 PM Brett Niver <bniver(a)redhat.com> wrote:

...

Ronen Friedman

6:42 p.m.

...1588v2..

On Thu, Jan 9, 2020 at 7:36 PM Ronen Friedman <rfriedma(a)redhat.com> wrote:

...

Xuehan Xu

4:35 a.m.

...

Hi, Sam. Thanks for reviewing the doc:-) The main focus of this initiative is efficient replication/backup. Specifically, we intend to give higher-level modules, especially rbd and cephfs, the ability to take snapshots at a very high rate (one snapshot every 5 seconds, or even multiple snapshots within a second) and to do efficient snapshot diff and export-diff. We thought that, with this ability, upper-level applications could achieve near-real-time replication comparable to the common op-by-op replication but with less overhead, because it doesn't involve any extra replication-dedicated journal operations. Also, because the target extents of multiple write operations may overlap, op-by-op replication, even if it could avoid the extra journal operations, inevitably replicates overlapped extents multiple times, whereas a snapshot diff export only needs to replicate the latest version of the overlapped extents. We also thought we could let upper-layer applications choose whether to replicate their data, instead of forcing replication at the scale of the whole rados pool. Whether this approach can really achieve that goal, and whether to do it at all, is still to be discussed, as we also realised that it may not be cost-effective with respect to the amount of development work:-) Thanks.
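
To make the overlap argument concrete, here is a minimal sketch in Python. It is only an illustration, not anything from the seastore doc: the function names and the sample writes are made up, and extents are assumed to be plain (offset, length) pairs on a single object. It compares the bytes shipped when every write is replicated with the bytes shipped when only the merged, latest extents are exported.

```python
def merge_extents(extents):
    """Coalesce (offset, length) extents into sorted, disjoint extents."""
    merged = []
    for off, length in sorted(extents):
        end = off + length
        if merged and off <= merged[-1][1]:
            # Overlaps or touches the previous extent: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([off, end])
    return [(start, end - start) for start, end in merged]


def op_by_op_bytes(writes):
    """Bytes shipped if every write op is replicated individually."""
    return sum(length for _, length in writes)


def snapshot_diff_bytes(writes):
    """Bytes shipped if only the latest version of each extent is exported."""
    return sum(length for _, length in merge_extents(writes))


# Hypothetical writes between two snapshots: (offset, length) in bytes.
writes = [(0, 4096), (2048, 4096), (0, 4096), (8192, 4096)]
op_total = op_by_op_bytes(writes)         # 16384
diff_total = snapshot_diff_bytes(writes)  # 10240
```

With these sample writes, op-by-op replication ships 16384 bytes while the diff export ships 10240, because the three overlapping writes collapse into one 6144-byte extent.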

Sam Just

6:59 p.m.

On Wed, Jan 8, 2020 at 7:35 PM Xuehan Xu <xxhdx1985126(a)gmail.com> wrote:

...

I didn't mention it explicitly, but the refcounts in the seastore doc lba tree are intended to permit extent sharing to support the existing snapshot machinery via clone().

...

We thought, with this ability, upper level applications can achieve near real-time replication that can be compared to the common op-by-op replication, but with less overhead. Because it doesn't involve any extra replication-dedicated journal operations. And as multiple write operations' targeting extents may overlap with each other, even the op-by-op replication can also avoid extra journal operations, they inevitably replicate overlapped extents multiple times, while, in snapshot diff export, only the latest version of the overlapped extents need to be replicated.

It's not clear to me how this versioning scheme changes journaling or pg logging. For recovery, we already track overlapping extents between versions and use cloning appropriately. Can you expand on this portion?

...

We thought maybe we can let upper layer applications to choose whether to replicate their data instead of doing the replication forcibly at the whole rados pool scale.

The existing self-managed snapshot scheme already gives rbd image granularity snapshots and cephfs recursive, subtree granularity snapshots. The difference is that the versioning lives in the hobject_t tuple -- each version is a different object with shared extents.

...

Whether this approach can really achieve that goal and whether to do it is to be discussed, as we also realised that it may not be cost-effective with respect to the amount of development work:-)

I guess I'm not sure what this approach gets us that the existing cloning scheme does not. The main problem with high snapshot rates currently isn't that the on-disk structure doesn't support them, but rather that snapshot stamps are mediated through the monitor. There are reasons for doing it that way -- an rbd client need only issue a single monitor command to get rid of a snapshot, and all involved osds will remove the now unnecessary clones asynchronously without requiring the client to track them down. Similarly, the mds needn't find every clone within a subtree -- a potentially expensive operation. I think what I'm missing is how this structure fits into some higher level snapshot scheme you are proposing.

-Sam

...

Thanks.

Xuehan Xu

10 Jan

11:42 a.m.

On Fri, 10 Jan 2020 at 02:00, Sam Just <sjust(a)redhat.com> wrote:

...

On Wed, Jan 8, 2020 at 7:35 PM Xuehan Xu <xxhdx1985126(a)gmail.com> wrote:

I didn't mention it explicitly, but the refcounts in the seastore doc lba tree are intended to permit extent sharing to support the existing snapshot machinery via clone().

Oh, sorry for causing this misunderstanding. By op-by-op replication, I meant mechanisms that, to do cross-cluster replication, perform an extra journaling operation for every rbd image write, like rbd mirroring. The upper-layer applications do this because RADOS doesn't provide cross-cluster replication, which leaves them no other choice. And even if we implemented op-by-op cross-cluster replication within RADOS, there would still be some drawbacks: because the target extents of multiple write operations can overlap, op-by-op replication replicates every one of those write ops, many of which may already be outdated, whereas cross-cluster replication based on lightweight snapshots doesn't have this overhead.

...

We thought maybe we can let upper layer applications to choose whether to replicate their data instead of doing the replication forcibly at the whole rados pool scale.

Yes, but, if I'm understanding correctly, in the current snapshot mechanism we have to create clone objects, copy the attrs and omaps of the cloned objects, store them on disk, and calculate and record clone_overlaps for every write to a snapshotted object. And in the upper-layer applications, snapshot-creating clients need to request a snapshot id from the MONs and cooperate with other clients that are writing to the same set of objects to finish the snapshot creation. All of this may have a non-negligible impact on the performance of normal read/write operations when taking snapshots at a high rate.

On the other hand, HLC gives a system a way to query any past system state by physical clock, as long as that state is still remembered, and it promises point-in-time consistency within the scope of the system, just as a Lamport clock does. So if write ops are tagged with an HLC timestamp and recorded somewhere (in the journal, for example), they are already snapshots, and all we need to do when creating a snapshot is remember which snapshot is needed and tell the OSDs not to clear it, right? So, for upper-layer applications to take snapshots, they just need to record the time at which they want the snapshot and tell the OSDs not to clear the corresponding write ops. There's no need to cooperate with other clients through some kind of mutual locking, or to request snapshot ids. With an R-tree, we can do a range search to easily read any data as of any time out of the journal, or to calculate a snapshot diff, so there's no need for the object clone work. So, generally speaking, with HLC and R-tree, the system doesn't need to do any work that affects normal R/W performance when doing snapshot-related work, which makes the snapshots really lightweight.

And as a side effect, since we can read data out of the journal with the help of the R-tree, there's no need for OSDs to flush dirty data blocks to the underlying disk, because that data is already recorded in the journal, which I think could simplify the design of OSDs.
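
A rough sketch of the idea in Python, under several assumptions that are not part of the proposal itself: the HLC class follows the usual published hybrid logical clock update rules, the Journal class and its method names are hypothetical, the physical clock is an injected callable returning integers, and a linear scan over the journal stands in for the R-tree range search over (offset, time).

```python
class HLC:
    """Hybrid logical clock: (l, c) pairs compared lexicographically."""

    def __init__(self, physical_clock):
        self.pt = physical_clock  # callable returning integer physical time
        self.l = 0                # highest physical time seen so far
        self.c = 0                # logical counter breaking ties within l

    def now(self):
        """Timestamp a local or send event."""
        prev = self.l
        self.l = max(prev, self.pt())
        self.c = self.c + 1 if self.l == prev else 0
        return (self.l, self.c)

    def update(self, remote):
        """Merge a timestamp received from another node."""
        rl, rc = remote
        prev = self.l
        self.l = max(prev, rl, self.pt())
        if self.l == prev and self.l == rl:
            self.c = max(self.c, rc) + 1
        elif self.l == prev:
            self.c += 1
        elif self.l == rl:
            self.c = rc + 1
        else:
            self.c = 0
        return (self.l, self.c)


class Journal:
    """Hypothetical journal of HLC-tagged writes to a single object."""

    def __init__(self):
        self.records = []  # (hlc_timestamp, offset, data)

    def write(self, ts, offset, data):
        self.records.append((ts, offset, data))

    def read_at(self, ts, offset, length):
        """Return the bytes in [offset, offset + length) as of timestamp ts.

        A linear scan stands in for the R-tree range search here.
        Unwritten bytes read back as zeros.
        """
        buf = bytearray(length)
        for rec_ts, rec_off, data in sorted(self.records, key=lambda r: r[0]):
            if rec_ts > ts:
                break
            start = max(offset, rec_off)
            end = min(offset + length, rec_off + len(data))
            if start < end:
                buf[start - offset:end - offset] = data[start - rec_off:end - rec_off]
        return bytes(buf)


# Physical clock that returns 10, 10, 12 on successive calls.
ticks = iter([10, 10, 12])
clock = HLC(lambda: next(ticks))
journal = Journal()

t1 = clock.now()
journal.write(t1, 0, b"aaaa")
t2 = clock.now()
journal.write(t2, 2, b"bb")
t3 = clock.now()
journal.write(t3, 0, b"cc")

as_of_t1 = journal.read_at(t1, 0, 4)
as_of_t2 = journal.read_at(t2, 0, 4)
as_of_t3 = journal.read_at(t3, 0, 4)
```

In this sketch, "taking a snapshot" at t2 amounts to remembering t2 and not garbage-collecting the records needed to answer read_at(t2, ...). What it leaves out is exactly the hard part: agreeing across OSDs on which timestamps to retain, and reclaiming obsolete records.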

...

Whether this approach can really achieve that goal and whether to do it is to be discussed, as we also realised that it may not be cost-effective with respect to the amount of development work:-)

Um... Actually, I didn't think much about what the higher-level snapshot mechanism should look like in order to keep the advantages of both the current snapshot mechanism and the OSD structure I'm proposing, because I thought the higher-level snapshot mechanism would be relatively easy once we had a clear in-OSD snapshot mechanism, which has obviously been proved wrong now.... I think I can take some time to try to figure out one such higher-level snapshot mechanism:-) Thanks.

Sam Just

8:10 p.m.

On Fri, Jan 10, 2020 at 2:43 AM Xuehan Xu <xxhdx1985126(a)gmail.com> wrote:

...

On Fri, 10 Jan 2020 at 02:00, Sam Just <sjust(a)redhat.com> wrote:

On Wed, Jan 8, 2020 at 7:35 PM Xuehan Xu <xxhdx1985126(a)gmail.com> wrote:

I didn't mention it explicitly, but the refcounts in the seastore doc lba tree are intended to permit extent sharing to support the existing snapshot machinery via clone().

The op-by-op scheme has a really specific advantage: interrupting the replication at any point results in a point-in-time consistent state, which I believe is why rbd uses it for maintaining near-realtime replication. RBD does have a different replication mode based on looking at deltas between snapshots, which does simply ship only the new version of the changed extents, right? Something like this might well be a useful primitive for doing cross-cluster rados level replication, but I'd want to start with a sketch of how it fits into a larger replication scheme and work down to a disk format rather than going in the other direction.

...

We thought maybe we can let upper layer applications to choose whether to replicate their data instead of doing the replication forcibly at the whole rados pool scale.

The "create clone objects" etc. is just osd record keeping. You'd have similar overheads once you attempt to build cross-cluster replication on top of this scheme, since I think you need to remember a lot of the same stuff. At the least, all osds need to agree on which versions they are choosing not to gc and need an efficient index for reclaiming obsolete extents. Moreover, needing to keep around prior versions until async gc clears them constrains online segment cleaning and would require additional space to be kept free. You also need to maintain the R-tree, which increases metadata overhead in both space and write amplification. Recovery would need a way to query and duplicate the versioned state of an object across osds during backfill and log-based recovery, just as it currently needs the clone_overlap metadata to preserve sharing.

...

On the other hand, HLC gives systems a method to query any system state in the past by physical clock as long as those states are remembered and it promises point-in-time consistency within the scope of the system just as what Lamport clock promises. So if write ops is tagged with a HLC timestamp and recorded somewhere (in the journal, for example), they are already snapshots, and all we need to do when creating a snapshot is just remembering which snapshot is needed and tell OSDs not to clear that snapshot, right? So, for the upper layer applications to do snapshots, they just need to record the time at which they want to do the snapshot and tell OSDs not to clear the corresponding write ops. There's no need to cooperate with other clients through some kind of mutual locking or request snapshot ids. With R-tree, we can a range search to easily read any data at any time out of the journal or calculate a snapshot diff, so there's no need to do those objects clone work. So, generally speaking, with HLC and R-tree, the system don't need to do any normal R/W performance influencing job when doing snapshot related work, which makes the snapshots really lightweight. And as a side-effect, since we can read data out of journal with the help of r-tree, there's no need for OSDs to flush dirty data blocks to the underlying disk because those data are already recorded in the journal, which I think could simplify the design of OSDs.

What I'm not really sure of is why we need to track arbitrary prior versions. Normally, schemes like this are handy because if you track N minutes of prior updates, you can consistently respond to arbitrary read queries with time stamps up to N minutes in the past permitting efficient read snapshot isolation across partitions without explicit snapshots. However, if we're going to tell osds up front what snapshots to remember and we aren't serving transactional reads, they need only take note of when they are crossing such a snapshot point and explicitly remember a snapshot of any such objects -- as we already do with the self-managed snapshot mechanism. We could likely do that with the current snapshot machinery if we change the way snapids are generated.
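
As an illustration of "take note of when they are crossing such a snapshot point", here is a small Python sketch. It is not how the self-managed snapshot code works: the class, its fields, and the use of plain integer stamps are all assumptions. The only point is that when the snapshot stamps are known up front, a write needs to preserve the prior contents only if it is the first write after a stamp.

```python
import bisect


class VersionedObject:
    """Hypothetical object that clones its head when a write crosses a snapshot stamp."""

    def __init__(self):
        self.head = b""
        self.last_write = None  # stamp of the most recent write, if any
        self.clones = {}        # snapshot stamp -> contents preserved for it

    def write(self, ts, data, snap_points):
        """Apply a full-object write at stamp ts.

        snap_points is the sorted list of stamps the OSD was told to remember.
        A snapshot at stamp s is taken to include every write with stamp <= s.
        """
        if self.last_write is not None:
            # Stamps s with last_write <= s < ts are crossed by this write.
            lo = bisect.bisect_left(snap_points, self.last_write)
            hi = bisect.bisect_left(snap_points, ts)
            for s in snap_points[lo:hi]:
                self.clones[s] = self.head
        self.head = data
        self.last_write = ts


snaps = [10, 20]
obj = VersionedObject()
obj.write(5, b"v1", snaps)
obj.write(8, b"v2", snaps)   # no stamp in [5, 8): nothing preserved
obj.write(25, b"v3", snaps)  # crosses 10 and 20: b"v2" preserved for both
obj.write(30, b"v4", snaps)  # no stamp in [25, 30): nothing preserved
```

Only one of the four writes does any snapshot bookkeeping, and nothing has to be retained for stamps nobody asked for.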

...

Whether this approach can really achieve that goal and whether to do it is to be discussed, as we also realised that it may not be cost-effective with respect to the amount of development work:-)

I think something like this does have the potential to both simplify the existing snapshot machinery within rados and permit interesting cross-cluster replication capabilities. I think the next step would be to sketch out a rough, high-level view of a scheme, up through the osd and clients, which can accomplish that, and continue discussing from there.

-Sam

participants (5)

Brett Niver
Matt Benjamin
Ronen Friedman
Sam Just
Xuehan Xu