rgw: draft of multisite resharding design for review

List overview All Threads
Download

newer

older

CI in cpeh

Re: [ceph-users] Re: Q release name

Casey Bodley

25 Feb 2020 25 Feb '20

8:53 p.m.

I opened a pull request https://github.com/ceph/ceph/pull/33539 with a design doc for dynamic resharding in multisite. Please review and give feedback, either here or in PR comments!

Show replies by date

Casey Bodley

11 Mar 11 Mar

4:53 p.m.

Well shoot. It turns out that some critical pieces of this design were based on a flawed assumption about the datalog; namely that all changes for a particular bucket would map to the same datalog shard. I now see that writes to different shards of a bucket are mapped to different shards in RGWDataChangesLog::choose_oid(). We use cls_locks on these datalog shards so that several gateways can sync the shards independently. But this isn't compatible with several elements of this resharding design that require bucket-wide coordination: * changing bucket full sync from per-bucket-shard to per-bucket * moving bucket sync status from per-bucket-shard to per-bucket * waiting for incremental sync of each bucket shard to complete before advancing to the next log generation On 2/25/20 2:53 PM, Casey Bodley wrote:

...

I opened a pull request https://github.com/ceph/ceph/pull/33539 with a design doc for dynamic resharding in multisite. Please review and give feedback, either here or in PR comments!

Casey Bodley

12 Mar 12 Mar

8:15 p.m.

One option would be to change the mapping in choose_oid() so that it doesn't spread bucket shards. But doing this in a way that's backward-compatible with existing datalogs would be complicated and probably require a lot of extra io (for example, writing all datalog entries on the 'wrong' shard to the error repo of the new shard so they get retried there). This spreading is an important scaling property, allowing gateways to share and parallelize a sync workload that's clustered around a small number of buckets. So I think the better option is to minimize this bucket-wide coordination, and provide the locking/atomicity necessary to do it across datalog shards. The common case (incremental sync of a bucket shard) should require no synchronization, so we should continue to track that incremental sync status in per-bucket-shard objects instead of combining them into a single per-bucket object. We do still need some per-bucket status to track the full sync progress and the current incremental sync generation. If the bucket status is in full sync, we can use a cls_lock on that bucket status object to allow one shard to run the full sync, while shards that fail to get the lock will go to the error repo for retry. Once full sync completes, shards can go directly to incremental sync without a lock. To coordinate the transitions across generations (reshard events), we need some new logic when incremental sync on one shard reaches the end of its log. Instead of cls_lock here, we can use cls_version to do a read-modify-write of the bucket status object. This will either add our shard to the set of completed shards for the current generation, or (if all other shards completed already) increment that generation number so we can start processing incremental sync on shards from that generation. This also modifies the handling of datalog entries for generations that are newer than the current generation in the bucket status. In the previous design draft, we would process these inline by running incremental sync on all shards of the current generation - but this would require coordination across datalog shards. So instead, each unfinished shard in the bucket status can be written to the corresponding datalog shard's error repo so it gets retried there. And once these retries are successful, they will drive the transition to the next generation and allow the original datalog entry to retry and make progress. Does this sound reasonable? I'll start preparing an update to the design doc. On 3/11/20 11:53 AM, Casey Bodley wrote:

...

Matt Benjamin

8:26 p.m.

inline On Thu, Mar 12, 2020 at 3:16 PM Casey Bodley <cbodley(a)redhat.com> wrote:

...

I agree. We know this to be important.

...

So I think the better option is to minimize this bucket-wide coordination, and provide the locking/atomicity necessary to do it across datalog shards. The common case (incremental sync of a bucket shard) should require no synchronization, so we should continue to track that incremental sync status in per-bucket-shard objects instead of combining them into a single per-bucket object. We do still need some per-bucket status to track the full sync progress and the current incremental sync generation. If the bucket status is in full sync, we can use a cls_lock on that bucket status object to allow one shard to run the full sync, while shards that fail to get the lock will go to the error repo for retry. Once full sync completes, shards can go directly to incremental sync without a lock.

Yes, this seems reasonable.

...

To coordinate the transitions across generations (reshard events), we need some new logic when incremental sync on one shard reaches the end of its log. Instead of cls_lock here, we can use cls_version to do a read-modify-write of the bucket status object. This will either add our shard to the set of completed shards for the current generation, or (if all other shards completed already) increment that generation number so we can start processing incremental sync on shards from that generation. This also modifies the handling of datalog entries for generations that are newer than the current generation in the bucket status. In the previous design draft, we would process these inline by running incremental sync on all shards of the current generation - but this would require coordination across datalog shards. So instead, each unfinished shard in the bucket status can be written to the corresponding datalog shard's error repo so it gets retried there. And once these retries are successful, they will drive the transition to the next generation and allow the original datalog entry to retry and make progress.

...

Does this sound reasonable? I'll start preparing an update to the design doc.

Sounds promising to me. Matt

...

On 3/11/20 11:53 AM, Casey Bodley wrote:

_______________________________________________ Dev mailing list -- dev(a)ceph.io To unsubscribe send an email to dev-leave(a)ceph.io

-- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-821-5101 fax. 734-769-8938 cel. 734-216-5309

Yehuda Sadeh-Weinraub

23 Mar 23 Mar

6:59 p.m.

Sorry for the late reply, this went under my radar as I was looking into similar design issues. I'm looking into reducing the complexity of dealing with separate stages for full sync and incremental sync. It seems to me that we can maybe simplify things by moving that knowledge to the source. Instead of the target behaves differently on incremental vs full sync, we can create an api that would have the source either send the bucket listing, or the bilog entries using a common result structure. The marker can be used by the source to distinguish the state and to keep other information. This will allow reducing complexities on the sync side, We can do similar stuff for the metadata sync too. Moreover, it will allow later for defining different data sources that don't have a bilog and their scheme might be different. This is just the idea in general. There are a lot of details to clear and I glossed over some of the complexities. However, what difference will such a change make to your design? Yehuda On Thu, Mar 12, 2020 at 9:16 PM Casey Bodley <cbodley(a)redhat.com> wrote:

...

_______________________________________________ Dev mailing list -- dev(a)ceph.io To unsubscribe send an email to dev-leave(a)ceph.io

Casey Bodley

7:50 p.m.

On Mon, Mar 23, 2020 at 1:59 PM Yehuda Sadeh-Weinraub <yehuda(a)redhat.com> wrote:

...

In this resharding design, full- and incremental-sync are dimensioned differently, with full sync being per-bucket and incremental per-bilog-shard. So if you want to abstract them into a single pipe, it would probably need to be one per bucket. I like the idea though. It definitely could simplify sync in the long run, but in the short term, we would still need to sync from octopus source zones. So a lot of the complexity would come from backward compatibility.

...

Yehuda On Thu, Mar 12, 2020 at 9:16 PM Casey Bodley <cbodley(a)redhat.com> wrote:

coordination:

* changing bucket full sync from per-bucket-shard to per-bucket * moving bucket sync status from per-bucket-shard to per-bucket * waiting for incremental sync of each bucket shard to complete before advancing to the next log generation On 2/25/20 2:53 PM, Casey Bodley wrote: > I opened a pull request https://github.com/ceph/ceph/pull/33539 with > a design doc for dynamic resharding in multisite. Please review and > give feedback, either here or in PR comments! >

_______________________________________________ Dev mailing list -- dev(a)ceph.io To unsubscribe send an email to dev-leave(a)ceph.io

Yehuda Sadeh-Weinraub

10:37 p.m.

On Mon, Mar 23, 2020 at 8:50 PM Casey Bodley <cbodley(a)redhat.com> wrote:

...

On Mon, Mar 23, 2020 at 1:59 PM Yehuda Sadeh-Weinraub <yehuda(a)redhat.com> wrote:

The idea that I had in mind was to have the ability for the source send information about changes in the number of active shards at each stage. For example, let's say we start with 1 active shard (for full sync), then switch to 17 (incremental), then reshard to 23: Targe -> Source: init S -> Target: { now = { num_shards = 1, markers= [ ... ], } next = { num_shards = 17, markers = [ ... ], } } S -> T: fetch T -> S: data ... S -> T: fetch T -> S: complete, next! Then we'll move to the next stage and start processing the new number of shards with the appropriate markers. When a reshard happens the target will send back a notice that now next.num_shards=17 with the corresponding markers. The next stage will only start after all the previous stage shards have been completed.

...

I like the idea though. It definitely could simplify sync in the long run, but in the short term, we would still need to sync from octopus source zones. So a lot of the complexity would come from backward compatibility.

For backward compatibility we can create a module on the target side that would return the needed information for the source if the source doesn't support the new api. Yehuda Yehuda

...

> > Yehuda > > On Thu, Mar 12, 2020 at 9:16 PM Casey Bodley <cbodley(a)redhat.com> wrote: > > > > One option would be to change the mapping in choose_oid() so that it > > doesn't spread bucket shards. But doing this in a way that's > > backward-compatible with existing datalogs would be complicated and > > probably require a lot of extra io (for example, writing all datalog > > entries on the 'wrong' shard to the error repo of the new shard so they > > get retried there). This spreading is an important scaling property, > > allowing gateways to share and parallelize a sync workload that's > > clustered around a small number of buckets. > > > > So I think the better option is to minimize this bucket-wide > > coordination, and provide the locking/atomicity necessary to do it > > across datalog shards. The common case (incremental sync of a bucket > > shard) should require no synchronization, so we should continue to track > > that incremental sync status in per-bucket-shard objects instead of > > combining them into a single per-bucket object. > > > > We do still need some per-bucket status to track the full sync progress > > and the current incremental sync generation. > > > > If the bucket status is in full sync, we can use a cls_lock on that > > bucket status object to allow one shard to run the full sync, while > > shards that fail to get the lock will go to the error repo for retry. > > Once full sync completes, shards can go directly to incremental sync > > without a lock. > > > > To coordinate the transitions across generations (reshard events), we > > need some new logic when incremental sync on one shard reaches the end > > of its log. Instead of cls_lock here, we can use cls_version to do a > > read-modify-write of the bucket status object. This will either add our > > shard to the set of completed shards for the current generation, or (if > > all other shards completed already) increment that generation number so > > we can start processing incremental sync on shards from that generation. > > > > This also modifies the handling of datalog entries for generations that > > are newer than the current generation in the bucket status. In the > > previous design draft, we would process these inline by running > > incremental sync on all shards of the current generation - but this > > would require coordination across datalog shards. So instead, each > > unfinished shard in the bucket status can be written to the > > corresponding datalog shard's error repo so it gets retried there. And > > once these retries are successful, they will drive the transition to the > > next generation and allow the original datalog entry to retry and make > > progress. > > > > Does this sound reasonable? I'll start preparing an update to the design > > doc. > > > > On 3/11/20 11:53 AM, Casey Bodley wrote: > > > Well shoot. It turns out that some critical pieces of this design were > > > based on a flawed assumption about the datalog; namely that all > > > changes for a particular bucket would map to the same datalog shard. I > > > now see that writes to different shards of a bucket are mapped to > > > different shards in RGWDataChangesLog::choose_oid(). > > > > > > We use cls_locks on these datalog shards so that several gateways can > > > sync the shards independently. But this isn't compatible with several > > > elements of this resharding design that require bucket-wide coordination: > > > > > > * changing bucket full sync from per-bucket-shard to per-bucket > > > > > > * moving bucket sync status from per-bucket-shard to per-bucket > > > > > > * waiting for incremental sync of each bucket shard to complete before > > > advancing to the next log generation > > > > > > > > > On 2/25/20 2:53 PM, Casey Bodley wrote: > > >> I opened a pull request https://github.com/ceph/ceph/pull/33539 with > > >> a design doc for dynamic resharding in multisite. Please review and > > >> give feedback, either here or in PR comments! > > >> > > _______________________________________________ > > Dev mailing list -- dev(a)ceph.io > > To unsubscribe send an email to dev-leave(a)ceph.io > > >

Yehuda Sadeh-Weinraub

24 Mar 24 Mar

5:11 a.m.

On Mon, Mar 23, 2020 at 11:37 PM Yehuda Sadeh-Weinraub <yehuda(a)redhat.com> wrote:

...

On Mon, Mar 23, 2020 at 8:50 PM Casey Bodley <cbodley(a)redhat.com> wrote:

On Mon, Mar 23, 2020 at 1:59 PM Yehuda Sadeh-Weinraub <yehuda(a)redhat.com> wrote:

Sorry, got these reversed (and confusing). Should be: T -> S: fetch T <- S: data ... T -> S: fetch T <- S: complete, next!

...

Then we'll move to the next stage and start processing the new number of shards with the appropriate markers. When a reshard happens the target will send back a notice that now next.num_shards=17 with the corresponding markers. The next stage will only start after all the previous stage shards have been completed.

For backward compatibility we can create a module on the target side that would return the needed information for the source if the source doesn't support the new api. Yehuda Yehuda >> >> Yehuda >> >> On Thu, Mar 12, 2020 at 9:16 PM Casey Bodley <cbodley(a)redhat.com> wrote: >> > >> > One option would be to change the mapping in choose_oid() so that it >> > doesn't spread bucket shards. But doing this in a way that's >> > backward-compatible with existing datalogs would be complicated and >> > probably require a lot of extra io (for example, writing all datalog >> > entries on the 'wrong' shard to the error repo of the new shard so they >> > get retried there). This spreading is an important scaling property, >> > allowing gateways to share and parallelize a sync workload that's >> > clustered around a small number of buckets. >> > >> > So I think the better option is to minimize this bucket-wide >> > coordination, and provide the locking/atomicity necessary to do it >> > across datalog shards. The common case (incremental sync of a bucket >> > shard) should require no synchronization, so we should continue to track >> > that incremental sync status in per-bucket-shard objects instead of >> > combining them into a single per-bucket object. >> > >> > We do still need some per-bucket status to track the full sync progress >> > and the current incremental sync generation. >> > >> > If the bucket status is in full sync, we can use a cls_lock on that >> > bucket status object to allow one shard to run the full sync, while >> > shards that fail to get the lock will go to the error repo for retry. >> > Once full sync completes, shards can go directly to incremental sync >> > without a lock. >> > >> > To coordinate the transitions across generations (reshard events), we >> > need some new logic when incremental sync on one shard reaches the end >> > of its log. Instead of cls_lock here, we can use cls_version to do a >> > read-modify-write of the bucket status object. This will either add our >> > shard to the set of completed shards for the current generation, or (if >> > all other shards completed already) increment that generation number so >> > we can start processing incremental sync on shards from that generation. >> > >> > This also modifies the handling of datalog entries for generations that >> > are newer than the current generation in the bucket status. In the >> > previous design draft, we would process these inline by running >> > incremental sync on all shards of the current generation - but this >> > would require coordination across datalog shards. So instead, each >> > unfinished shard in the bucket status can be written to the >> > corresponding datalog shard's error repo so it gets retried there. And >> > once these retries are successful, they will drive the transition to the >> > next generation and allow the original datalog entry to retry and make >> > progress. >> > >> > Does this sound reasonable? I'll start preparing an update to the design >> > doc. >> > >> > On 3/11/20 11:53 AM, Casey Bodley wrote: >> > > Well shoot. It turns out that some critical pieces of this design were >> > > based on a flawed assumption about the datalog; namely that all >> > > changes for a particular bucket would map to the same datalog shard. I >> > > now see that writes to different shards of a bucket are mapped to >> > > different shards in RGWDataChangesLog::choose_oid(). >> > > >> > > We use cls_locks on these datalog shards so that several gateways can >> > > sync the shards independently. But this isn't compatible with several >> > > elements of this resharding design that require bucket-wide coordination: >> > > >> > > * changing bucket full sync from per-bucket-shard to per-bucket >> > > >> > > * moving bucket sync status from per-bucket-shard to per-bucket >> > > >> > > * waiting for incremental sync of each bucket shard to complete before >> > > advancing to the next log generation >> > > >> > > >> > > On 2/25/20 2:53 PM, Casey Bodley wrote: >> > >> I opened a pull request https://github.com/ceph/ceph/pull/33539 with >> > >> a design doc for dynamic resharding in multisite. Please review and >> > >> give feedback, either here or in PR comments! >> > >> >> > _______________________________________________ >> > Dev mailing list -- dev(a)ceph.io >> > To unsubscribe send an email to dev-leave(a)ceph.io >> > >>

1488

days inactive

1516

days old

dev@ceph.io

Manage subscription

7 comments

3 participants

tags (0)

participants (3)

Casey Bodley
Matt Benjamin
Yehuda Sadeh-Weinraub