On Mon, Mar 23, 2020 at 1:59 PM Yehuda Sadeh-Weinraub <yehuda@redhat.com> wrote:
Sorry for the late reply; this slipped under my radar while I was
looking into similar design issues.

I'm looking into reducing the complexity of dealing with separate
stages for full sync and incremental sync. It seems to me that we could
simplify things by moving that knowledge to the source. Instead of the
target behaving differently for incremental vs. full sync, we could
create an API that has the source send either the bucket listing or the
bilog entries, using a common result structure. The source can use the
marker to distinguish the state and to keep other information.
This would reduce the complexity on the sync side, and we could do
something similar for metadata sync. Moreover, it would later allow
defining different data sources that don't have a bilog and whose
scheme might be different.
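
Very roughly, I'm thinking of a common structure along these lines (the
names here are made up, just to illustrate the shape, not existing rgw
types):

  #include <string>
  #include <vector>

  // Illustrative sketch only: one entry type that the source returns
  // both while it is still listing the bucket (full sync) and once it
  // starts replaying bilog entries (incremental sync).
  struct sync_entry {
    std::string key;        // object key (and version/instance, if any)
    std::string op;         // e.g. "write" or "delete"
    std::string timestamp;  // mtime or bilog entry timestamp
  };

  struct sync_result {
    std::vector<sync_entry> entries;
    std::string marker;     // opaque to the target: the source encodes
                            // whether it is listing or replaying the
                            // log, plus its position
    bool truncated = false; // more entries available from the source
  };

The target would just keep asking for the next batch with the marker it
got back, without having to know which stage the source is in.
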
This is just the general idea; there are a lot of details to clear up,
and I've glossed over some of the complexities. However, what difference
would such a change make to your design?

In this resharding design, full and incremental sync are dimensioned differently: full sync runs per bucket, while incremental sync runs per bilog shard. So if we abstract them into a single pipe, it would probably need to be one pipe per bucket.

I like the idea, though. It could definitely simplify sync in the long run, but in the short term we would still need to sync from Octopus source zones, so a lot of the complexity would come from backward compatibility.


Yehuda

On Thu, Mar 12, 2020 at 9:16 PM Casey Bodley <cbodley@redhat.com> wrote:
>
> One option would be to change the mapping in choose_oid() so that it
> doesn't spread bucket shards. But doing this in a way that's
> backward-compatible with existing datalogs would be complicated and
> probably require a lot of extra I/O (for example, writing all datalog
> entries on the 'wrong' shard to the error repo of the new shard so they
> get retried there). This spreading is an important scaling property,
> allowing gateways to share and parallelize a sync workload that's
> clustered around a small number of buckets.
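>
> For reference, the effect of the current mapping is roughly the sketch
> below (not the actual choose_oid() code): the bucket shard id is part
> of the hash key, so the shards of one bucket land on many different
> datalog shards.
>
>   #include <cstddef>
>   #include <functional>
>   #include <string>
>
>   // Illustrative only: datalog shard selection hashes both the bucket
>   // name and the bucket shard id, which is what spreads a single
>   // bucket's shards across the datalog shards.
>   int choose_datalog_shard(const std::string& bucket_name,
>                            int bucket_shard, int num_datalog_shards) {
>     const std::string key = bucket_name + ":" +
>         std::to_string(bucket_shard);
>     return static_cast<int>(std::hash<std::string>{}(key)
>                             % static_cast<std::size_t>(num_datalog_shards));
>   }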
>
> So I think the better option is to minimize this bucket-wide
> coordination, and provide the locking/atomicity necessary to do it
> across datalog shards. The common case (incremental sync of a bucket
> shard) should require no synchronization, so we should continue to track
> that incremental sync status in per-bucket-shard objects instead of
> combining them into a single per-bucket object.
>
> We do still need some per-bucket status to track the full sync progress
> and the current incremental sync generation.
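>
> Roughly, the split I have in mind (the field names are just
> illustrative):
>
>   #include <cstdint>
>   #include <set>
>   #include <string>
>
>   // Per bucket shard (one object each): incremental sync position
>   // only. Updating this needs no coordination with other shards.
>   struct bucket_shard_sync_status {
>     std::string inc_marker;          // position in that shard's bilog
>   };
>
>   // Per bucket (a single object): only the state that actually needs
>   // bucket-wide coordination.
>   struct bucket_sync_status {
>     enum class State { FullSync, Incremental };
>     State state = State::FullSync;
>     std::string full_sync_marker;    // listing position during full sync
>     std::uint64_t incremental_gen = 0;     // current log generation
>     std::set<std::uint32_t> shards_done;   // shards finished with this gen
>   };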
>
> If the bucket status is in full sync, we can use a cls_lock on that
> bucket status object to allow one shard to run the full sync, while
> shards that fail to get the lock will go to the error repo for retry.
> Once full sync completes, shards can go directly to incremental sync
> without a lock.
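>
> The intended flow for a datalog-shard worker would be roughly as
> follows (this isn't the actual cls_lock client call, just the shape of
> the logic, with placeholders passed in as callables):
>
>   #include <functional>
>
>   enum class ShardResult { Synced, RetryLater };
>
>   // Sketch only: take_bucket_lock() stands in for the cls_lock on the
>   // per-bucket status object.
>   ShardResult sync_bucket_shard(bool bucket_in_full_sync,
>                                 const std::function<bool()>& take_bucket_lock,
>                                 const std::function<void()>& run_full_sync,
>                                 const std::function<void()>& run_incremental) {
>     if (bucket_in_full_sync) {
>       if (!take_bucket_lock()) {
>         // another shard is running full sync; this shard is written to
>         // its datalog shard's error repo and retried later
>         return ShardResult::RetryLater;
>       }
>       run_full_sync();  // covers the whole bucket, then flips the
>                         // bucket status to incremental
>     }
>     run_incremental();  // per bucket shard, no bucket-wide lock needed
>     return ShardResult::Synced;
>   }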
>
> To coordinate the transitions across generations (reshard events), we
> need some new logic when incremental sync on one shard reaches the end
> of its log. Instead of cls_lock here, we can use cls_version to do a
> read-modify-write of the bucket status object. This will either add our
> shard to the set of completed shards for the current generation, or (if
> all other shards have already completed) increment the generation number
> so we can start processing incremental sync on the shards of the new
> generation.
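>
> The read-modify-write itself would be roughly the following (the
> cls_version guard is left out here; on a version mismatch we just
> re-read the status object and retry):
>
>   #include <cstdint>
>   #include <set>
>
>   struct gen_status {
>     std::uint64_t gen = 0;                // current incremental generation
>     std::set<std::uint32_t> shards_done;  // shards finished with this gen
>   };
>
>   // Called when 'shard_id' reaches the end of its log for the current
>   // generation. Returns true if this call advanced the generation.
>   bool on_shard_log_end(gen_status& status, std::uint32_t shard_id,
>                         std::uint32_t num_shards) {
>     status.shards_done.insert(shard_id);
>     if (status.shards_done.size() < num_shards) {
>       return false;  // other shards are still working on this generation
>     }
>     // all shards finished: advance to the next generation
>     status.gen++;
>     status.shards_done.clear();
>     return true;
>   }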
>
> This also modifies the handling of datalog entries for generations that
> are newer than the current generation in the bucket status. In the
> previous design draft, we would process these inline by running
> incremental sync on all shards of the current generation - but this
> would require coordination across datalog shards. So instead, each
> unfinished shard in the bucket status can be written to the
> corresponding datalog shard's error repo so it gets retried there. And
> once these retries are successful, they will drive the transition to the
> next generation and allow the original datalog entry to retry and make
> progress.
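>
> Putting it together, handling a datalog entry would look something
> like this (write_to_error_repo() is a placeholder for queueing a
> bucket shard on its own datalog shard's error repo):
>
>   #include <cstdint>
>   #include <functional>
>   #include <set>
>
>   void handle_datalog_entry(
>       std::uint64_t entry_gen, std::uint64_t current_gen,
>       std::uint32_t num_shards, const std::set<std::uint32_t>& shards_done,
>       const std::function<void(std::uint32_t)>& write_to_error_repo) {
>     if (entry_gen <= current_gen) {
>       // normal case: sync the bucket shard for this entry directly
>       return;
>     }
>     // the entry is from a newer generation: push every unfinished
>     // shard of the current generation to its error repo; those retries
>     // will complete the generation and let this entry make progress
>     // when it is retried
>     for (std::uint32_t shard = 0; shard < num_shards; ++shard) {
>       if (!shards_done.count(shard)) {
>         write_to_error_repo(shard);
>       }
>     }
>   }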
>
> Does this sound reasonable? I'll start preparing an update to the design
> doc.
>
> On 3/11/20 11:53 AM, Casey Bodley wrote:
> > Well shoot. It turns out that some critical pieces of this design were
> > based on a flawed assumption about the datalog; namely that all
> > changes for a particular bucket would map to the same datalog shard. I
> > now see that writes to different shards of a bucket are mapped to
> > different datalog shards by RGWDataChangesLog::choose_oid().
> >
> > We use cls_locks on these datalog shards so that several gateways can
> > sync the shards independently. But this isn't compatible with several
> > elements of this resharding design that require bucket-wide coordination:
> >
> > * changing bucket full sync from per-bucket-shard to per-bucket
> >
> > * moving bucket sync status from per-bucket-shard to per-bucket
> >
> > * waiting for incremental sync of each bucket shard to complete before
> > advancing to the next log generation
> >
> >
> > On 2/25/20 2:53 PM, Casey Bodley wrote:
> >> I opened a pull request https://github.com/ceph/ceph/pull/33539 with
> >> a design doc for dynamic resharding in multisite. Please review and
> >> give feedback, either here or in PR comments!
> >>