rgw: resharding buckets without blocking write ops

29 Aug 2019

sharing a design for feedback. please let me know if you spot any other 
races, issues or optimizations!

current resharding steps:
1) copy the 'source' bucket instance into a new 'target' bucket instance 
with a new instance id
2) flag all source bucket index shards with RESHARD_IN_PROGRESS
3) flag the source bucket instance with RESHARD_IN_PROGRESS
4) list all omap entries in the source bucket index shards 
(cls_rgw_bi_list) and write each entry to its target bucket index shard 
(cls_rgw_bi_put)
5a) on success: link target bucket instance, delete source bucket index 
shards, delete source bucket instance
5b) on failure: reset RESHARD_IN_PROGRESS flag on source bucket index 
shards, delete target bucket index shards, delete target bucket instance
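
as a rough illustration of the steps above, here is a toy in-memory model in python. everything in it (BucketInstance, shard_of, reshard, the dict layout, the injected failure) is a hypothetical stand-in, not the real rgw or cls_rgw code; it only mirrors the ordering of steps (1) through (5).

```python
# toy in-memory model of the current (blocking) reshard flow.
# all names are hypothetical; this is not the rgw/cls_rgw api.
import zlib


def shard_of(key, num_shards):
    # stable hash so the example is deterministic (rgw uses its own hash)
    return zlib.crc32(key.encode()) % num_shards


class BucketInstance:
    def __init__(self, instance_id, num_shards):
        self.instance_id = instance_id
        self.resharding = False
        # each index shard: a RESHARD_IN_PROGRESS-like flag plus omap entries
        self.shards = [{"resharding": False, "omap": {}}
                       for _ in range(num_shards)]


def reshard(source, new_id, new_num_shards, fail=False):
    target = BucketInstance(new_id, new_num_shards)      # step 1
    for shard in source.shards:                          # step 2
        shard["resharding"] = True
    source.resharding = True                             # step 3
    try:
        for shard in source.shards:                      # step 4
            for key, entry in shard["omap"].items():     # ~ cls_rgw_bi_list
                if fail:
                    raise RuntimeError("injected failure")
                dst = target.shards[shard_of(key, new_num_shards)]
                dst["omap"][key] = entry                 # ~ cls_rgw_bi_put
    except RuntimeError:
        for shard in source.shards:                      # step 5b
            shard["resharding"] = False
        return source    # target shards and instance are dropped
    return target        # step 5a: target becomes the linked instance
```

in this sketch "link" and "delete" are reduced to which instance the caller keeps; a real implementation has to persist each of those steps.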

the current blocking strategy is enforced on the source bucket index 
shards. any write operations received by cls_rgw while the 
RESHARD_IN_PROGRESS flag is set are rejected with ERR_BUSY_RESHARDING. 
radosgw handles these errors by waiting/polling until the reshard 
finishes, then it resends the operation to the new target bucket index 
shard.
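
a minimal sketch of that reject-and-poll behavior, again with made-up names (IndexShard, blocking_write, get_current_shard) and a placeholder error value rather than the real cls_rgw interfaces:

```python
# placeholder value; the real error code is not modeled here
ERR_BUSY_RESHARDING = "ERR_BUSY_RESHARDING"


class IndexShard:
    def __init__(self):
        self.resharding = False   # stands in for RESHARD_IN_PROGRESS
        self.omap = {}

    def write(self, key, entry):
        # stands in for cls_rgw rejecting writes while the flag is set
        if self.resharding:
            return ERR_BUSY_RESHARDING
        self.omap[key] = entry
        return 0


def blocking_write(get_current_shard, key, entry, max_polls=10):
    # get_current_shard() is a hypothetical hook that stands in for
    # radosgw re-reading the bucket info to find the right index shard
    for _ in range(max_polls):
        shard = get_current_shard()
        if shard.write(key, entry) == 0:
            return shard
        # a real client would sleep here before polling again
    raise TimeoutError("reshard did not finish in time")
```

the point is just that the write makes no progress until the lookup starts returning the target shard.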

to avoid blocking write ops during a reshard, we could instead apply 
their bucket index operations to both the source and target bucket index 
shards in parallel. this includes both the 'prepare' op to start the 
transaction, and the asynchronous 'complete' to commit. allowing both 
buckets to mutate during reshard introduces several new races:

I) between steps (2) and (3), radosgw doesn't yet see the 
RESHARD_IN_PROGRESS flag in the bucket instance info, so it doesn't know to 
send the extra index operations to the target bucket index shard

II) operations applied on the target bucket index shards could be 
overwritten by the omap entries copied from the source bucket index 
shards in step (4)

III) radosgw sends a 'prepare' op to the source bucket index shard 
before step (2), then sends the async 'complete' op to the source bucket 
index shard after (2). before step (5), this complete op would fail with 
ERR_BUSY_RESHARDING. after step (5a), it would fail with ENOENT. since 
the complete is async, and we've already replied to the client, it's too 
late for any recovery

IV) radosgw sends an operation to both the source and target bucket 
index shards that races with (5) and fails with ENOENT on either the 
source shard (5a) or the target shard (5b)
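
for reference, the dual-write path that these races apply to could look roughly like the sketch below. the entry layout (a set of pending tags plus an 'exists' bit), the IndexShard class and put_object are assumptions made for illustration; the ops are also issued one after another here rather than in parallel.

```python
class IndexShard:
    def __init__(self):
        # key -> {"pending": set of op tags, "exists": bool}
        self.entries = {}

    def prepare(self, key, tag):
        # start the index transaction: record a pending op on the entry
        entry = self.entries.setdefault(
            key, {"pending": set(), "exists": False})
        entry["pending"].add(tag)

    def complete(self, key, tag):
        # commit: clear the pending op and mark the object as present
        entry = self.entries[key]
        entry["pending"].discard(tag)
        entry["exists"] = True


def put_object(shards, key, tag, write_data):
    # during a reshard, shards would be [source_shard, target_shard];
    # otherwise just the single current shard
    for shard in shards:
        shard.prepare(key, tag)
    write_data()                 # the actual object write
    for shard in shards:
        shard.complete(key, tag)
```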

introducing a new generation number, or 'reshard_epoch', on each bucket 
that increments on every reshard attempt can help to resolve these races. so 
in step (2), the call to cls_rgw_set_bucket_resharding() would also 
increment the bucket index shard's reshard_epoch. similarly, step (3) 
would increment the bucket instance's reshard_epoch.

to resolve the race in (I), cls_rgw would reject bucket index operations 
with a reshard_epoch older than the one stored in the bucket index 
shard. this ERR_BUSY_RESHARDING error would direct radosgw to re-read 
its bucket instance, detect the reshard in progress, and resend the 
operation to both the source and target bucket index shards with the 
updated reshard_epoch
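
a sketch of how that epoch check and retry might fit together. BucketInfo, index_write and reread_info are hypothetical names, the error value is a placeholder, and the retry loop is an assumption to cover the window where step (3) hasn't landed yet and a re-read still returns the old epoch.

```python
ERR_BUSY_RESHARDING = -1   # placeholder value, not the real error code


class IndexShard:
    def __init__(self):
        self.reshard_epoch = 0
        self.omap = {}

    def set_resharding(self):
        # step (2): flagging the shard also bumps its epoch
        self.reshard_epoch += 1

    def write(self, key, entry, epoch):
        # reject ops sent by a client with a stale view of the bucket
        if epoch < self.reshard_epoch:
            return ERR_BUSY_RESHARDING
        self.omap[key] = entry
        return 0


class BucketInfo:
    def __init__(self, reshard_epoch, source, target=None):
        self.reshard_epoch = reshard_epoch
        self.source = source
        self.target = target    # set only while a reshard is in progress


def index_write(cached_info, reread_info, key, entry, max_retries=5):
    info = cached_info
    for _ in range(max_retries):
        ret = info.source.write(key, entry, info.reshard_epoch)
        if ret == 0:
            if info.target is not None:
                # reshard in progress: mirror the op to the target shard
                info.target.write(key, entry, info.reshard_epoch)
            return info
        # stale epoch: re-read the bucket instance and try again
        info = reread_info()
    raise TimeoutError("bucket instance never caught up with shard epoch")
```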

to resolve the race in (II), cls_rgw_bi_put() would have to test whether 
the given key exists before overwriting
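
in other words the copy in step (4) becomes a put-if-absent. a minimal sketch, using plain dicts in place of omap and hypothetical helper names (bi_put_if_absent, copy_shard):

```python
def bi_put_if_absent(omap, key, entry):
    # only write the copied entry if no client op got there first
    if key in omap:
        return False
    omap[key] = entry
    return True


def copy_shard(source_omap, target_omap):
    # step (4) with the existence check; returns the keys left alone
    skipped = []
    for key, entry in source_omap.items():
        if not bi_put_if_absent(target_omap, key, entry):
            skipped.append(key)
    return skipped
```

so if a client write already landed 'a' on the target during the reshard, copying {'a': old, 'b': old} from the source only adds 'b'.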

the race in (III) is benign: the 'prepared' entry was reliably stored in 
the source shard before the reshard, so we're guaranteed to see a copy on 
the target shard. even though the 'complete' operation isn't 
applied, the dir_suggest mechanism will detect the incomplete 
transaction and repair the index the next time the target bucket is listed

the race in (IV) can be treated as a success if the operation succeeds 
on the target bucket index shard. if it fails on the target shard, 
radosgw needs to re-read the bucket entrypoint and instance to retarget 
the operation
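
the decision for (IV) boils down to something like the sketch below. handle_dual_write and the retarget callback are hypothetical, and return codes follow the usual 0 / negative errno convention as an assumption.

```python
import errno


def handle_dual_write(source_ret, target_ret, retarget):
    # success on the target is enough, whatever the source returned
    # (e.g. -ENOENT because 5a already removed the source shards)
    if target_ret == 0:
        return 0
    # target failed (e.g. -ENOENT after 5b): retarget() stands in for
    # re-reading the bucket entrypoint and instance and resending the op
    return retarget()
```

for example, handle_dual_write(-errno.ENOENT, 0, ...) returns 0 without calling retarget, while a failure on the target always goes through retarget.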

one thing this strategy cannot handle is versioned buckets. some index 
operations for versioning (namely cls_rgw_bucket_link_olh and 
cls_rgw_bucket_unlink_instance) involve writes to two or more related 
omap entries. because step (4) copies over single omap entries, it can't 
preserve the consistency of these relationships once we allow mutation. 
so we'd need to stick with the blocking strategy for versioned buckets