Sounds good to me.
To remove the additional delay, using clone_overlap is better than my
implementation.
I will re-push commits based on your idea.
Myoungwon
On Wed, Aug 7, 2019 at 2:33 AM, Sage Weil <sweil(a)redhat.com> wrote:
On Fri, 28 Jun 2019, Gregory Farnum wrote:
On Fri, Jun 28, 2019 at 7:50 AM Sage Weil <sweil(a)redhat.com> wrote:
>
> Hi Myoungwon,
>
> I was thinking about how a refcounted cas pool would interact with
> snapshots and it occurred to me that dropping refs when an object is
> deleted may break snapshotted versions of that object. If object A has
> a ref to chunk X, is snapshotted, then A is deleted, we'll (currently)
> drop the ref to X and remove it. That means that A can't be read.
>
> One way to get around that would be to mirror snaps from the source pool
> to the chunk pool--this is how cache tiering works. The problem I see
> there is that I'd hoped to allow multiple pools to share/consume the same
> chunk pool, but each pool has its own snapid namespace.
>
> Another would be to bake the refs more deeply into the source rados pool
> so that the refs are only dropped after all clones also drop the ref.
> That is harder to track, though, since I think you'd need to examine all
> of the clones to know whether the ref is truly gone. Unless we embed
> even more metadata in the SnapSet--something analogous to clone_overlap
> to identify the chunks. That seems like it will bloat that structure,
> though.
>
> Other ideas?
Is there much design work around refcounting and snapshots yet?
I haven't thought it through much but one possibility is that each
on-disk clone counts as its own reference, and on a write to the
manifest object you increment the reference to all the chunks in
common. When snaptrimming finally removes a clone, it has to decrement
all the chunk references contained in the manifest.
I don't love this for the extra trimming work and remote reference
updates, but it's one way to keep the complexity of the data
structures down.
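The accounting above can be sketched roughly as follows. This is a minimal illustration only, assuming a hypothetical manifest mapping offsets to chunk IDs and a plain dict standing in for the CAS chunk pool's refcounts; none of these names are real Ceph structures.

```python
chunk_refs = {}  # chunk_id -> refcount held in the (hypothetical) chunk pool


def make_clone(manifest):
    """When a write snapshots the object, every chunk the new clone
    shares with head gains one reference."""
    for chunk_id in manifest.values():
        chunk_refs[chunk_id] = chunk_refs.get(chunk_id, 0) + 1


def trim_clone(manifest):
    """When snaptrimming removes the clone, decrement every chunk it
    referenced and delete chunks whose count reaches zero."""
    for chunk_id in manifest.values():
        chunk_refs[chunk_id] -= 1
        if chunk_refs[chunk_id] == 0:
            del chunk_refs[chunk_id]


# head initially holds one ref per chunk it maps
head = {0: "X", 4096: "Y"}
for c in head.values():
    chunk_refs[c] = chunk_refs.get(c, 0) + 1

make_clone(head)   # snapshot: X and Y now carry 2 refs each
trim_clone(head)   # trim the clone: back to 1 ref each, head still readable
```

The cost Greg mentions is visible here: both clone creation and trimming walk the full manifest and touch every chunk's remote refcount.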
Other options:
* Force 1:1 mapping. Not sure how good or bad this is since I haven't
seen a lot of CAS pool discussion.
This is implemented by
https://github.com/ceph/ceph/pull/29283.
The concern I have with this approach is that any write that triggers
a clone creation may need to block while all of the ref counts for the
clone are incremented. This is slow, and also introduces one more window
for an OSD failure to lead to leaked references (not critical but not
great either).
Here's a new idea:
Currently all of the write operations populate the OpContext
modified_ranges map, which is then subtracted from the most recent clone's
clone_overlap in the SnapSet. We could use that to take one of two paths:
1) If a newly dereferenced (by head) chunk overlaps with the most recent
clone, do nothing--that clone still has a reference to it.
2) If a newly dereferenced (by head) chunk does NOT overlap with the most
recent clone, then it is the only referent, and we can decrement it after
we apply the update (like we do today).
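A rough sketch of that two-path decision, treating both chunks and the clone_overlap as lists of (offset, length) intervals; these are stand-ins for Ceph's interval_set, not the real API:

```python
def intervals_overlap(a, b):
    """True if (offset, length) intervals a and b share any byte."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]


def can_drop_ref(chunk, clone_overlap):
    """A chunk newly dereferenced by head may have its ref decremented
    only if no part of it still overlaps the most recent clone (path 2);
    otherwise the clone still holds a reference and we do nothing (path 1)."""
    return not any(intervals_overlap(chunk, o) for o in clone_overlap)


# most recent clone still shares bytes [0, 8192) with head
overlap = [(0, 8192)]
keep = can_drop_ref((0, 4096), overlap)     # False: clone still refs it
drop = can_drop_ref((8192, 4096), overlap)  # True: unique to head, decrement
```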
Then, trim_object() needs to be smart. When a clone is removed, it needs
to compare the clone's chunks to the adjacent clones or head, and make a
similar determination of whether the chunk reference is unique to the
clone or shared by one of its neighbors.
I think this is possible by inspecting *only* the clone_overlap, which is
in the SnapSet, and already always present in memory.
What do you think?
sage
* No longer giving each pool its own snapshot namespace. Not sure this
was a great design decision to begin with; would require updating
CephFS snap allocation but I don't think anything else outside the
monitors.
* Disallowing snapshots on manifest-based objects/pools. What are the
target workloads for these?
-Greg
sage
_______________________________________________
Dev mailing list -- dev(a)ceph.io
To unsubscribe send an email to dev-leave(a)ceph.io