On Fri, Jun 28, 2019 at 7:50 AM Sage Weil <sweil@redhat.com> wrote:
Hi Myoungwon,
I was thinking about how a refcounted cas pool would interact with
snapshots and it occurred to me that dropping refs when an object is
deleted may break snapshotted versions of that object. If object A has
a ref to chunk X, is snapshotted, then A is deleted, we'll (currently)
drop the ref to X and remove it. That means the snapshotted version of A
can't be read.
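
To make the failure mode concrete, here is a toy simulation (this is not
Ceph code; every name in it is invented) of a chunk pool that takes a
single reference per head object:

    // toy model: one ref per *head* object is not enough once A is snapshotted
    #include <iostream>
    #include <map>
    #include <set>
    #include <string>

    struct ChunkPool {
      std::map<std::string, int> refs;               // chunk -> refcount
      void get(const std::string& c) { ++refs[c]; }
      void put(const std::string& c) {               // chunk is removed at ref 0
        if (--refs.at(c) == 0) refs.erase(c);
      }
      bool exists(const std::string& c) const { return refs.count(c) != 0; }
    };

    int main() {
      ChunkPool cas;
      std::set<std::string> manifest{"X"};           // A's manifest points at chunk X
      for (const auto& c : manifest) cas.get(c);     // one ref, held by the head

      std::set<std::string> clone = manifest;        // snapshot A: the clone shares
                                                     // the manifest, no extra ref taken

      for (const auto& c : manifest) cas.put(c);     // delete A: ref dropped, X removed

      for (const auto& c : clone)                    // the clone still needs X...
        std::cout << "chunk " << c << ": "
                  << (cas.exists(c) ? "ok" : "gone") << "\n";   // prints "gone"
    }
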
One way to get around that would be to mirror snaps from the source pool
to the chunk pool--this is how cache tiering works. The problem I see
there is that I'd hoped to allow multiple pools to share/consume the same
chunk pool, but each pool has its own snapid namespace.
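
The namespace problem in miniature (again a toy with invented names): two
source pools can hand the shared chunk pool the same snapid for unrelated
snapshots, so a mirrored clone keyed by snapid alone is ambiguous:

    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>

    int main() {
      using snapid_t = uint64_t;
      // each source pool allocates snapids independently
      std::map<std::string, snapid_t> snaps = {
        {"pool1/app-backup", 4},     // pool1's snapid 4
        {"pool2/weekly",     4},     // pool2's snapid 4 -- unrelated snapshot
      };
      // a clone in the shared chunk pool keyed only by snapid
      // can't tell the two apart
      std::map<snapid_t, std::string> chunk_pool_clones;
      for (const auto& [name, id] : snaps) {
        auto [it, inserted] = chunk_pool_clones.emplace(id, name);
        if (!inserted)
          std::cout << "collision: snapid " << id << " is both "
                    << it->second << " and " << name << "\n";
      }
    }
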
Another would be to bake the refs more deeply into the source rados pool
so that the refs are only dropped after all clones also drop the ref.
That is harder to track, though, since I think you'd need to examine all
of the clones to know whether the ref is truly gone. Unless we embed
even more metadata in the SnapSet--something analogous to clone_overlap to
identify the chunks. That seems like it will bloat that structure,
though.
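
Purely as a sketch of what that embedding could look like (clone_chunks is
invented here, by analogy with clone_overlap; nothing like it exists today):

    #include <cstdint>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    using snapid_t = uint64_t;
    using chunk_t  = std::string;        // stand-in for a (pool, oid) chunk ref

    struct SnapSetSketch {
      std::vector<snapid_t> clones;                        // as SnapSet has today
      std::map<snapid_t, std::set<chunk_t>> clone_chunks;  // invented field

      // a ref can only really be dropped once neither the head nor any
      // clone still maps the chunk
      bool referenced(const chunk_t& c, const std::set<chunk_t>& head) const {
        if (head.count(c)) return true;
        for (const auto& [snap, chunks] : clone_chunks)
          if (chunks.count(c)) return true;
        return false;
      }
    };

The bloat worry is visible right there: clone_chunks grows as clones x
chunks per object.
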
Other ideas?

Is there much design work around refcounting and snapshots yet?
I haven't thought it through much but one possibility is that each
on-disk clone counts as its own reference, and on a write to the
manifest object you increment the references to all the chunks in
common. When snaptrimming finally removes a clone, it has to decrement
all the chunk references contained in the manifest.
I don't love this for the extra trimming work and remote reference
updates, but it's one way to keep the complexity of the data
structures down.
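
In the same toy model as above (invented names again), that bookkeeping
would be roughly:

    #include <cstdint>
    #include <map>
    #include <set>
    #include <string>

    struct ChunkPool {
      std::map<std::string, int> refs;
      void get(const std::string& c) { ++refs[c]; }
      void put(const std::string& c) { if (--refs.at(c) == 0) refs.erase(c); }
    };

    struct ManifestObject {
      std::set<std::string> head;                        // head's chunk manifest
      std::map<uint64_t, std::set<std::string>> clones;  // snapid -> clone manifest

      // first write after a snapshot: the clone is created, and every chunk
      // it shares with the head takes its own reference
      void clone_on_write(ChunkPool& cas, uint64_t snap) {
        clones[snap] = head;
        for (const auto& c : clones[snap]) cas.get(c);
      }
      // snaptrim removing the clone decrements everything its manifest holds
      void trim(ChunkPool& cas, uint64_t snap) {
        for (const auto& c : clones[snap]) cas.put(c);
        clones.erase(snap);
      }
    };
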
Other options:
* Force a 1:1 mapping between source pool and chunk pool. Not sure how
good or bad this is since I haven't seen a lot of CAS pool discussion.
* No longer giving each pool its own snapshot namespace. Not sure this
was a great design decision to begin with; would require updating
CephFS snap allocation but I don't think anything else outside the
monitors.
* Disallowing snapshots on manifest-based objects/pools. What are the
target workloads for these?
-Greg
>
> sage
> _______________________________________________
> Dev mailing list -- dev(a)ceph.io
> To unsubscribe send an email to dev-leave(a)ceph.io