Fencing an entire client cluster from access to Ceph (in kubernetes) - Dev

Patrick Donnelly

27 Oct

3:49 p.m.

Hi Shyam, Thanks for starting this discussion. On Tue, Oct 27, 2020 at 8:12 AM Shyam Ranganathan <srangana(a)redhat.com> wrote:

Asks:
-----
This mail is to trigger a discussion on the potential solution, provided later below, for the issue as per the subject, and to possibly gather other ideas/options, to enable the use case as described.

Use case/Background:
--------------------
Ceph is used by kubernetes to provide persistent storage (block and file, via RBD and CephFS respectively) to pods, via the CSI interface implemented in ceph-csi [1].

One of the use cases that we want to solve is when multiple kubernetes clusters access the same Ceph storage cluster [2], and further these kubernetes clusters provide for DR (disaster recovery) of workloads, when a peer kubernetes cluster becomes unavailable.

IOW, if a workload is running on kubernetes cluster-a and has access to persistent storage, it can be migrated to cluster-b in case of a DR event in cluster-a, ensuring workload continuity and with it access to the same persistent storage (as the Ceph cluster is shared and available).

Problem:
--------
The exact status of all clients/nodes in kubernetes cluster-a on a DR event is unknown: all may be down, or some may still be up and running, still accessing storage.

This brings about the need to fence all IO from all nodes/container-networks on cluster-a, on a DR event, prior to migrating the workloads to cluster-b.

Existing solutions and issues:
------------------------------
Current schemes to fence IO are per client [3], and further per image for RBD. This makes it a prerequisite that all client addresses in cluster-a are known and are further unique across peer kubernetes clusters, for a fence/blocklist to be effective.

Also, during recovery of kubernetes cluster-a, as kubernetes uses the current known state of the world (i.e. the workload "was" running on this cluster) and reconciles to the desired state of the world eventually, it is possible that re-mounts may occur prior to reaching the desired state of the world (which would be not to run the said workloads on this cluster).

The recovery may hence cause the existing connection-based blocklists to be reset, as newer mounts/maps of the fs/image are performed on the recovering cluster.

The issues above make the existing blocklist scheme either unreliable or cumbersome to deal with for all possible nodes in the respective kubernetes clusters.

Potential solution:
-------------------
On discussing the above with Jason, he pointed to a potential solution (as follows) to resolve the problem,

<snip>
My suggestion would be to utilize CephX to revoke access to the cluster from site A when site B is promoted. The one immediate issue with this approach is that any clients with existing tickets will keep their access to the cluster until the ticket expires. Therefore, for this to be effective, we would need a transient CephX revocation list capability to essentially blocklist CephX clients for X period of time until we can be sure that their tickets have expired and are therefore no longer usable.
</snip>

The above is quite trivial from a kubernetes and ceph-csi POV, as each peer kubernetes cluster can be configured to use different cephx identities, which can thus be independently revoked and later reinstated, solving the issues laid out above.

Revoking credentials for an existing cephx identity can be done by changing its existing authorization, and hence is readily available.

The ability to provide a revocation list for existing valid tickets, which clients already have, would need to be developed.

Thoughts and other options?
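To make the revocation-list idea above concrete, here is a minimal sketch of how a transient revocation list could interact with ticket expiry. It is a toy model only, not CephX code: the `AuthModel` and `Ticket` classes, the `client.cluster-a` name, and the one-hour TTL are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Ticket:
    entity: str        # e.g. "client.cluster-a" (hypothetical name)
    issued_at: float
    expires_at: float


@dataclass
class AuthModel:
    """Toy model of ticket-based auth; NOT the real CephX implementation."""
    ticket_ttl: float
    authorized: set = field(default_factory=set)
    revoked_at: dict = field(default_factory=dict)  # entity -> revocation time

    def issue(self, entity: str, now: float) -> Ticket:
        if entity not in self.authorized:
            raise PermissionError(f"{entity} is not authorized")
        return Ticket(entity, now, now + self.ticket_ttl)

    def revoke(self, entity: str, now: float) -> None:
        # Stop issuing new tickets and remember when the revocation happened,
        # so tickets issued at or before that time can be rejected.
        self.authorized.discard(entity)
        self.revoked_at[entity] = now

    def reinstate(self, entity: str) -> None:
        self.authorized.add(entity)

    def accepts(self, ticket: Ticket, now: float) -> bool:
        if now >= ticket.expires_at:
            return False
        revoked = self.revoked_at.get(ticket.entity)
        return revoked is None or ticket.issued_at > revoked

    def prune(self, now: float) -> None:
        # An entry is only needed until every ticket issued before the
        # revocation has expired on its own.
        self.revoked_at = {e: t for e, t in self.revoked_at.items()
                           if now < t + self.ticket_ttl}
```

In this model the list is transient in the sense described in the mail: an entry can be dropped one ticket lifetime after the revocation, because every older ticket has expired by then.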

While tempting, I think we're unnecessarily restricting our attention to current options. It seems to me we should consider another mechanism for blocklisting clients en masse. I would suggest having clients add a "tag" to their sessions with Ceph daemons which can be separately blocklisted. The tag can be derived from the cephx key they use, so it does not require updating all client code (like the kernel) to send a tag. The cephx credential would probably look like: "mon 'allow r tag=bar'".

Once a new tag is added to the blocklist distributed via the MonMap (or OSDMap for consistency), daemons would need to go through their open sessions and blocklist any matches.

There are other applications beyond DR. We currently have a heavy-weight "registered clients" map in the MgrMap which records all open RADOS instances by the mgr. These all need to be blocklisted if the mgr fails over. It is racy to keep this up-to-date, so we see virtually unavoidable and annoying test failures [1,2]. If we used a "mgr.x" tag for the mgr.x credential (perhaps an implicit tag of several), we could blocklist that instead to avoid keeping track entirely.

What do you think?

[1] https://tracker.ceph.com/issues/40867
[2] https://tracker.ceph.com/issues/43943

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
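The tag proposal in this message can be sketched roughly as follows: a daemon keeps the tag for each open session, and when a new blocklist arrives it closes every matching session and refuses new ones. This is an illustrative model under stated assumptions, not Ceph code; `Session`, `Daemon`, `apply_blocklist`, and the tag and entity names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Session:
    entity: str   # cephx entity name, e.g. "client.a-rbd" (hypothetical)
    addr: str     # client address; NOT used for matching in this scheme
    tag: str      # tag assumed to be derived from the cephx credential
    open: bool = True


class Daemon:
    """Toy stand-in for a Ceph daemon holding client sessions."""

    def __init__(self) -> None:
        self.sessions: list[Session] = []
        self.blocked_tags: set[str] = set()

    def connect(self, entity: str, addr: str, tag: str) -> Session:
        # New sessions carrying a blocklisted tag are refused outright.
        if tag in self.blocked_tags:
            raise ConnectionRefusedError(f"tag {tag!r} is blocklisted")
        session = Session(entity, addr, tag)
        self.sessions.append(session)
        return session

    def apply_blocklist(self, blocked_tags: set[str]) -> list[Session]:
        """Called when a new map epoch carrying the tag blocklist arrives."""
        self.blocked_tags = set(blocked_tags)
        closed = []
        for session in self.sessions:
            if session.open and session.tag in self.blocked_tags:
                session.open = False
                closed.append(session)
        return closed
```

Because matching is on the tag rather than on the address, a client that reconnects from a new address after a restart is still refused, which is the property the per-address blocklist lacks.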

Patrick Donnelly

5:59 p.m.

On Tue, Oct 27, 2020 at 4:10 PM Jason Dillaman <jdillama(a)redhat.com> wrote:

On Tue, Oct 27, 2020 at 6:50 PM Patrick Donnelly <pdonnell(a)redhat.com> wrote:

While tempting, I think we're unnecessarily restricting our attention to current options. It seems to me we should consider another mechanism for blocklisting clients en masse. I would suggest having clients add a "tag" to their sessions with Ceph daemons which can be separately blocklisted. The tag can be derived from the cephx key they use, so it does not require updating all client code (like the kernel) to send a tag. The cephx credential would probably look like: "mon 'allow r tag=bar'".

I like this idea as well, but I think the syntax for describing the tag feels a little funky since you aren't actually "allowing" anything. At this point, why not just extend blocklisting to support entity names in general and avoid the need to touch the caps?

An entity name glob may also work!
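As a rough illustration of the entity-name glob idea, the check could be as simple as shell-style pattern matching against the entity name. The sketch below uses Python's standard `fnmatch` module; the function `is_blocklisted` and the `client.cluster-a-*` naming scheme are assumptions for the example, not anything Ceph defines.

```python
from fnmatch import fnmatchcase


def is_blocklisted(entity: str, patterns: list[str]) -> bool:
    """Return True if the entity name matches any blocklisted glob pattern."""
    return any(fnmatchcase(entity, pattern) for pattern in patterns)


# Hypothetical convention: every identity used by kubernetes cluster-a
# shares the "client.cluster-a-" prefix, so one pattern fences them all.
blocklist = ["client.cluster-a-*"]
```

With a naming convention like this, the separate RBD and CephFS identities of one kubernetes cluster could be fenced with a single entry, without touching their caps.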

Would there be other uses for this tag?

Perhaps it'd be another piece of metadata on a cap and not part of "mon" to avoid confusion. The tags could also be useful for mass removal/modification of auth credentials. That has potential use for fine-grained access control with thousands of auth credentials.

Once a new tag is added to the blocklist distributed via the MonMap (or OSDMap for consistency), daemons would need to go through their open sessions and blocklist any matches. There are other applications beyond DR. We currently have a heavy-weight "registered clients" map in the MgrMap which records all open RADOS instances by the mgr. These all need to be blocklisted if the mgr fails over. It is racy to keep this up-to-date, so we see virtually unavoidable and annoying test failures [1,2]. If we used a "mgr.x" tag for the mgr.x credential (perhaps an implicit tag of several), we could blocklist that instead to avoid keeping track entirely.

How would mgr.x unblocklist itself when it restarts?

Ya, this is the tricky part since there's no nonce. If we used tags, the mgr could configure a tag with a nonce for itself spanning all sessions (in g_ceph_context). The mons would then blocklist that tag. For entity names, I guess you'd have to unblocklist it as part of startup.

Jason Dillaman

29 Oct

6:11 a.m.

On Thu, Oct 29, 2020 at 1:10 AM Madhu Rajanna <mrajanna(a)redhat.com> wrote:

A couple of scenarios we may need to consider for fencing:
* Fencing the workload of a single namespace, when we want to move the workload for only that namespace and not for all namespaces (the non-critical workloads won't be moved to the secondary site, but they need to be started again once the primary cluster is recovered).

Couldn't each Ceph namespace utilize a different CephX user? In the case of sync (metro) DR, or of isolating multiple k8s clusters as independent tenants sharing the same Ceph cluster resources, it seems like this would just be different CephX users.

* A few critical applications in a namespace may need to be fenced (the applications may be running on different nodes) when they are moved to the secondary site.
* We also need to revert to the same state (unblocklist) once the primary cluster is recovered.

That was why I was proposing the revocation list idea since there would be no need to blocklist / unblocklist. The revocation list would force a client to obtain a new ticket -- but by that time the caps should have been updated to revoke its privileges as needed.

* Do we need to consider anything in the case of async DR, if the primary cluster control plane is dead?

For RBD, this would be a forced-failover so there actually would be no need for fencing since it's handled as a split-brain event. Plus, the nature of async DR implies that you have two separate Ceph clusters.

Or do we need to fence all the clients in the primary cluster in case of DR?

-- Jason

Jason Dillaman

1 Dec

5:51 a.m.

On Mon, Nov 30, 2020 at 8:33 PM Shyam Ranganathan <srangana(a)redhat.com> wrote:

Rekindling... responses inline. On 10/27/20 8:59 PM, Patrick Donnelly wrote:

On Tue, Oct 27, 2020 at 4:10 PM Jason Dillaman <jdillama(a)redhat.com> wrote:

On Tue, Oct 27, 2020 at 6:50 PM Patrick Donnelly <pdonnell(a)redhat.com> wrote:

While tempting, I think we're unnecessarily restricting our attention to current options. It seems to me we should consider another mechanism for blocklisting clients en masse. I would suggest having clients add a "tag" to their sessions with Ceph daemons which can be separately blocklisted. The tag can be derived from the cephx key they use, so it does not require updating all client code (like the kernel) to send a tag. The cephx credential would probably look like: "mon 'allow r tag=bar'".

I like this idea as well, but I think the syntax for describing the tag feels a little funky since you aren't actually "allowing" anything. At this point, why not just extend blocklisting to support entity names in general and avoid the need to touch the caps?

An entity name glob may also work!

If I understand this right, multiple cephx identities with the same entity name label(?) can be blocked en masse? This would also help the kubernetes case, as we would be using (at least) 2 different cephx identities across cephfs and rbd (maybe more, based on the different pools used to provision and consume storage). So if all these identities carry the same entity name and can be blocked in one go, it would serve the purpose better.

An entity name in Ceph is the standard "client.<user id>" convention.

Would there be other uses for this tag?

Perhaps it'd be another piece of metadata on a cap and not part of "mon" to avoid confusion. The tags could also be useful for mass removal/modification of auth credentials. That has potential use for fine-grained access control with thousands of auth credentials.

Once a new tag is added to the blocklist distributed via the MonMap (or OSDMap for consistency), daemons would need to go through their open sessions and blocklist any matches. There are other applications beyond DR. We currently have a heavy-weight "registered clients" map in the MgrMap which records all open RADOS instances by the mgr. These all need to be blocklisted if the mgr fails over. It is racy to keep this up-to-date, so we see virtually unavoidable and annoying test failures [1,2]. If we used a "mgr.x" tag for the mgr.x credential (perhaps an implicit tag of several), we could blocklist that instead to avoid keeping track entirely.

How would mgr.x unblocklist itself when it restarts?

Ya, this is the tricky part since there's no nonce. If we used tags, the mgr could configure a tag with a nonce for itself spanning all sessions (in g_ceph_context). The mons would then blocklist that tag. For entity names, I guess you'd have to unblocklist it as part of startup.

An entity-name/tag cannot unblock itself, right? The very action to unblock would be blocked, no? Wouldn't the unblock hence have to be invoked with a different entity label/tag/nonce?

Yes, that's the issue w/ using broad strokes for blocklisting. However, the nonce that Patrick was referring to is really just the ability to generate a unique tag, not the normal nonce that is part of the existing blocklisting mechanism: MGR.A registers with tag "MGR.A.1", gets blocklisted by its tag "MGR.A.1", and when it re-registers, it uses a new "MGR.A.2" tag. That would have been special logic only within the MGR and wouldn't exist in ceph-csi, librbd, krbd, etc. However, I suspect that for that scheme to work, the MGR would essentially need to bootstrap a new CephX user at startup with the new tag, since otherwise it would still be blocklisted (i.e. connect to the cluster with MGR.A credentials, create the MGR.A.1 user, and then reconnect w/ MGR.A.1 for the remainder of its lifetime). But that is starting to look just like blocklisting by entity name.
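The "MGR.A.1" then "MGR.A.2" scheme described here amounts to a per-daemon generation counter appended to a base name. A minimal sketch of that bookkeeping follows; the `TagRegistry` class and its methods are hypothetical and only illustrate why a restarted daemon is not caught by the blocklist entry for its previous incarnation.

```python
class TagRegistry:
    """Toy model of generation-numbered tags; not an existing Ceph component."""

    def __init__(self) -> None:
        self.generation: dict[str, int] = {}  # base name -> last generation
        self.blocked: set[str] = set()

    def register(self, base: str) -> str:
        # Each registration gets a fresh tag, e.g. "MGR.A.1", then "MGR.A.2".
        number = self.generation.get(base, 0) + 1
        self.generation[base] = number
        return f"{base}.{number}"

    def blocklist(self, tag: str) -> None:
        self.blocked.add(tag)

    def is_blocked(self, tag: str) -> bool:
        return tag in self.blocked
```

Note that this only works for a daemon that is expected to come back cleanly. For the kubernetes case discussed in this thread, a restarted CSI driver obtaining a fresh tag would defeat the fence, which is why the entity-name approach was preferred there.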

From a kubernetes standpoint, when a down cluster recovers, it would create newer mounts/maps, and the nonce may change because the service doing the mounting/mapping (the CSI drivers in this case) would have restarted. IOW, for kubernetes a restart of CSI does not mean the cluster is healthy and can start mounting/mapping the volumes again; it has to be reconciled to a state where all mounts are cleaned up, and only then unblocked for future actions. So, if I understand the nonce part of the suggestion correctly, it will not be usable in this case. The entity-name-based approach would serve better.

How do we take this forward? IOW, what are the next steps? Should this be discussed in the upcoming Dec 02 CDM meeting? (I am not sure how exactly to edit [1] to add this to the agenda, though.)

I added it to the agenda.

[1] CDM 02, Dec: https://tracker.ceph.com/projects/ceph/wiki/CDM_02-DEC-2020

-- Jason
