Propose KEP to transfer PVC between namespaces #643
I am not sold on the need for this yet, but a few comments as I read it:
kind: PersistentVolumeClaim
metadata:
  name: pvc-foo
  annotations:
This should not be an annotation. It should probably be a whole resource.
E.g.
- Assume you have a PVC object "Foo" in NS1
- Create a PVCTransfer object "Foo" in NS1 with ``sendTo: dev``
- Controller observes this, but waits for a receiver
- Create a PVCTransfer object "Bar" in NS2 with "recvFrom: prod/Foo"
- Controller observes this and does the transfer, deleting both PVCTransfer resources when done
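Purely as an illustration of that flow; the ``PVCTransfer`` kind, API group, and field names below are hypothetical, not an existing or proposed API:

```yaml
# Hypothetical objects only, to illustrate the send/receive flow described above.
# In NS1 ("prod"), offer PVC "foo" to namespace "dev":
apiVersion: transfer.example.io/v1alpha1
kind: PVCTransfer
metadata:
  name: foo
  namespace: prod
spec:
  persistentVolumeClaimName: foo
  sendTo: dev
---
# In NS2 ("dev"), accept the offer; the controller matches the pair, performs
# the transfer, and deletes both PVCTransfer objects when done:
apiVersion: transfer.example.io/v1alpha1
kind: PVCTransfer
metadata:
  name: bar
  namespace: dev
spec:
  recvFrom: prod/foo
```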
This would have to be seriously thought through and audited for attack vectors.
But why are PVCs special? What about snapshots? What about other non-storage resources?
@thockin thanks for looking at this!
> This should not be an annotation. It should probably be a whole resource.
I started with something along those lines, but I wasn't sure it would be acceptable to add a new "object" like that. The annotations aren't the best answer here, I agree; I was actually hoping to introduce a formal parameter to the PV and PVC objects instead. I'm happy to look at the other option though if it's preferred. The flow you describe aligns perfectly, by the way.
> This would have to be seriously thought through and audited for attack vectors.
Indeed, one of the reasons I used the PVC and PV objects by themselves was to try to minimize opening up new vectors; it might be worth it though.
> But why are PVCs special? What about snapshots? What about other non-storage resources?
Initially I proposed to the sig a generalized transfer resource like you described that could be used for any object, but the more I thought about it the less sense it seemed to make (IOW I talked myself out of it). The reason is that for things like Pods I didn't see a good use case (as opposed to just recreating the Pod). Sure, a Pod may have some heavy containers, but recreating data sets (like, say, a 200 GiB DB store) is a bit ugly. If there's a good use case, or if it's preferred to have consistency across objects on the system, I can agree with that.
As far as snapshots, I have a strong opinion about breaking the linkage between snapshots and volumes across namespaces. For many backend devices these are linked, and even worse, some link their snapshots to each other. Transferring a snapshot to a different namespace creates visibility issues for users around those linkages. Say, for example, my device uses a copy-on-write (COW) file or something similar for snapshots, and each consecutive snapshot is another COW file built from there. The entire chain is linked; if I transfer one snapshot in the chain to another namespace, the original namespace is now unable to delete any of its snapshots or volumes without the new user deleting theirs. Or the new user can't delete theirs if there were subsequent snapshots from the originator.
To get around that, it might be good to limit transfer of Snapshots to a flow like:
- Create Snapshot
- Create New PVC from Snapshot (now the PVC is its own independent object)
- Transfer PVC to new Namespace
That way there's no linkage, and the new user and old user can do anything they normally could without introducing some weird corner cases.
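As a concrete illustration of the middle step: restoring the snapshot into its own independent PVC can be expressed with the volume snapshot ``dataSource`` (alpha at the time of writing). The storage class, size, and names below are made up, and the final transfer step would use whatever mechanism this KEP settles on:

```yaml
# Step 2 from the list above: create a new, independent PVC from the snapshot.
# Names, size, and storage class are illustrative only.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-from-snap
  namespace: prod
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: csi-example
  resources:
    requests:
      storage: 200Gi
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: snap-foo
# Step 3 would then transfer pvc-from-snap to the target namespace, leaving the
# original snapshot chain untouched in the source namespace.
```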
With the creation of #642, is this still being worked on?
reviewers:
  - TBD
approvers:
  - TBD
Can you please add the SIG Storage chairs/tech leads as approvers?
Done
## Table of Contents

A table of contents is helpful for quickly jumping to sections of a KEP and for highlighting any additional information provided beyond the standard KEP template.
[Tools for generating][] a table of contents from markdown are available.
This generic blurb from the template can be removed.
Oops, thanks
## Motivation

There are a number of use cases where a user would like to have the ability to transfer an existing PVC from one namespace to another. This is a valuable workflow for persistent storage and enables the ability to easily duplicate and transfer data sets from one environment to another. These populated PVCs could be a clone of another volume, a volume from a snapshot, or data that was written to the volume via an application (i.e. a database).
Here you note "there are a number of use cases" ... can you please follow the format of the KEP template to share those user stories? Can you highlight the type of user (so we can understand them) and highlight the task they need to do?
This will help us to better understand and discuss the best way to handle the need.
Sorry, I'm not following what you'd like to see changed here. I do have 3 user stories included. Would you prefer I omit those details from the motivation section?
There are a number of use cases where a user would like to have the ability to transfer an existing PVC from one namespace to another. This is a valuable workflow for persistent storage and enables the ability to easily duplicate and transfer data sets from one environment to another. These populated PVCs could be a clone of another volume, a volume from a snapshot, or data that was written to the volume via an application (i.e. a database).

An example use case for this feature would be a cluster segmented into two namespaces: namespace-a for production staging, and namespace-b for production. There are cases where an application could be developed and tested with the same production data without risking modification or corruption of data in either environment. Rather than reproducing the data in both namespaces, it would be much more efficient to clone or restore the data from a snapshot into a volume and then transfer that new volume to the desired namespace.
While I appreciate this use case, I'm not sure this is a good behavior. For example, in this setup the staging environment is no longer available, so staging wasn't really staging; it was pre-launch prod. Why wouldn't this just be the production namespace in the first place?
Wouldn't a better approach be to copy the staging data, as a one-time task, from staging to production? Then, wouldn't you want periodic tasks that copy the production data back to staging for continued dev leveraging staging? This copying is far different from transferring.
I'm just thinking out loud, but this example does not sound like a good case to justify the behavior. Although, I might be missing something and am happy to hear about it.
Yeah,
> Wouldn't a better approach be to copy the staging data, as a one-time task, from staging to production? Then, wouldn't you want periodic tasks that copy the production data back to staging for continued dev leveraging staging? This copying is far different from transferring.
Sure, but that's a different workflow IMO (and I'd still prefer to clone and transfer the new PVC in that case anyway); maybe the wording isn't great on my part. The point here was that data isn't static in either case. Say, for example, you have a production env that is encountering errors; you try to test things in your test env/namespace, but it turns out the issue is dependent on data (which isn't uncommon). The scenario here provides a method whereby you can reproduce that data if needed.
I'd also argue that it's a more efficient and secure method of moving from staging to production as well: it provides a mechanism to easily and completely duplicate the staging env in another namespace (prod or otherwise) without manually moving data, which would require poor security settings (allowing both namespaces access to the raw data) and can be extremely time consuming. Given most storage backends have the ability to quickly and efficiently clone a volume, this is a much easier way to replicate and transfer that data between namespaces.
Thanks for explaining the workflow you are talking about.
If this is your workflow, would it be better to copy a snapshot from one namespace to another so it can be restored there? That way no PV/PVC is within the production namespace.
> Thanks for explaining the workflow you are talking about.
> If this is your workflow, would it be better to copy a snapshot from one namespace to another so it can be restored there? That way no PV/PVC is within the production namespace.
The problem with transferring snapshots is that it opens up a considerable amount of complexity. Many storage devices link their snapshots to their parent volumes, the result being that we would have cross-namespace dependencies on resources, which IMO is pretty ugly to manage and frankly not worthwhile. Creating a volume from a snapshot and then transferring it means that everything in an end user's namespace remains independent and under a single namespace's control.
Upon success, the ``pv.kubernetes.io/transfer-status`` annotation on the PV will be updated by the controller to ``complete``.

### User Stories [optional]
The user stories here do not match the title and description for the KEP. Can you please revise?
@mattfarina Sorry, what doesn't match? The title and purpose of the KEP is to enable transferring a PVC from one namespace to another. The user stories below are specific concrete cases: when cloning we create a new PVC, snapshots are restored by creating a new PVC from a Snapshot, and in the last case perhaps I have a namespace with users that have special tools for populating/generating data that I then want to "give" to another namespace/user to consume.
Let me know how I can make this better; I'm really not sure where the disconnect is currently.
Upon re-reading, I see where you are going with this.
The debug comment you had elsewhere does a good job of explaining the user story using a concrete example. Could you put that under the appropriate user story to add context?
#642 is a different problem; that just proposes adding existing PVCs as a valid dataSource option in the PVC spec. All of the same rules still apply in that case as far as namespace, quota, etc. This proposal mentions things like clone that ideally would be enabled by #642 but don't require it (out-of-band clones, etc.).
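For context, a rough sketch of the #642 idea (an existing PVC used as a ``dataSource`` for a new PVC in the same namespace; names, size, and storage class below are illustrative):

```yaml
# Same-namespace clone as proposed in #642; this does not cross namespaces.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-foo-clone
  namespace: prod
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: csi-example   # illustrative
  resources:
    requests:
      storage: 200Gi
  dataSource:
    kind: PersistentVolumeClaim   # core group, so no apiGroup needed
    name: pvc-foo                 # existing PVC in the same namespace
```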
@j-griffith Has this gone before SIG Storage? I don't see it in their agenda/meeting minutes or on their mailing list. If you have not, can you please start a conversation there... https://github.com/kubernetes/community/tree/master/sig-storage
Please remove any references to NEXT_KEP_NUMBER
and rename the KEP to just be the draft date and KEP title.
KEP numbers will be obsolete once #703 merges.
Propose enhancement to enable the transfer of a PVC from one namespace to another within the cluster.
/assign @saad-ali
Wanted to leave a use case perspective on this. It's not a support request. I am in the middle of converting volumes using the CephFS Provisioner to CSI-Ceph. As the copies of the filesystems are the backup, I wanted to make sure that I had a rollback path to the original configuration (thus maintaining integrity) at every step. By definition, I did not want to change a source PVC or PV unless absolutely necessary before the data was tested and working on the destination. When the source and destination namespaces were the same and I could edit the workload manifests to use the new PVC, this was pretty straightforward. It was harder when the workload was in a different namespace.

In the cross-namespace copy, I started by creating a duplicate PVC in another namespace with the idea that if the original PVC wasn't attached, the second PVC would be able to take the PV. That second PVC continually failed to bind. Where this might have worked is if the PVC specifier in the PV was mutable, or if it wasn't there at all. Bidirectional links in schemas are always troublesome and this is a good example. I assume there are reasons, just pointing out the challenges.

What I like about this KEP is that it solves the problem when the volume contents need to be kept intact. Transferring the volume between PVCs is a natural outcome. I think it would be helpful if there was an additional
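For reference, the bidirectional link being described is the PV's ``spec.claimRef``, which pins the PV to one specific claim. A minimal sketch of a bound PV (all names, sizes, and IDs below are made up) shows why a look-alike PVC in another namespace can't simply pick it up:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-cephfs-0001
spec:
  capacity:
    storage: 200Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  claimRef:                     # binds this PV to one exact PVC
    apiVersion: v1
    kind: PersistentVolumeClaim
    namespace: prod             # the original namespace
    name: pvc-foo
    uid: 7a19c6c7-0000-0000-0000-000000000000   # illustrative; binding is pinned to this UID
  cephfs:
    monitors:
      - 10.0.0.1:6789
    path: /volumes/pvc-foo
```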
I do think it would be useful to make this mechanism more generic than just PVC/PV. Because we have a set of objects that follow the PVC/PV/StorageClass model (VolumeSnapshot/VolumeSnapshotContent/VolumeSnapshotClass, and potentially many more in the future), I want this mechanism to be generic enough to be reusable across different objects with minimal work. On the other hand, I can see how each set of objects may require custom logic: how binding is done, how to determine if transfers are allowed (as in the snapshots case, where @j-griffith mentioned some storage backends may not support it). But we should at least strive to have a shared API for namespace transfer for different types, even if the implementation (controller) is not necessarily the same, to lower the cognitive overhead for users. As for annotation vs CRD: the problem with CRDs is if we have a new
I'm not as convinced that volumes aren't "special"; IMO it's the only really heavyweight object, and it's the only one that I know of (I could be wrong) that has any concerns around persistence. Most other resources are designed with the intent of being ephemeral, so that to me changes the expectations of what I would do with them. For example, transferring a pod doesn't seem overly useful to me; transferring the "data" associated with a pod, however, seems reasonable. In fact, that seems like the most difficult part of transferring a pod; everything else is just regenerated, even currently in the case of, say, a failed node. The problem is volumes don't fall into that paradigm: they're not destroyed/recreated in the same manner, and more importantly they can't be. The persistent data that lives on those PVs is what's valuable and what ideally I think we'd like to be able to share easily/quickly across namespaces.
I'm happy to continue investigating this sort of approach. We did start with this sort of idea but determined that it may not be necessary, and due to permissions and the sensitive nature of volumes (the data that resides on them), a volume-specific approach that keeps things as safe as possible seemed desirable. That being said, I'm more than happy to discuss and explore ideas here, and my opinion could likely be changed.
I think a CRD approach could work similar to what I have currently, but it would not be leveraged for the receiver side of things. Instead, I would propose the same sort of implementation that I have here currently, but instead of using the annotation (I agree annotations aren't the right answer here anyway), the CRD could be the signal for the transfer. Making this work was, IIRC, considerably more challenging (reliably syncing back to the PV controller when a claim was deleted), but it would certainly be possible.
I may have an idea to make this work with a Transfer CRD (that could then be implemented for other objects if desired). I'd still love to hear a use case for other objects, though, for what it's worth. It's obvious there may be something in mind that folks are working on; I would be interested to hear about it. The Transfer object could integrate nicely with the existing proposal on the receiver side if we add it as a valid dataSource, which seems like it might be a logical step (rough sketch below). This way we still do not:
The one thing I haven't quite sorted out yet is a way to signify to the PV controller not to delete the PV when the originator deletes their claim; this is where things get a little tricky with a transfer object for me. We could have the transfer CRD label or annotate the volume and still use the same sort of mechanism I have proposed in the KEP, but maybe there's a better approach. I'll think about this a bit more and start working up another approach; in the meantime, suggestions or ideas regarding the originator side are welcome. I'll get something worked up and also send it out to the mailing list for review; hopefully that will be a better medium for discussing this than the PR has been.
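A very rough, hypothetical sketch of that combination; the ``PVCTransfer`` kind, group, and fields are illustrative guesses rather than part of this KEP, and the receiver-side ``dataSource`` usage is the speculative extension of #642 mentioned above:

```yaml
# Hypothetical only. Originator ("prod") signals willingness to give the claim away:
apiVersion: transfer.example.io/v1alpha1
kind: PVCTransfer
metadata:
  name: give-pvc-foo
  namespace: prod
spec:
  persistentVolumeClaimName: pvc-foo
  targetNamespace: dev
---
# Receiver ("dev") accepts by creating a PVC whose dataSource names the transfer
# object; how that reference is resolved across namespaces is exactly the part
# the controller would have to define.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-foo
  namespace: dev
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi
  dataSource:
    apiGroup: transfer.example.io
    kind: PVCTransfer
    name: give-pvc-foo
```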
Could the CRD become the owner of the volume while it is mid-flight? This might avoid the peculiarities of the volume provider who thinks they gave away the volume when a tardy recipient simply hasn't accepted it yet. In this manner, the volume provider isn't tempted to delete the volume, thinking the volume has been given away. This is less important when the administrators are the same person.
Yeah, the problem is "where" does the volume go when it's deleted by the originator up until this point? I was thinking about introducing a new state for the PVC, "transferring", which would be the thing that makes it available to a recipient; the PV controller could then key in on that to know if it needs to delete or create the new claim reference. It's not much different from what I'm proposing now, except it provides a generic API (which seems to be a requirement) and it gets rid of using annotations for everything. I'll work some things through and get it out to folks or update here. Thanks for the feedback on this PR!
Yes, I see what you mean now. It becomes a bit like someone in an international airport that has lost their passport or had it revoked. It doesn't matter what the type of the object is that owns the PV, only that it's stateless (in the diplomatic sense of the word...)
One thought in the direction of both the generic transfer API and a process that seems intuitive:
I agree this would only be valuable for objects that represent data. But we're going to have more and more of those. One example is VolumeSnapshot/VolumeSnapshotContent/VolumeSnapshotClass. I understand your concerns that some storage systems won't support this. But some will. So I can see this being a driver capability. Another example: SIG Apps is working on a proposal for application-level snapshots following the same model as VolumeSnapshot above (ApplicationSnapshot/ApplicationSnapshotContent/ApplicationSnapshotClass). And being able to move those app-level snapshots across namespaces would be useful.
This would be difficult to enforce in a backwards-compatible way.
Non-namespaced transfer was proposed at some point but there was pushback. We want to allow app devs who are not cluster admins and only have permissions to their namespace the ability to transfer an object into their namespace, or approve the transfer of an object out of their namespace. So app devs with permissions for two namespaces should be able to work together to move an object across the namespaces without involving someone with cluster admin privileges.
I like that idea. I was a little hesitant to propose it because so far
So the transfer would have to be carried out by the PV/PVC controller, which would verify that the source PVC exists and is bound to a PV. If it is, it will unbind the source PVC from the PV. At this point, if the source PVC is deleted, no big deal, since it is no longer bound. It would then rebind the destination PVC to the existing PV. Lots of fun race conditions to think through with this, however. CC @jsafrane
I took a look at your proposal. How about we break the problem into 2 parts:
For step 1: We introduce two new API objects
For step 2: We have a controller for each type of object we want to transfer (e.g. PVC, VolumeSnapshots, AppSnapshot, etc.) -- this way the logic for how transfer happens is custom per object type. Proposed logic for PVC transfer controller: The existing PV/PVC controller can be modified to include new
Example User Journey
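To make the two-object idea above concrete, here is a hypothetical sketch (object names, group, and fields are illustrative guesses, not the API actually being proposed): the receiving namespace files a request, someone with permissions in the source namespace approves it, and the per-type controller then does the object-specific work.

```yaml
# Hypothetical request/approval pair; all names and fields are illustrative.
# 1) Created by the app dev in the destination namespace ("dev"):
apiVersion: transfer.example.io/v1alpha1
kind: TransferRequest
metadata:
  name: request-pvc-foo
  namespace: dev
spec:
  source:
    namespace: prod
    kind: PersistentVolumeClaim
    name: pvc-foo
---
# 2) Created by someone with permissions in the source namespace ("prod"):
apiVersion: transfer.example.io/v1alpha1
kind: TransferApproval
metadata:
  name: approve-pvc-foo
  namespace: prod
spec:
  requestRef:
    namespace: dev
    name: request-pvc-foo
```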
@saad-ali Breaking it down into two parts certainly makes sense (I believe we started there at one point early on). A couple of things regarding your suggested approach:
I tried a number of things here, and rebinding is extremely touchy in my experience. It leads to lost claims and inaccessible PVs. We may be able to come up with a way around this, but it also introduces a number of corner cases around resource quotas etc. Is there a compelling reason to avoid using the delete operation as the finalization of the process?
Avoiding the controller doing any create/delete operations on a claim is certainly best IMO. I suppose if we mark it as unusable, that does solve problems with contention or races that might occur during the transfer process, so that might work out fine. That "unusable" state would need to include some sort of reference that a transfer had taken place, but in general it seems fine if we don't want to tie it to deletion. FWIW I also like your suggestion that this is initialized via the Request on the destination side; that solves a concern regarding whether this would behave in a declarative manner or not.
Actually, this might not be a problem now given that the existing claim can stay there. I'll take another look with some of the new ideas you suggested. Meanwhile I'll start reworking the proposal and get an update out shortly.
The process of transferring a PVC/Volume is as follows:
1. Original user indicates they're willing to ``give`` the volume to another namespace
2. The receiving user indicates they'd like to ``accept`` the volume into their namespace
@j-griffith I would like to clarify the users defined in ``give`` and ``accept``. In short, who are these users? Isn't it all the users in a namespace? If all the users can trigger this operation, do we need to explore misuse of this functionality, and thus security issues? @saad-ali @liggitt thoughts?
If it's a "set" of users, how do we define that group?
## Motivation

There are a number of use cases where a user would like to have the ability to transfer an existing PVC from one namespace to another. This is a valuable workflow for persistent storage and enables the ability to easily duplicate and transfer data sets from one environment to another. These populated PVCs could be a clone of another volume, a volume from a snapshot, or data that was written to the volume via an application (i.e. a database).
@j-griffith once the PVC is successfully transferred, are we expecting that the source PVC object is wiped from the API server for the source namespace?
/remove-sig architecture pm
This KEP provides a starting point to propose and discuss the addition of an external (CRD) NameSpace Transfer API. This is a result from discussions in the VolumeNamespaceTransfer proposal: kubernetes#643
New API proposal here: #1112
Propose enhancement to enable the transfer of a PVC from one namespace to another within the cluster.