Propose KEP to transfer PVC between namespaces #643

Conversation

j-griffith (Contributor):

Propose enhancement to enable the transfer of a PVC from one namespace to another within the cluster.

@k8s-ci-robot added kind/kep, size/L, sig/architecture, sig/pm, sig/storage, and cncf-cla: yes labels on Dec 3, 2018
@thockin (Member) left a comment:

I am not sold on the need for this yet, but a few comments as I read it:

kind: PersistentVolumeClaim
metadata:
  name: pvc-foo
  annotations:
@thockin (Member):
This should not be an annotation. It should probably be a whole resource.

E.g.

  • Assume you have a PVC object "Foo" in NS1
  • Create a PVCTransfer object "Foo" with sendTo: dev
  • Controller observes this, but waits for a receiver
  • Create a PVCTransfer object "Bar" in NS2 with "recvFrom: prod/Foo"
  • Controller observes this and does the transfer, deleting both PVCTransfer resources when done
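
Expressed as YAML, the sketch above might look something like this (the kind, apiVersion, and field names are all illustrative, not an existing API; the namespaces are normalized to NS1/NS2):

apiVersion: transfers.example.io/v1alpha1
kind: PVCTransfer
metadata:
  name: foo
  namespace: NS1
spec:
  pvcName: foo       # hypothetical: the PVC being offered for transfer
  sendTo: NS2        # hypothetical: the namespace allowed to receive it
---
apiVersion: transfers.example.io/v1alpha1
kind: PVCTransfer
metadata:
  name: bar
  namespace: NS2
spec:
  recvFrom: NS1/foo  # hypothetical: source namespace/object to accept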

This would have to be seriously thought through and audited for attack vectors.

But why are PVCs special? What about snapshots? What about other non-storage resources?

@j-griffith (author):

@thockin thanks for looking at this!

This should not be an annotation. It should probably be a whole resource.

I started with something along those lines, but I wasn't sure it would be acceptable to add a new "object" like that. The annotations aren't the best answer here, I agree; I was actually hoping to introduce a formal parameter on the PV and PVC objects instead. I'm happy to look at the other option though if it's preferred. The flow you describe aligns perfectly, by the way.

This would have to be seriously thought through and audited for attack vectors.

Indeed; one of the reasons I used the PVC and PV objects by themselves was to try to minimize opening up new attack vectors. It might be worth it, though.

But why are PVCs special? What about snapshots? What about other non-storage resources?

Initially I proposed to the SIG a generalized transfer resource like you described that could be used for any object, but the more I thought about it, the less sense it seemed to make (IOW, I talked myself out of it). The reason being that for things like Pods I didn't think there was a good use case (as opposed to just recreating the Pod). Sure, a Pod may have some heavy containers, but recreating data sets (like, say, a 200 GiB DB store) is a bit ugly. If there's a good use case, or if it's preferred to have consistency across objects on the system, I can agree with that.

As far as snapshots, I have a strong opinion about breaking the linkage between snapshots and volumes across namespaces. For many backend devices these are linked, and even worse, some link their snapshots to each other. Transferring a snapshot to a different namespace creates visibility issues for users into those linkages. Say, for example, my device uses a COW (copy-on-write) file or something similar for snapshots, and each consecutive snapshot is another COW file built from there. The entire chain is linked; if I transfer one snapshot in the chain to another namespace, the original namespace is now unable to delete any of its snapshots or volumes without the new user deleting theirs. Or the new user can't delete theirs if there were subsequent snapshots from the originator.

To get around that, it might be good to limit transfer of Snapshots to a flow like:

  1. Create Snapshot
  2. Create New PVC from Snapshot (now the PVC is its own independent object)
  3. Transfer PVC to new Namespace

That way there's no linkage, and the new user and old user can do anything they normally could without introducing some weird corner cases.
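
For reference, step 2 is the existing restore-from-snapshot flow; a minimal example of a PVC created from a snapshot via dataSource (names and size are illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-pvc
spec:
  dataSource:
    name: my-snapshot                  # an existing VolumeSnapshot in this namespace
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi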

@mattfarina (Contributor) left a comment:

With the creation of #642, is this still being worked on?

reviewers:
- TBD
approvers:
- TBD
Contributor:

Can you please add the SIG Storage chairs/tech leads as approvers?

@j-griffith (author):

Done

## Table of Contents

A table of contents is helpful for quickly jumping to sections of a KEP and for highlighting any additional information provided beyond the standard KEP template.
[Tools for generating][] a table of contents from markdown are available.
Contributor:

This generic blurb from the template can be removed.

@j-griffith (author):

Oops, thanks


## Motivation

There are a number of use cases where a user would like to have the ability to transfer an existing PVC from one namespace to another. This is a valuable workflow for persistent storage and enables the ability to easily duplicate and transfer data sets from one environment to another. These populated PVCs could be a clone of another volume, a volume from a snapshot, or data that was written to the volume via an application (e.g., a database).
Contributor:

Here you note "there are a number of use cases" ... can you please follow the format of the KEP template to share those user stories? Can you highlight the type of user (so we can understand them) and highlight the task they need to do?

This will help us to better understand and discuss the best way to handle the need.

@j-griffith (author):

Sorry, I'm not following what you'd like to see changed here. I do have 3 user stories included here. Would you prefer I omitted any details from the motivation section here?


There are a number of use cases where a user would like to have the ability to transfer an existing PVC from one namespace to another. This is a valuable workflow for persistent storage and enables the ability to easily duplicate and transfer data sets from one environment to another. These populated PVCs could be a clone of another volume, a volume from a snapshot, or data that was written to the volume via an application (e.g., a database).

An example use case for this feature would be a cluster segmented into two namespaces: namespace-a for production staging, and namespace-b for production. There are cases where an application could be developed and tested with the same production data without risking any modification or corruption of data in either environment. Rather than reproducing the data in both namespaces, it would be much more efficient to be able to clone or restore the data from a snapshot into a volume and then transfer that new volume to the desired namespace.
Contributor:

While I appreciate this use case, I'm not sure this is a good behavior. For example, in this setup the staging environment is no longer available, so the staging wasn't really staging; it was pre-launch prod. Why wouldn't this just be the production namespace in the first place?

Wouldn't a better approach be to copy the staging data, as a one-time task, from staging to production? Then, wouldn't you want periodic tasks that copy the production data back to staging for continued dev leveraging staging? This copying is far different from transferring.

I'm just thinking out loud, but this example does not sound like a good case to justify the behavior. Although, I might be missing something and am happy to hear about it.

@j-griffith (author):

Yeah,

Wouldn't a better approach be to copy the staging data, as a one-time task, from staging to production? Then, wouldn't you want periodic tasks that copy the production data back to staging for continued dev leveraging staging? This copying is far different from transferring.

Sure, but that's a different workflow IMO (and I'd still prefer to clone and transfer the new PVC in that case anyway); maybe the wording isn't great on my part. The point here was that data isn't static in either case. Say, for example, you have a production env that is encountering errors; you try to test things in your test env/namespace, but it turns out that the issue is dependent upon data (which isn't uncommon). The scenario here allows a method whereby you can reproduce data if needed.

I'd also argue that it's a more efficient and secure method of moving from staging to production as well: it provides a mechanism to easily and completely duplicate the staging env in another namespace (prod or otherwise) without manually moving data, which would require poor security settings (allowing both namespaces access to the raw data) and can be extremely time consuming. Given that most storage backends have the ability to quickly and efficiently clone a volume, this is a much easier way to replicate and transfer that data between namespaces.

Contributor:

Thanks for explaining the workflow you are talking about.

If this is your workflow, would it be better to copy a snapshot from one namespace to another so it can be restored there? That way no PV/PVC is within the production namespace.

@j-griffith (author):

Thanks for explaining the workflow you are talking about.

If this is your workflow, would it be better to copy a snapshot from one namespace to another so it can be restored there? That way no PV/PVC is within the production namespace.

The problem with transferring snapshots is that it opens up a considerable amount of complexity. Many storage devices link their snapshots to their parent volumes, the result being that we would have cross-namespace dependencies on resources, which IMO is pretty ugly to manage and frankly not worthwhile. Creating a volume from a snapshot and then transferring it means that everything in an end user's namespace remains independent and under a single namespace's control.


Upon success, the ``pv.kubernetes.io/transfer-status`` annotation on the PV will be updated by the controller to ``complete``.

### User Stories [optional]
@mattfarina (Contributor):

The user stories here do not match the title and description for the KEP. Can you please revise?

@j-griffith (author):

@mattfarina Sorry, what doesn't match? The title and purpose of the KEP is to enable transferring a PVC from one namespace to another. The user stories below are specific concrete cases: when cloning, we create a new PVC; snapshots are restored by creating a new PVC from a Snapshot; and in the last case, perhaps I have a namespace with users that have special tools for populating/generating data that I then want to "give" to another namespace/user to consume.

Let me know how I can make this better; I'm really not sure where the disconnect is currently.

@mattfarina (Contributor):

Upon re-reading, I see where you are going with this.

The debug comment you had elsewhere does a good job explaining the user story using a concrete example. Could you put that under the appropriate user story to add context?

@j-griffith force-pushed the propose_pvc_namespace_transfer branch from c117f6c to ddbd1df on January 7, 2019 18:15
@j-griffith (author):

@mattfarina

With the creation of #642, is this still being worked on?

#642 is a different problem; it just proposes adding existing PVCs as a valid dataSource option in the PVC spec. All of the same rules still apply in that case as far as namespace, quota, etc. This proposal mentions things like clones that ideally would be enabled by #642 but don't require it (out-of-band clones, etc.).
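
For context, a rough sketch of the shape #642 proposes (a same-namespace clone via dataSource; names are illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cloned-pvc
spec:
  dataSource:
    name: existing-pvc           # source PVC in the same namespace
    kind: PersistentVolumeClaim
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi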

@mattfarina (Contributor):

@j-griffith Has this gone before SIG Storage? I don't see it in their agenda/meeting minutes or on their mailing list. If you have not, can you please start a conversation there... https://github.com/kubernetes/community/tree/master/sig-storage

@justaugustus (Member) left a comment:

Please remove any references to NEXT_KEP_NUMBER and rename the KEP to just be the draft date and KEP title.
KEP numbers will be obsolete once #703 merges.

@j-griffith force-pushed the propose_pvc_namespace_transfer branch from ddbd1df to 54021fd on January 25, 2019 23:59
@k8s-ci-robot added the do-not-merge/blocked-paths label on Jan 25, 2019
Propose enhancement to enable the transfer of a PVC from one namespace to another within the cluster.
@j-griffith force-pushed the propose_pvc_namespace_transfer branch from 54021fd to 3e50f90 on January 26, 2019 01:24
@k8s-ci-robot removed the do-not-merge/blocked-paths label on Jan 26, 2019
@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: j-griffith
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: saad-ali

If they are not already assigned, you can assign the PR to them by writing /assign @saad-ali in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@j-griffith (author):

/assign @saad-ali

@briantopping:
Wanted to leave a use case perspective on this. It's not a support request. I am in the middle of converting volumes using CephFS Provisioner to CSI-Ceph. As the copies of the filesystems are the backup, I wanted to make sure that I had a rollback path to the original configuration (thus maintaining integrity) at every step. By definition, I did not want to change a source PVC or PV unless absolutely necessary before the data was tested and working on the destination. When the source and destination namespaces were the same and I could edit the workload manifests to use the new PVC, this was pretty straightforward. It was harder when the workload was either in a StatefulSet or a Helm chart that needed to be upgradable (due to the sensitivity of the manifests to names). As I found, it was practically impossible to complete across namespaces without resorting to a copy at the node level, which could have unexpected metadata transfer issues that might not be caught until after the source volume was gone. In every case, my intent was to mount the source and destination volumes on a throwaway pod and copy with find | cpio.

In the cross-namespace copy, I started by creating a duplicate PVC in another namespace with the idea that if the original PVC wasn't attached, the second PVC would be able to take the PV. When that second PVC continually found itself Lost, I took a deep breath, made a copy of the original PVC, and deleted the original. But the PV has an immutable reference to the PVC, and creating an API object will not allow its UID to be specified, so I came to realize I was stuck. In trying to restore my rollback-ready state, I tried to delete the second PVC and re-add the saved copy of the first, but ran back into the UID problem.

Where this might have worked is if the PVC specifier in the PV were mutable, or if it weren't there at all. Bidirectional links in schemas are always troublesome, and this is a good example. I assume there are reasons; just pointing out the challenges.

What I like about this KEP is that it solves the problem when the volume contents need to be kept intact. Transferring the volume between PVCs is a natural outcome. I think it would be helpful if there were an additional action annotation with values of Move and Copy, as sketched below. That would solve the situation I am in, where the provisioners of the source and destination were different. I'm not clear whether this should be in-tree or at the provisioner level, but it seems safe to have that functionality in-tree since it doesn't change between provisioners.
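
To make that concrete, a sketch of such an action annotation layered on the KEP's annotation-based approach (both keys and values here are hypothetical, not part of the proposal as written):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-foo
  annotations:
    pv.kubernetes.io/transfer-action: Copy         # hypothetical: Move or Copy
    pv.kubernetes.io/transfer-to: other-namespace  # hypothetical destination namespace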

@saad-ali (Member) commented Apr 29, 2019:

I do think it would be useful to make this mechanism more generic than just PVC/PV.

Because we have a set of objects that follow the PVC/PV/StorageClass model (VolumeSnapshot/VolumeSnapshotContent/VolumeSnapshotClass, and potentially many more in the future), I want this mechanism to be generic enough to be reusable across different objects with minimal work.

On the other hand, I can see how each set of objects may require custom logic: how binding is done, how to determine if transfers are allowed (like in the snapshots case @j-griffith mentioned some storage backends may not support it).

But we should at least strive to have a shared API for namespace transfer for different types, even if the implementation (controller) is not necessarily the same, to lower the cognitive overhead for users.

As for annotation vs. CRD: the problem with CRDs is, if we have a new NamespaceTransfer object, will it be used just by the source namespace to approve a transfer, or also by the receiver namespace to request a transfer? If it will also be used by the receiver, does that mean the transfer controller will need to provision a new (e.g. PVC) object in the receiver namespace?

@j-griffith (author):

I do think it would be useful to make this mechanism more generic than just PVC/PV.

Because we have a set of objects that follow the PVC/PV/StorageClass model (VolumeSnapshot/VolumeSnapshotContent/VolumeSnapshotClass, and potentially many more in the future), I want this mechanism to be generic enough to be reusable across different objects with minimal work.

I'm not as convinced that volumes aren't "special"; IMO it's the only really heavyweight object, and it's the only one that I know of (I could be wrong) that has any concerns around persistence. Most other resources are designed with the intent of being ephemeral, so that to me changes some of the expectations of what I would do with them. For example, transferring a pod doesn't seem overly useful to me; transferring the "data" associated with a pod, however, seems reasonable. In fact, it seems like the most difficult part of transferring a pod, since everything else is just regenerated (even currently, in the case of a failed node). The problem is volumes don't fall into that paradigm: they're not destroyed/recreated in the same manner; more importantly, they can't be. The persistent data that lives on those PVs is what's valuable and what I think we'd ideally like to be able to share easily/quickly across namespaces.

On the other hand, I can see how each set of objects may require custom logic: how binding is done, how to determine if transfers are allowed (like in the snapshots case @j-griffith mentioned some storage backends may not support it).

But we should at least strive to have a shared API for namespace transfer for different types, even if the implementation (controller) is not necessarily the same, to lower the cognitive overhead for users.

I'm happy to continue investigating this sort of approach. We did start with this sort of idea but determined that it might not be necessary, and that due to the permissions and the sensitive nature of volumes (the data that resides on them), taking a volume-specific approach and keeping things as safe as possible was desirable. That being said, I'm more than happy to discuss and explore ideas here, and my opinion could likely be changed.

As for annotation vs. CRD: the problem with CRDs is, if we have a new NamespaceTransfer object, will it be used just by the source namespace to approve a transfer, or also by the receiver namespace to request a transfer? If it will also be used by the receiver, does that mean the transfer controller will need to provision a new (e.g. PVC) object in the receiver namespace?

I think a CRD approach could work similar to what I have currently, but it would not be leveraged for the receiver side of things. Instead I would propose the same sort of implementation that I have here currently, but instead of using the annotation (I agree annotations aren't the right answer here anyway), the CRD could be the signal for the transfer. Making this work was, IIRC, considerably more challenging (reliably syncing back to the PV controller when a claim was deleted), but it would certainly be possible.

@j-griffith (author) commented Apr 30, 2019:

I may have an idea to make this work with a Transfer CRD (that could then be implemented for other objects if desired). I'd still love to hear a use case for other objects, though, for what it's worth. It's obvious there may be something in mind that folks are working on; I'd be interested to hear about it.

The Transfer object could integrate nicely with the existing proposal on the receiver side if we add it as a valid dataSource, which seems like it might be a logical step. This way we still do not:

  1. dump volumes in a ns that doesn't want them
  2. get away from attributes (mostly, still some details I haven't figured out)
  3. enforce receiver namespace rules around things like quotas, sc access etc

The one thing I haven't quite sorted out yet is a way to signify to the PV controller to not delete the PV when the originator deletes their claim; this is where things get a little tricky with a transfer object for me. We could maybe have the transfer CRD label or annotate the volume and still use the same sort of mechanism I have proposed in the KEP, but maybe there's a better approach.

I'll think about this a bit more and start working up another approach; in the meantime, suggestions or ideas regarding the originator side are welcome. I'll get something worked up and also send it out to the mailing list for review; hopefully that will be a better medium for discussing this than the PR has been.

@briantopping:
The one thing I haven't quite sorted out yet is a way to signify to the PV controller to not delete the PV when the originator deletes their claim; this is where things get a little tricky with a transfer object for me.

Could the CRD become the owner of the volume while it is mid-flight? This might avoid the peculiarities of the volume provider who thinks they gave away the volume when a tardy recipient simply hasn't accepted it yet. In this manner, the volume provider isn't tempted to delete the volume, thinking the volume has been given away. This is less important when the administrators are the same person.

@j-griffith (author):

The one thing I haven't quite sorted out yet is a way to signify to the PV controller to not delete the PV when the originator deletes their claim; this is where things get a little tricky with a transfer object for me.

Could the CRD become the owner of the volume while it is mid-flight? This might avoid the peculiarities of the volume provider who thinks they gave away the volume when a tardy recipient simply hasn't accepted it yet. In this manner, the volume provider isn't tempted to delete the volume, thinking the volume has been given away. This is less important when the administrators are the same person.

Yeah, the problem is "where" does the volume go when it's deleted by the originator up until this point? I was thinking about introducing a new state for the PVC, "transferring", which would be the thing that makes it available to a recipient; the PV controller could then key in on that to know if it needs to delete or create the new claim reference. It's not much different than what I'm proposing now except it provides a generic API (which seems to be a requirement) and it gets rid of using annotations for everything.

I'll work some things through and get it out to folks or update here. Thanks for the feedback on this PR!

@briantopping commented Apr 30, 2019:

Yeah, the problem is "where" does the volume go when it's deleted by the originator up until this point? I was thinking about introducing a new state for the PVC, "transferring", which would be the thing that makes it available to a recipient; the PV controller could then key in on that to know if it needs to delete or create the new claim reference.

Yes, I see what you mean now. It becomes a bit like someone in an international airport who has lost their passport or had it revoked. It doesn't matter what the type of the object is that owns the PV, only that it's stateless (in the diplomatic sense of the word...)

It's not much different than what I'm proposing now except it provides a generic API (which seems to be a requirement) and it gets rid of using annotations for everything.

One thought in the direction of both the generic transfer API and a process that seems intuitive:

  • A number of objects throughout the API (pods, etc) could be considered "transferable" in that they can hold a link to a Transfer object that would be visible from kubectl describe. Any object with a link to a Transfer object is ineligible for new usage or attachments, although existing usage or attachments would not be immediately terminated.
  • Transfer objects are not namespaced so they can be more easily enumerated both by the originator and the recipient with their respective visibilities. To this end, the originator and recipients are bona fide users with RBAC permissions to effect transfers (even if the source and destination are the same user or the admin user).
  • A Transfer object that has not been accepted is still "owned" by the originator and can be deleted as such, cancelling the transfer. Once accepted, the originator can no longer recall the transfer.
  • The kubectl command to accept a transfer provides a destination namespace, removing a requirement that the originator knows where it is going.
  • The kubectl process accepting the Transfer is synchronous: if the object being transferred is in use, kubectl hangs until the object is released (similar to other processes like deleting a PVC). Two parties could optimize for downtime with the recipient accepting the transfer of an in-use object, with kubectl blocking as expected. When the originator releases the object, the transfer takes effect immediately, unblocking kubectl and allowing the recipient to move forward as quickly as possible.
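
A rough sketch of what such a cluster-scoped Transfer object might look like (everything here is hypothetical and only illustrates the bullets above):

apiVersion: transfers.example.io/v1alpha1
kind: Transfer
metadata:
  name: transfer-pvc-foo       # cluster-scoped, so no namespace field
spec:
  source:
    kind: PersistentVolumeClaim
    namespace: source-namespace
    name: pvc-foo
status:
  phase: Pending               # hypothetical: remains Pending until a recipient accepts
  acceptedInto: ""             # hypothetical: destination namespace, set on acceptance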

@saad-ali (Member):

I'd still love to hear a use case for other objects, though, for what it's worth. It's obvious there may be something in mind that folks are working on; I'd be interested to hear about it.

I agree this would only be valuable for objects that represent data.

But we're going to have more and more of those. One example is VolumeSnapshot/VolumeSnapshotContent/VolumeSnapshotClass. I understand your concerns that some storage systems won't support this. But some will. So I can see this being a driver capability.

Another example, SIG Apps is working on a proposal for Application level snapshots following the same model as VolumeSnapshot above (ApplicationSnapshot/ApplicationSnapshotContent/ApplicationSnapshotClass). And being able to move those app level snapshots across namespaces would be useful.

Any object with a link to a Transfer object is ineligible for new usage or attachments, although existing usage or attachments would not be immediately terminated.

This would be difficult to enforce in a backwards-compatible way.

Transfer objects are not namespaced so they can be more easily enumerated both by the originator and the recipient with their respective visibilities. To this end, the originator and recipients are bona fide users with RBAC permissions to effect transfers (even if the source and destination are the same user or the admin user).

Non-namespaced transfer was proposed at some point but there was pushback. We want to allow app devs who are not cluster admins and only have permissions to their namespace the ability to transfer an object into their namespace, or approve the transfer of an object out of their namespace. So app devs with permissions for two namespaces should be able to work together to move an object across the namespaces without involving someone with cluster admin privileges.

The Transfer object could integrate nicely with the existing proposal on the receiver side if we add it as a valid dataSource, which seems like it might be a logical step.

I like that idea. I was a little hesitant to propose it because so far DataSource has implied "provision a new volume with data pre-populated", and with a NamespaceTransfer as a DataSource it would now also mean "steal an existing volume from another namespace". But I think I'm ok with that overloaded meaning since NamespaceTransfer sounds pretty explicit about what it does.

The one thing I haven't quite sorted out yet is a way to signify to the PV controller to not delete the PV when the originator deletes their claim; this is where things get a little tricky with a transfer object for me. We could maybe have the transfer CRD label or annotate the volume and still use the same sort of mechanism I have proposed in the KEP, but maybe there's a better approach.

So the transfer would have to be carried out by the PV/PVC controller, which would verify that the source PVC exists and is bound to a PV. If it is, it will unbind the source PVC from the PV. At this point, if the source PVC is deleted, no big deal, since it is no longer bound. It would then rebind the destination PVC to the existing PV. Lots of fun race conditions to think through with this, however. CC @jsafrane

@saad-ali (Member) commented May 1, 2019:

I took a look at your proposal.

How about we break the problem into two parts:

  1. Approve namespace transfer.
  2. Do namespace transfer.

For step 1:

We introduce two new API objects

  1. NamespaceTransferRequest - Requests the transfer of the specified object from the source namespace into the same namespace as this object.

apiVersion: v1alpha1
kind: NamespaceTransferRequest
metadata:
    name: pvc-transfer-request
    namespace: destination-namespace
spec:
    source:
        namespace: source-namespace
        name: source-pvc
        kind: PersistentVolumeClaim
  2. NamespaceTransferApproval - Authorizes the transfer of the specified object to the specified namespace.

apiVersion: v1alpha1
kind: NamespaceTransferApproval
metadata:
    name: pvc-transfer-approval
    namespace: source-namespace
spec:
    source:
        name: source-pvc
        kind: PersistentVolumeClaim
    targetNamespace: destination-namespace

For step 2:

We have a controller for each type of object we want to transfer (e.g. PVC, VolumeSnapshots, AppSnapshot, etc.) -- this way the logic for how transfer happens is custom per object type.

Proposed logic for PVC transfer controller:

The existing PV/PVC controller can be modified to include new pvc-transfer-controller logic. It will be responsible for transferring PVCs across namespaces. It will do the following:

  1. Monitor NamespaceTransferApproval, NamespaceTransferRequest, and PVC objects.
  2. Wait for a destination PVC to be created where PVC.DataSource points to a NamespaceTransferRequest (LocalObjectReference).
  3. Wait for that NamespaceTransferRequest and a matching NamespaceTransferApproval object to exist. Matching means the request.source and approval.source match, the request.namespace and approval.targetNamespace match, and the request.source.kind is of type PVC (this logic could be put in a library so different transfer controllers can all reuse it).
  4. Wait for source PVC to have no pods referencing it.
  5. Do additional validation (e.g. maybe StorageClass has a new AllowTransfer that must be set to true).
  6. Update the PVC object to indicate it is no longer "available" to use.
  7. Update NamespaceTransferRequest.status to indicate that transfer has started.
  8. Initiate rebind such that PV is unbound from source PVC and rebound to destination PVC (devil will be in the details here on how to do this safely).
  9. Leave it up to the user to clean up the (now unbound) source PVC (the transfer controller never creates or deletes objects; it just modifies the objects that users create).
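
For step 5, a sketch of the hypothetical StorageClass gate mentioned there (allowTransfer is not an existing StorageClass field):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: transferable-fast
provisioner: example.com/csi-driver  # illustrative provisioner
allowTransfer: true                  # hypothetical: transfers are rejected unless true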

Example User Journey

  1. Create a NamespaceTransferRequest in target namespace to request transfer.
  2. Create a NamespaceTransferApproval in source namespace to approve transfer.
  3. Create a new PVC object with DataSource pointing to the NamespaceTransferRequest created in step 1.
  4. Wait for NamespaceTransferRequest.status to indicate transfer is complete.
  5. Delete the (now unbound) PVC in the source namespace (or leave it there, up to you).
  6. Use the (now bound) PVC in the destination namespace.
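
A sketch of the destination PVC from step 3 of this journey (using NamespaceTransferRequest as a dataSource kind is what this comment proposes, not an existing API; names and size are illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: transferred-pvc
  namespace: destination-namespace
spec:
  dataSource:
    name: pvc-transfer-request   # the NamespaceTransferRequest from step 1
    kind: NamespaceTransferRequest
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi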

@j-griffith (author) commented May 2, 2019:

@saad-ali Breaking it down into two certainly makes sense (I believe we started there at one point early on). A couple of things regarding your suggested approach:

  6. Update the PVC object to indicate it is no longer "available" to use.
  7. Update NamespaceTransferRequest.status to indicate that transfer has started.
  8. Initiate rebind such that PV is unbound from source PVC and rebound to destination PVC (devil will be in the details here on how to do this safely).

I tried a number of things here, and rebinding is extremely touchy in my experience. It leads to lost claims and inaccessible PVs. We may be able to come up with a way around this, but it also introduces a number of corner cases around resource quotas, etc. Is there a compelling reason to avoid using the delete operation as the finalization of the process?

  9. Leave it up to the user to clean up the (now unbound) source PVC (the transfer controller never creates or deletes objects; it just modifies the objects that users create).

Avoiding the controller doing any create/delete operations on a claim is certainly best IMO. I suppose if we mark it as unusable, it does solve problems with contingency or races that might occur during the transfer process, so that might work out fine.

That "unusable" state would need to include some sort of reference that a transfer had taken place but in general seems fine if we don't want to tie it to deletion.

FWIW I also like your suggestion that this is initialized via the Request on the destination side; that solves a concern regarding whether this would behave in a declarative manner or not.

@j-griffith (author):

I tried a number of things here, and rebinding is extremely touchy in my experience. It leads to lost claims and inaccessible PVs. We may be able to come up with a way around this, but it also introduces a number of corner cases around resource quotas, etc. Is there a compelling reason to avoid using the delete operation as the finalization of the process?

Actually, this might not be a problem now given that the existing claim can stay there. I'll take another look with some of the new ideas you suggested and update. Meanwhile I'll start reworking the proposal and get an update shortly.


The process of transferring a PVC/Volume is as follows:
1. Original user indicates they're willing to ``give`` the volume to another namespace
2. The receiving user indicates they'd like to ``accept`` the volume into their namespace
Contributor:

@j-griffith I would like to clarify the users defined in give and accept. In short, who are these users? Isn't it all the users in a namespace? If all the users can trigger this operation, do we see/explore misuse of this functionality, and thus security issues? @saad-ali @liggitt thoughts?

Contributor:

If it's a "set" of users, how do we define that group?

## Motivation

There are a number of use cases where a user would like to have the ability to transfer an existing PVC from one namespace to another. This is a valuable workflow for persistent storage and enables the ability to easily duplicate and transfer data sets from one environment to another. These populated PVCs could be a clone of another volume, a volume from a snapshot, or data that was written to the volume via an application (e.g., a database).

Contributor:

@j-griffith once the PVC is successfully transferred, are we expecting that the source PVC object is wiped from the API server for the source namespace?

@justaugustus (Member):

/remove-sig architecture pm

@k8s-ci-robot removed the sig/architecture and sig/pm labels on May 26, 2019
j-griffith added a commit to j-griffith/enhancements that referenced this pull request Jun 24, 2019
This KEP provides a starting point to propose and discuss the addition
of an external (CRD) NameSpace Transfer API.

This is a result from discussions in the VolumeNamespaceTransfer
proposal: kubernetes#643
@j-griffith (author):

New API proposal here: #1112
