design-proposal: VM delete protection #363

jcanocan · 2024-12-04T10:36:54Z

What this PR does / why we need it:

The proposal introduces a mechanism to protect VMs from being deleted. By adding the annotation kubevirt.io/vm-delete-protection and with a value of true or True to the VM object, the VM cannot be deleted by any means. This would prevent accidental deletions in VMs that users considers crucial for their applications/clusters.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes # CNV-48696

Special notes for your reviewer:

Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Design: A design document was considered and is present (link) or not required
PR: The PR description is expressive enough and will help future contributors
Code: Write code that humans can understand and Keep it simple
Refactor: You have left the code cleaner than you found it (Boy Scout Rule)
Upgrade: Impact of this change on upgrade flows was considered and addressed if required
Testing: New code requires new unit tests. New features and bug fixes require at least on e2e test
Documentation: A user-guide update was considered and is present (link) or not required. You want a user-guide update if it's a user facing feature / API change.
Community: Announcement to kubevirt-dev was considered

Release note:

None

kubevirt-bot · 2024-12-04T10:36:57Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign davidvossel for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

0xFelix · 2024-12-04T10:46:08Z

design-proposals/vm-delete-protection.md

+Differences between options are negligible in any scenario analyzed. Moreover, **no performance degradation has been
+observed.**
+
+Given the pros and cons, **OPTION \#1 is preferred in this case.**


Will we go with this option then?

IMHO, yes, we should go with this option.

geetikakay · 2024-12-04T12:25:45Z

design-proposals/vm-delete-protection.md

+The VM protection is implemented by adding
+a  [ValidationAdmissionPolicy and ValidationAdmissionPolicyBinding](https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/)
+objects. On every delete action against a VM object, the ValidatingAdmissionPolicy will check for presence of the
+annotation. If it is present and its value equals “true” or “True” the request will be rejected. The


Would it be useful to explain its behavior in scenarios where it is misconfigured?

It might create a really verbose and noisy outputs IMO. Documentation should cover how to properly configure the feature, and it should make clear that other configurations/values, will allow the deletion.

vladikr · 2024-12-04T14:33:50Z

/cc

jcanocan · 2024-12-05T09:45:02Z

v2:

Included Section about Finalizers.

jean-edouard

While I'm not against this (as long as option 1 is used), I don't quite see the point.
Are we trying to protect PVCs that would have the VM as a parent? In that case, I think we should take steps to protect the PVCs directly instead.
The VM spec itself really isn't all that precious, and would cost just a few kilobytes of storage to backup somewhere if it really did contain special data.

This proposal sounds like there are interesting field stories associated with it, if so could you maybe share a concrete example of what happened?

codingben · 2024-12-08T11:00:26Z

While I'm not against this (as long as option 1 is used), I don't quite see the point. Are we trying to protect PVCs that would have the VM as a parent?

No. We're trying to protect VMs from deletion by accident, e.g. by sysadmins that may run business critical applications or services (e.g. SQL server).

design-proposals/vm-delete-protection.md

jean-edouard · 2024-12-09T13:57:06Z

While I'm not against this (as long as option 1 is used), I don't quite see the point. Are we trying to protect PVCs that would have the VM as a parent?

No. We're trying to protect VMs from deletion by accident, e.g. by sysadmins that may run business critical applications or services (e.g. SQL server).

Sure but what do you mean by "VMs"? What, if not disks, are we afraid to lose?
From your answer, it sounds like we're more trying to protect against VM shutdown rather than VM deletion, maybe that's what the annotation should be about?

jcanocan · 2024-12-09T14:47:32Z

While I'm not against this (as long as option 1 is used), I don't quite see the point. Are we trying to protect PVCs that would have the VM as a parent?

No. We're trying to protect VMs from deletion by accident, e.g. by sysadmins that may run business critical applications or services (e.g. SQL server).

Sure but what do you mean by "VMs"? What, if not disks, are we afraid to lose? From your answer, it sounds like we're more trying to protect against VM shutdown rather than VM deletion, maybe that's what the annotation should be about?

As stated in the design document, what we are trying to protect is the VirtualMachine object from deletion.
Please note that this is a costumer request: https://issues.redhat.com/browse/CNV-46398?focusedId=25412651&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-25412651
So, what we would like to achieve is a deletion protection for VM objects similar to RHV or other cloud providers have.
While I think your point about protecting VM from shutting down is valid, users look more concerned about accidental deletions.

jean-edouard · 2024-12-09T15:03:17Z

While I'm not against this (as long as option 1 is used), I don't quite see the point. Are we trying to protect PVCs that would have the VM as a parent?

No. We're trying to protect VMs from deletion by accident, e.g. by sysadmins that may run business critical applications or services (e.g. SQL server).

Sure but what do you mean by "VMs"? What, if not disks, are we afraid to lose? From your answer, it sounds like we're more trying to protect against VM shutdown rather than VM deletion, maybe that's what the annotation should be about?

As stated in the design document, what we are trying to protect is the VirtualMachine object from deletion. Please note that this is a costumer request: https://issues.redhat.com/browse/CNV-46398?focusedId=25412651&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-25412651 So, what we would like to achieve is a deletion protection for VM objects similar to RHV or other cloud providers have. While I think your point about protecting VM from shutting down is valid, users look more concerned about accidental deletions.

Thank you for the pointer. Please note that access to the comment you linked to is restricted. The important part of it is that Having some form of delete protection would provide peace of mind against sysadmin accidents.
With great power comes great responsibility... No matter what we do, kubectl delete -f will bypass it. We're also talking about protecting something that can easily be re-created (a VM spec).
That being said, I don't feel strongly against this, as the maintenance burden seem minimal.
@vladikr WDYT?

jcanocan · 2024-12-09T15:19:43Z

While I'm not against this (as long as option 1 is used), I don't quite see the point. Are we trying to protect PVCs that would have the VM as a parent?

No. We're trying to protect VMs from deletion by accident, e.g. by sysadmins that may run business critical applications or services (e.g. SQL server).

Sure but what do you mean by "VMs"? What, if not disks, are we afraid to lose? From your answer, it sounds like we're more trying to protect against VM shutdown rather than VM deletion, maybe that's what the annotation should be about?

As stated in the design document, what we are trying to protect is the VirtualMachine object from deletion. Please note that this is a costumer request: https://issues.redhat.com/browse/CNV-46398?focusedId=25412651&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-25412651 So, what we would like to achieve is a deletion protection for VM objects similar to RHV or other cloud providers have. While I think your point about protecting VM from shutting down is valid, users look more concerned about accidental deletions.

Thank you for the pointer. Please note that access to the comment you linked to is restricted. The important part of it is that Having some form of delete protection would provide peace of mind against sysadmin accidents.

Yes, this is a good summary. Thanks!

With great power comes great responsibility... No matter what we do, kubectl delete -f will bypass it.

Just one note, both approaches mentioned in this design document does not allow the force deletion either. You could check it in the PoCs linked.

We're also talking about protecting something that can easily be re-created (a VM spec). That being said, I don't feel strongly against this, as the maintenance burden seem minimal. @vladikr WDYT?

I agree, but from the user point of view, they will have the guarantee that the VM will not be deleted no matter what.

stu-gott

Overall I don't see why this is needed. I've seen reference to tekton pipelines as a justification, but I don't understand what the real problem is there.

stu-gott · 2024-12-09T15:22:01Z

design-proposals/vm-delete-protection.md

+
+### Enable the VM Object Protection
+
+To protect VMs against deletion, they need to be annotated with  `kubevirt.io/vm-delete-protection` set to `true` or


Personally, I'd much rather have a first-class setting in the VM. Documentation of annotations tends to be fragmented and doesn't lead to a very good user experience. Making a dedicated setting in the spec means it will be properly documented in swagger.

Many thanks for your feedback, much appreciated.

While I'm not against this idea, I can see potential challenges in the future. IMHO, a good documentation entry should be enough. The API field vs an annotation has the following advantages:

Low maintenance, no need to maintain a new API field.

Extensible, it would be just a matter of adding new resourcesRules to the existing VAP if we would like to protect other KubeVirt objects. Creating a dedicated API field will force us to complicate the VAP and even to move the API setting from different places (creating room for potential backwards compatibility challenges).

jcanocan · 2024-12-10T08:08:30Z

Overall I don't see why this is needed. I've seen reference to tekton pipelines as a justification, but I don't understand what the real problem is there.

IMHO, Jed summed it up really well here: #363 (comment)
Looks like users accidentally delete VMs more that they would like to. Therefore, they are missing some delete protection in the VM objects. Most cloud providers have this option, e.g., https://cloud.google.com/compute/docs/instances/preventing-accidental-vm-deletion

lyarwood

Might be an invalid use case but what about using status and a subresource API to control this behaviour to allow finer grain RBAC control?

lyarwood · 2024-12-10T11:10:03Z

design-proposals/vm-delete-protection.md

+
+## User Stories
+
+As a user who has the permissions to delete VMs, I want to protect my VMs from accidental deletion.


I'd simplify this and just say As a VM Owner, the fact the owner has permissions to delete the VM isn't relevant to the fact that they want to protect it from accidental deletion by them or a third party.

lyarwood · 2024-12-10T11:15:44Z

design-proposals/vm-delete-protection.md

+
+1. Removing the `kubevirt.io/vm-delete-protection` annotation with the following patch operation:
+```bash
+$ kubectl patch vm <vm-name> --type=json -p='[{"op": "remove", "path": "/metadata/annotations/kubevirt.io~1vm-delete-protection"}]'


This ultimately means that anyone with the ability to update/patch the VM can enable or disable this functionality. Have we considered placing it in status controlled by a subresource with finer grain RBAC contol?

Correct me if I'm wrong, but I assume this is the same concern raised in the top comment of your review.

I didn't think about that TBH, but it feels a bit overkill IMHO. However, what would be the ultimate goal of doing that? I'm not sure if this particular use case requires it. Please, elaborate a bit if you think otherwise.

In any case, if we find out this is required, we could implement it in a follow-up.

The ultimate goal would be to allow a smaller subset of users (super admins?) to have access to the delete protection toggle than have the ability to update the VM object itself. It's extra work but given the use case here I'm not sure we want to give everyone access to this toggle right?

The ultimate goal would be to allow a smaller subset of users (super admins?) to have access to the delete protection toggle than have the ability to update the VM object itself.

@lyarwood this feature is only about the single persona, the VM owner with "fat fingers". An additional role would be not helpful in this use case.

@dominikholler well it's any user with DELETE privileges against the VM surely so not just an owner but anyone in the same namespace or admin. It also doesn't protect against that same user accidentently removing the label and thus the protection before requesting the deletion.

Moving the toggle into the status would avoid this as users wouldn't be able to touch it without going through the sub resource API that itself could be limited to a specific owner and/or super set of admins.

It also doesn't protect against that same user accidentently removing the label and thus the protection before requesting the deletion.

This would require two steps, this feature is just protecting from a single unintended step, e.g. a `kubectl delete vm/xxx" .

vladikr · 2024-12-10T19:10:34Z

Thank you for taking the time to write this proposal @jcanocan.

At first, I thought that this might be a good idea. Then I became conflicted as it seems that ultimately we are protecting these VMs from the cluster admins.
Those who have the ultimate responsibility in the cluster...
Aren't we going to ruin the hierarchy in the cluster if we allow users to prevent an admin operation?

If we really need such a layer of protection, I would imagine this feature working from another direction.
Perhaps a user could provide a known "hint" (in the form of a label) and the admin could create a list of protected VMs that have label X.

jcanocan · 2024-12-11T09:36:42Z

Thank you for taking the time to write this proposal @jcanocan.

At first, I thought that this might be a good idea. Then I became conflicted as it seems that ultimately we are protecting these VMs from the cluster admins. Those who have the ultimate responsibility in the cluster... Aren't we going to ruin the hierarchy in the cluster if we allow users to prevent an admin operation?

If we really need such a layer of protection, I would imagine this feature working from another direction. Perhaps a user could provide a known "hint" (in the form of a label) and the admin could create a list of protected VMs that have label X.

We could enable the protection by setting a label like:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/vm-delete-protection: "true"

In this way, as you mentioned, the cluster-admin could get a list of VMs with the protection enabled with something like:

kubectl get vm --selector=kubevirt.io/vm-delete-protection=true

WDYT? Did I get your idea right?

Thanks for your input.

vladikr · 2024-12-11T20:29:35Z

@jcanocan What I had in mind is something closer to MigrationPolicy or a CR configuration similar to PermittedHostDevices

The idea is to separate the concerns of each persona.
The VM owner would "hint" by setting a label:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/please-avoid-deleting: "true"

Then the admin would set the rule that would prevent the deletion of the associated VMs with the listed label :

preventVMsDeletion:
  selector: kubevirt.io/please-avoid-deleting

vladikr · 2024-12-12T14:11:49Z

@jcanocan What I had in mind is something closer to MigrationPolicy or a CR configuration similar to PermittedHostDevices

The idea is to separate the concerns of each persona. The VM owner would "hint" by setting a label:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/please-avoid-deleting: "true"

perhaps the label/annotations can be protected with cel? @xpivarc

vladikr · 2024-12-12T14:13:18Z

Might be an invalid use case but what about using status and a subresource API to control this behaviour to allow finer grain RBAC control?

It's an interesting idea. How would this work with newly created VMs? Doesn't it have to be only a secondary action?

dominikholler · 2024-12-12T14:39:15Z

@jcanocan What I had in mind is something closer to MigrationPolicy or a CR configuration similar to PermittedHostDevices

The idea is to separate the concerns of each persona. The VM owner would "hint" by setting a label:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/please-avoid-deleting: "true"
Then the admin would set the rule that would prevent the deletion of the associated VMs with the listed label :
preventVMsDeletion:
  selector: kubevirt.io/please-avoid-deleting

@vladikr Isn't this already working out-of-the-box using ValidatingAdmissionPolicy ?

This feature is about having a fixed marking (e.g. label or annotation) so that the UI can configure a marking for a pre-defined rule.

dominikholler · 2024-12-12T14:40:26Z

Might be an invalid use case but what about using status and a subresource API to control this behaviour to allow finer grain RBAC control?

It's an interesting idea. How would this work with newly created VMs? Doesn't it have to be only a secondary action?

In which way would help this idea to help the admin with "fat-fingers" ?

vladikr · 2024-12-12T15:46:25Z

@jcanocan What I had in mind is something closer to MigrationPolicy or a CR configuration similar to PermittedHostDevices
The idea is to separate the concerns of each persona. The VM owner would "hint" by setting a label:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/please-avoid-deleting: "true"
Then the admin would set the rule that would prevent the deletion of the associated VMs with the listed label :
preventVMsDeletion:
  selector: kubevirt.io/please-avoid-deleting
@vladikr Isn't this already working out-of-the-box using ValidatingAdmissionPolicy ?

Please elaborate on what is working out of the box. :)
I mentioned CEL (and I meant ValidatingAdmissionPolicy ) in my previous comment

I'm only trying to emphasize that we shouldn't now allow users to block admin operations without the admins' consent or/and a way to override this.

We achieve this separation either with ValidatingAdmissionPolicy or with a policy CRD that the admin would install or a CR config.

This feature is about having a fixed marking (e.g. label or annotation) so that the UI can configure a marking for a pre-defined rule.

This proposal is for KubeVirt which doesn't have a UI... but in all cases, I see a label or an annotation being used to request this feature.

vladikr · 2024-12-12T15:49:29Z

Might be an invalid use case but what about using status and a subresource API to control this behaviour to allow finer grain RBAC control?

It's an interesting idea. How would this work with newly created VMs? Doesn't it have to be only a secondary action?

In which way would help this idea to help the admin with "fat-fingers" ?

Not by itself, but this is just another way to limit the configuration of this feature to cluster admins and vm owners.

codingben · 2024-12-12T16:41:47Z

This proposal is for KubeVirt which doesn't have a UI...

I think we should get rid of mentioning "UI" in the design proposal. I've already opened discussion about it earlier this week: #363 (comment)

lyarwood · 2024-12-12T16:58:20Z

Might be an invalid use case but what about using status and a subresource API to control this behaviour to allow finer grain RBAC control?

It's an interesting idea. How would this work with newly created VMs? Doesn't it have to be only a secondary action?

Yeah I can't think of a clean way of providing it with the initial definition as you'd always have to remove it once populated in status to ensure a single source of truth.

In which way would help this idea to help the admin with "fat-fingers" ?

Not by itself, but this is just another way to limit the configuration of this feature to cluster admins and vm owners.

Yup it makes the act more deliberate by forcing an admin or whomever it is with access to the subresource to first toggle delete protection there and then fire off the delete.

Without this any user with the ability to update the VM object can toggle delete protection and then nuke the VM.

dominikholler · 2024-12-13T09:41:56Z

Please elaborate on what is working out of the box. :) I mentioned CEL (and I meant ValidatingAdmissionPolicy ) in my previous comment

I'm only trying to emphasize that we shouldn't now allow users to block admin operations without the admins' consent or/and a way to override this.

We achieve this separation either with ValidatingAdmissionPolicy or with a policy CRD that the admin would install or a CR config.

According to my limited technical understanding, the cluster admin can already right now create a ValidatingAdmissionPolicy like this:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:   
  name: "vm-delete-protection"
spec:   
  failurePolicy: Fail
  matchConstraints:     
    resourceRules:     
    - apiGroups:   ["kubevirt.io"]
      apiVersions: ["*"]
      operations:  ["DELETE"]
      resources:   ["virtualmachines"]
  variables: 
    - expression: string('kubevirt.io/vm-delete-protection')
      name: annotation
  validations: 
    - expression: (!(variables.annotation in oldObject.metadata.annotations) || !oldObject.metadata.annotations[variables.annotation].matches('^(true|True)$'))
      messageExpression: "'Object ' + string(oldObject.metadata.name) + ' cannot be deleted, remove delete protection'"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:   
  name: "vm-delete-protection-binding"
spec:   
  policyName: "vm-delete-protection"
  validationActions: [Deny]
  matchResources:

Now, if the VM owner adds the annotation kubevirt.io/vm-delete-protection: "true" to any VM object, it will be protected against deletion. You can do it by applying the following path to your VM object:

kubectl patch <vm-name> -n <namespace> --type=json -p '[{"op": "add", "path": "/metadata/annotations/kubevirt.io~1vm-delete-protection", "value": "true"}

To reverse the operation and allow deletion, the following command can be used:

kubectl patch vm <vm-name> -n <namespace> --type=json -p '[{"op": "remove", "path": "/metadata/annotations/kubevirt.io~1vm-delete-protection"}]'

the same should work if a label is used instead of the annotation.

This feature is about providing a easy to use, simple and testable interface to the already existing solution.
For this reason the feature would loose its value, if it is introducing more complexity than what already exists.

But I see the point, that there should be a way to introduce a way for the admin to control the deletion protection.

In which way would help this idea to help the admin with "fat-fingers" ?

Not by itself, but this is just another way to limit the configuration of this feature to cluster admins and vm owners.

I see, thanks for the hint!

dominikholler · 2024-12-13T10:32:28Z

Without this any user with the ability to update the VM object can toggle delete protection and then nuke the VM.

Where do you see the issue with this? Isn't it expected that if a user has the VM owners permissions this user is able to delete the VM?

xpivarc · 2024-12-13T15:47:49Z

@dominikholler The policy you wrote is exactly what you want. Now why should this opinionated (subjective at least to me) policy be part of core Kubevirt? Especially when it is so easy to implement as you demonstrated? Maybe more opinionated component could introduce this?

The proposal introduces a mechanism to protect VMs from being deleted. By adding the annotation `kubevirt.io/vm-delete-protection` and with a value of `true` or `True` to the VM object, the VM cannot be deleted by any means. This would prevent accidental deletions in VMs that users considers crucial for their applications/clusters. Signed-off-by: Javier Cano Cano <[email protected]>

jcanocan · 2024-12-16T10:05:15Z

Update:

Added new chosen solution, deploy VAP in SSP.
Changed the mechanism to enable the protection from annotation to label
Changed affected repos from kubevirt to SSP.

kubevirt-bot added the dco-signoff: yes Indicates the PR's author has DCO signed all their commits. label Dec 4, 2024

kubevirt-bot requested review from aburdenthehand and jean-edouard December 4, 2024 10:36

kubevirt-bot added the size/L label Dec 4, 2024

0xFelix reviewed Dec 4, 2024

View reviewed changes

geetikakay reviewed Dec 4, 2024

View reviewed changes

kubevirt-bot requested a review from vladikr December 4, 2024 14:33

jcanocan force-pushed the vm-delete-protection branch from 45dc46b to 510b378 Compare December 5, 2024 09:44

jean-edouard reviewed Dec 5, 2024

View reviewed changes