Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move CSI snapshot KEP into directory and add an extension for tightening validation #1902

Merged
merged 20 commits into from
Aug 18, 2020

Conversation

AndiLi99
Copy link
Contributor

Create KEP extension for tightening validation on snapshot objects #1900

Some details are not fully fleshed out.

  • We need to carefully consider what validation will be tightened, and which fields will become immutable. What types of immutability will we use? Immutable on create, set or once 'valid'?
  • How exactly will the webhook server deploy with TLS? There is no standard boilerplate as far as I am aware of for creating webhook servers. Some suggestions include cert-manager. We need to consider whether TLS deployment is in scope for this feature. How will we ensure the certs don't interfere with existing cert management installed on existing clusters? How will we automate testing of the webhook?

The release process has been outlined in order to tackle backwards compatibility concerns.

cc @xing-yang @msau42 @yuxiangqian @mattcary

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 22, 2020
@k8s-ci-robot
Copy link
Contributor

Welcome @AndiLi99!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 22, 2020
@k8s-ci-robot
Copy link
Contributor

Hi @AndiLi99. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Jul 22, 2020
@msau42
Copy link
Member

msau42 commented Jul 22, 2020

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 22, 2020
keps/sig-storage/177-volume-snapshot-validation/kep.yaml Outdated Show resolved Hide resolved
1. The first phase is webhook-only, and will use ratcheting validation combined with a reconciliation method to delete or fix currently persisted objects which are invalid under the new (strict) validation rules.
2. The second phase occurs once all invalid objects are cleared from the cluster. The CRD schema validation will be tightened and the webhook will stick around to enforce immutability until immutable fields come to CRDs. (or the crd upgrade could wait until immutable fields are available to do in one go)

The phases come in separate releases to allow users / cluster admin the opportunity to clean their cluster of any invalid objects. More details are in the Risks and Mitigations section.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 1 release enough? I'm not quite sure where this falls under the kubernetes deprecation policy. This is changing the behavior of a beta API, but the deprecation policy mostly talks about removing entire API groups. cc @liggitt

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of this tightening is like fixing bugs. The oneof restriction is already handled by the controller. So I think we should not wait for more than 1 release.

@msau42
Copy link
Member

msau42 commented Jul 22, 2020

@kubernetes/sig-storage-proposals

@k8s-ci-robot k8s-ci-robot added the kind/design Categorizes issue or PR as related to design. label Jul 22, 2020
@msau42
Copy link
Member

msau42 commented Jul 22, 2020

cc @jsafrane @liggitt

@yuxiangqian
Copy link

cc @xing-yang

@xing-yang
Copy link
Contributor

There's a validation error:

    --- FAIL: TestValidation/../../keps/sig-storage/177-volume-snapshot-validation/kep.yaml (0.00s)
        main_test.go:80: exit code was 1 and not 0. Output:
            ../../keps/sig-storage/177-volume-snapshot-validation/kep.yaml has an error: "error validating KEP metadata: \"prr-approvers\" must be one of (deads2k,johnbelamaric,wojtek-t) but it is a string: TBD"

@AndiLi99
Copy link
Contributor Author

@xing-yang who would be the prr reviewer for the original csi snapshot kep?

@mattcary
Copy link
Contributor

@xing-yang who would be the prr reviewer for the original csi snapshot kep?

Looks like PRRs are new: #1620

There's a note in the changes to "Please reach out on the
#prod-readiness channel" for help & guidance.

I'll ask on there what to do.

@xing-yang
Copy link
Contributor

The original snapshot KEP didn't go through the PRR process. You can pick one of (deads2k,johnbelamaric,wojtek-t) as PRR reviewer.


## Proposal

Tighten the validation on Volume Snapshot objects. The following fields will begin to enforce immutability: `VolumeSnapshot.Spec.Source` and `VolumeSnapshotContent.Spec.Source`. The following fields will begin to enforce `oneOf`: `VolumeSnapshot.Spec.Source` and `VolumeSnapshotContent.Spec.Source`). The following fields will begin to enforce non-empty strings: `spec.VolumeSnapshotClassName`. More details are in the Validating Scenarios section.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also spec.VolumeSnapshotRef

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to the list with a note that it is immutable only after UID is set. I'm not sure if that behaviour is good though. I feel like we shall either make it fully immutable after creation, or not make it immutable at all. Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, the VolumeSnapshotRef field says it's immutable after creation, is that inconsistent with the UID being set after creation? Also, I thought the controller created the whole VSC at once (but am I mistaken?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the dynamic snapshot case, the controller creates the entire VSC at once, you are correct.

In the pre-provisioned snapshot case, the CA must create the VSC. They must specify the name and namespace of the VS they wish to be bound to the VSC on the VSC.Spec.VolumeSnapshotRef field. If the VS does not exist yet there is no UID set, The controller after binding will set the UID in the VSC.Spec.VolumeSnapshotRef field. So the controller must be able to change that field after it is created to add set the UID one time.

This is a bit confusing but I'm not sure if anything can be done to change this behaviour.

cc @yuxiangqian to check if I got this behaviour right.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. Could you make a quick PR to update the comment in types.go so that it's consistent with what is actually the case?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what Andi described is accurate, I guess the semantic is more accurate if put this way: VSC.Spec.VolumeSnapshotRef is immutable (more specifically name, namespace, and UID fields) once bound.
for dynamic case, there is little ambiguity, all three fields will be set at once by controller.
for preprovisioned case, name/namespace are the users responsibility to set, while the UID is the controller's responsibility during binding (two stages instead of one compared to dynamic case). I think it's reasonable to allow updating for name/namespace fields BEFORE a content is bound to a VS, i.e., UID == "", for the pre-provisioned case. A potential use case could be:

  1. CA creates a pre-provisioned VSC with VS-A name = "a", ns = "a-ns" (not bound yet)
  2. CA realized they need the VSC to point to VS-B name = "b", ns = "b-ns"
  3. CA updates the VSC to point to VS-B (if VSC has been bound(by checking UID == ""), it should be rejected by webhook, if not, webhook should allow?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think those semantics make sense. But we shouldn't call the field just "immutable" in that case. I like making more precise by saying "immutable after binding" or something like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm making a quick PR to update the comment in types.go to say it's immutable after UID is set only.

Fix typo
Fix wrong date for beta of snapshot
Change to reference release tag
Add `VolumeSnapshotContent.spec.VolumeSnapshotRef` to list of immutable
fields with note that it is immutable only after UID is set.
| CREATE | spec.Source | Exactly one of PersistentVolumeClaimName (Dynamic) or VolumeSnapshotContentName (Pre-provisioned) should be specified. | 400 |
| UPDATE | spec.Source | Immutable, no updates allowed. If the user has specified an incorrect source, they must delete and remake the snapshot. The webhook validation server will not be able to guarantee that only incorrect sources are allowed to be updated. | 400 |
| CREATE | spec.VolumeSnapshotClassName | Must not be the empty string. Can be unset (to use the default snapshot class, if it is set. If the default snapshot class is not set or there is more than 1 default class, then the hook will allow the creation but the snapshot will fail.), or set to a non-empty string (the snapshot class). | 400 |
| UPDATE | spec.VolumeSnapshotClassName | Same restrictions as CREATE. We won’t restrict updating by making this field immutable (only applying the same restrictions as creation) but this field should only be changed by those who know exactly what they are doing. | 400 |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add the reason for not restricting updates?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuxiangqian would know what specific scenarios are valid for changing the class name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a TODO to resolve this by the time we implement the webhook? We should very clearly understand the use cases for supporting mutability of each field, and perhaps even document the use case in the CRD schema.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one scenario was the following:

  1. user creates a VS which points to a non-existing volume snapshot class (fat-fingered or whatever reason), in this case the volume snapshot controller will fail the create a snapshot.
  2. user updates the volume snapshot class to point to a correct one.
    IMO it sounds like a reasonable mutation scenario?
    If it's a mutable field, then it makes sense to me not to list this entry here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if a valid snapshot was created, but then the user changes the class to something different?

I guess there's a similar scenario where a user could delete and create a volumesnapshotclass with the same name, but completely different settings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@msau42
Copy link
Member

msau42 commented Aug 17, 2020

/approve
Awesome work Andy on tackling this difficult problem!

Will leave it to @xing-yang and @yuxiangqian for final review

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 17, 2020
Copy link

@yuxiangqian yuxiangqian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall LGTM. Great work Andi!

| CREATE | spec.Source | Exactly one of PersistentVolumeClaimName (Dynamic) or VolumeSnapshotContentName (Pre-provisioned) should be specified. | 400 |
| UPDATE | spec.Source | Immutable, no updates allowed. If the user has specified an incorrect source, they must delete and remake the snapshot. The webhook validation server will not be able to guarantee that only incorrect sources are allowed to be updated. | 400 |
| CREATE | spec.VolumeSnapshotClassName | Must not be the empty string. Can be unset (to use the default snapshot class, if it is set. If the default snapshot class is not set or there is more than 1 default class, then the hook will allow the creation but the snapshot will fail.), or set to a non-empty string (the snapshot class). | 400 |
| UPDATE | spec.VolumeSnapshotClassName | Same restrictions as CREATE. We won’t restrict updating by making this field immutable (only applying the same restrictions as creation) but this field should only be changed by those who know exactly what they are doing. | 400 |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one scenario was the following:

  1. user creates a VS which points to a non-existing volume snapshot class (fat-fingered or whatever reason), in this case the volume snapshot controller will fail the create a snapshot.
  2. user updates the volume snapshot class to point to a correct one.
    IMO it sounds like a reasonable mutation scenario?
    If it's a mutable field, then it makes sense to me not to list this entry here?

Due to backwards compatibility concerns, the tightening will occur in three phases.

1. The first phase is webhook-only, and will use [ratcheting validation](#backwards-compatibility). It will be the user's responsibility to clean up invalid objects which already existed before the webhook was enabled. Invalid objects are those which fail the new, stricter validation. The controller will not be able to automatically fix invalid objects, however it will apply a [label](#automatic-labelling-of-invalid-objects) to invalid objects so that users can easily locate them.
2. The second phase can occur once all invalid objects are cleared from the cluster. It will be the cluster admin's responsibility to check and detect when it is safe to move to the second phase. The CRD schema validation will be tightened and the webhook will stick around to enforce immutability until immutable fields come to CRDs (Custom Resource Definition). This will be accompanied by a version change to make it clear the CRD is using different validation, however the storage version will be kept as `v1beta1` to ensure a [rollback](#rollback) is possible at phase 2.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What "version change" do you mean? Cut a different release?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Version change in the api of the object. So volume snapshot moves from v1beta1 to v1.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify in the KEP that "version change" means snapshot API version moves from v1beta1 to v1?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not quite a "move" at this point. We're just making a new api version available at this stage, and changing our controllers to use the new api version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What API version are you referring to? Is it "v1"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does phase 2 implies snapshot will go GA?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reworded it to show that in phase 2 there will be a new snapshot version (v1). The plan is for phase 2 to be v1, implying GA.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is not GA ready, potentially the version could be v1beta2.

@yuxiangqian
Copy link

LGTM
@liggitt @mattcary @xing-yang any more comments?

@xing-yang
Copy link
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 18, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AndiLi99, msau42, xing-yang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit fd1df62 into kubernetes:master Aug 18, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Aug 18, 2020
@xing-yang
Copy link
Contributor

Thanks Andi!

@AndiLi99
Copy link
Contributor Author

Thank you everyone for all your help! Pleasure to work with you all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/design Categorizes issue or PR as related to design. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.