Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate/Upgrade previous version with a default timeout… #1083

Merged
merged 1 commit into from
Aug 6, 2019

Conversation

vdemeester
Copy link
Member

@vdemeester vdemeester commented Jul 16, 2019

Changes

In case of a controller upgrade, we may still store previous versions
of resource that may not have been the default set. This migrates
those "in memory" to make sure we do not panic in those cases.

This also gives a great example of how to use a mutating webhook to
migrate some resources from one version to another.

/priority critical-urgent
/cc @bobcatfish

Signed-off-by: Vincent Demeester [email protected]

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

See the contribution guide for more details.

Double check this list of stuff that's easy to miss:

Release Notes

Add a mechanisms to update CRD objects from one version to another

@vdemeester vdemeester added this to the Pipelines 0.5 🐱 milestone Jul 16, 2019
@tekton-robot tekton-robot requested a review from bobcatfish July 16, 2019 11:14
@tekton-robot tekton-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jul 16, 2019
@googlebot googlebot added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label Jul 16, 2019
@tekton-robot tekton-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 16, 2019
// add the old default timeout.
// Most likely those TaskRun passing here are already done and/or already running
// and were meant to have the 10 minutes default timeout
timeout = &metav1.Duration{Duration: 10 * time.Minute}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hard-coded old value. We could just use config.DefaultTimeoutMinutes here instead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather use the new value - seems odd to sometimes default to old value, sometimes default to the new value

@vdemeester
Copy link
Member Author

/hold

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 16, 2019
@bobcatfish
Copy link
Collaborator

This also gives a great example of how to use a mutating webhook to migrate some resources from one version to another.

I tried to google this but didnt see an obvious answer: I thought there was a difference between a "mutating" webhook and a non-mutating webhook, i.e. something you need to declare when actually creating the webhook controller itself? Up until this point I thought we were only validating, not mutating, so I am surprised we can just start suddenly mutating. I think I'm missing some knowledge on how this actually works tho, any help appreciated! :bow_panda:

Is it because of:

- apiGroups: ["admissionregistration.k8s.io"]
resources: ["mutatingwebhookconfigurations"]
verbs: ["get", "list", "create", "update", "delete", "patch", "watch"]

And if so, does that mean we've been mutating all this time? (I didn't think we were 🤔)

// HasDefaultConfigurationName checks to see whether the given context has
// been marked as having a default configurationName.
func HasDefaultConfigurationName(ctx context.Context) bool {
return ctx.Value(hdcnKey{}) != nil
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's hdcn ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not ? 😹 hdcn is for hasdefaultconfigurationname

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤭

Copy link
Collaborator

@bobcatfish bobcatfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vdemeester do we want to take the approach of defaulting values via a mutating webhook in all cases like this in the future?

It looks to me like the root cause is:

timeout := tr.Spec.Timeout.Duration

So another way to approach this would be to check if tr.Spec.Timeout is nil before trying to use Duration?


// lemonadeKey is used as the key for associating information
// with a context.Context.
type lemonadeKey struct{}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's lemonade?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's just a random key struct used internal for storing some metadata in the context (here just that we are upgrading it). I was debating on naming this sthg else, but it's how knative named it, and I like lemonade 😝

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v good in hot weather 😎

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe the comment could mention that lemonadeKey is a random name? when i see this it makes me think there is some kinda important thing called "lemonade" - e.g. a "lemonade" defaulting algorithm?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, yeah I can add a comment 😉

@bobcatfish
Copy link
Collaborator

I have two thoughts:

  • We should also probably update
    timeout := tr.Spec.Timeout.Duration
    to check for nil
  • I'm a bit confused about why the defaulting that we added before wasn't getting applied anymore (I'm guessing b/c before we were only applying it via the webhook, and in this case, the TaskRuns already exist in etcd?) - I'd like to revisit defaulting in general and our story around values that can be nil (but that's probably a story for our postmortem!)

But I'm happy to go ahead and get this in to fix 0.5.1! Thanks for fixing this so fast and for putting up with my many comments @vdemeester 🙇‍♀

/lgtm
/approve
/meow space

@vdemeester
Copy link
Member Author

@vdemeester do we want to take the approach of defaulting values via a mutating webhook in all cases like this in the future?

I think we do, yes

It looks to me like the root cause is:

timeout := tr.Spec.Timeout.Duration

So another way to approach this would be to check if tr.Spec.Timeout is nil before trying to use Duration?

Right, that would another way to fix it. The only thing I didn't want is to add yet another place where we check / add defaults. Might be the simplest fix though 👼

@tekton-robot
Copy link
Collaborator

@bobcatfish: cat image

In response to this:

I have two thoughts:

  • We should also probably update
    timeout := tr.Spec.Timeout.Duration
    to check for nil
  • I'm a bit confused about why the defaulting that we added before wasn't getting applied anymore (I'm guessing b/c before we were only applying it via the webhook, and in this case, the TaskRuns already exist in etcd?) - I'd like to revisit defaulting in general and our story around values that can be nil (but that's probably a story for our postmortem!)

But I'm happy to go ahead and get this in to fix 0.5.1! Thanks for fixing this so fast and for putting up with my many comments @vdemeester 🙇‍♀

/lgtm
/approve
/meow space

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 16, 2019
@bobcatfish
Copy link
Collaborator

Right, that would another way to fix it. The only thing I didn't want is to add yet another place where we check / add defaults. Might be the simplest fix though 👼

I feel like we should do that also, b/c as it is we have code that if called as-is could panic 😓 (e.g. in a unit test). It's a separation of concerns for sure - I guess the ideal case would be ensuring via the way these objects are instantiated that all the fields have valid values so every single function accessing them doesn't have to worry about them.

@vdemeester vdemeester removed the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jul 16, 2019
@tekton-robot tekton-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jul 16, 2019
@vdemeester vdemeester mentioned this pull request Jul 16, 2019
3 tasks
@vdemeester vdemeester removed the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jul 16, 2019
@vdemeester
Copy link
Member Author

@bobcatfish #1085 for the simplest fix (to be cherry-picked), I'll update this one to have this mechanism to all resources 👼 😉


var timeout *metav1.Duration
if IsUpgradeViaDefaulting(ctx) {
// This case is for preexisting `TaskRun` before 0.5.0, so let's
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm missing something: the lemonade key is added to the context at the beginning of the reconcile function, so it will be added to old and new TaskRuns alike. However new TaskRuns willl have the timeout set, so they won't hit L47.

@vdemeester vdemeester force-pushed the fix-possible-nil-timeout branch from fc4c3a7 to b76ad0f Compare July 23, 2019 17:43
@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Jul 23, 2019
@vdemeester
Copy link
Member Author

@bobcatfish Updated for pipelinerun too (and only that for now).

@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/contexts.go 0.0% 50.0% 50.0
pkg/apis/pipeline/v1alpha1/pipelinerun_defaults.go 0.0% 100.0% 100.0
pkg/apis/pipeline/v1alpha1/taskrun_defaults.go 0.0% 100.0% 100.0
pkg/reconciler/v1alpha1/pipelinerun/pipelinerun.go 79.6% 79.7% 0.1
pkg/reconciler/v1alpha1/taskrun/taskrun.go Do not exist 70.7%

Copy link
Collaborator

@bobcatfish bobcatfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few comments ... mostly asking for comments haha - there is a bit about how this defaulting works that I am not 100% clear on 🙏

// HasDefaultConfigurationName checks to see whether the given context has
// been marked as having a default configurationName.
func HasDefaultConfigurationName(ctx context.Context) bool {
return ctx.Value(hdcnKey{}) != nil
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤭


// lemonadeKey is used as the key for associating information
// with a context.Context.
type lemonadeKey struct{}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v good in hot weather 😎

}

// IsUpgradeViaDefaulting checks whether we should be "defaulting" from v1alpha1 to
// the v1beta1 subset.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused about what v1alpha1 and v1beta1 are referring to in this context - my understanding is that we are going from pre 0.5.0 to 0.5.0 ? Or does v1beta1 here refer to the version of a lib we are using? (maybe the comment could clarify?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah lol, I write that "ahead" of time. it should be from pr 0.5.0 to post 0.5.0 indeed 😉
I need to update the comment indeed 👍


// lemonadeKey is used as the key for associating information
// with a context.Context.
type lemonadeKey struct{}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe the comment could mention that lemonadeKey is a random name? when i see this it makes me think there is some kinda important thing called "lemonade" - e.g. a "lemonade" defaulting algorithm?

@vdemeester vdemeester force-pushed the fix-possible-nil-timeout branch from b76ad0f to 451d0b4 Compare July 24, 2019 12:19
@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/contexts.go 0.0% 50.0% 50.0
pkg/apis/pipeline/v1alpha1/pipelinerun_defaults.go 0.0% 100.0% 100.0
pkg/apis/pipeline/v1alpha1/taskrun_defaults.go 0.0% 100.0% 100.0
pkg/reconciler/v1alpha1/pipelinerun/pipelinerun.go Do not exist 79.7%
pkg/reconciler/v1alpha1/taskrun/taskrun.go 71.0% 70.7% -0.3

@vdemeester
Copy link
Member Author

Updated 👼

@pmorie
Copy link
Contributor

pmorie commented Jul 24, 2019

@vdemeester do we want to take the approach of defaulting values via a mutating webhook in all cases like this in the future?

I think the specific use case here is:

  • In release X, we add a new field with a default value
  • Objects created between X-1 and X do not have the field, but should have the default after release X

Is that accurate? If so, CRD conversion webhooks are the long-term solution here. Use of mutating webhook to perform this function is a stop-gap until we have those.

@pmorie
Copy link
Contributor

pmorie commented Jul 24, 2019

@vdemeester
Copy link
Member Author

I think the specific use case here is:

* In release X, we add a new field with a default value

* Objects created between X-1 and X do not have the field, but should have the default after release X

Is that accurate? If so, CRD conversion webhooks are the long-term solution here. Use of mutating webhook to perform this function is a stop-gap until we have those.

@pmorie yes, that is accurate 🤗

@vdemeester
Copy link
Member Author

Cc @abayer @dibyom

@vdemeester
Copy link
Member Author

/hold cancel

@tekton-robot tekton-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 3, 2019
In case of a controller upgrade, we may still store previous versions
of resource that may not have been the default set. This migrates
those "in memory" to make sure we do not panic in those cases.

This also gives a great example of how to use a mutating webhook to
migrate some resources from one version to another.

Signed-off-by: Vincent Demeester <[email protected]>
@vdemeester vdemeester force-pushed the fix-possible-nil-timeout branch from 451d0b4 to fbfbfea Compare August 6, 2019 14:25
@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobcatfish, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [bobcatfish,vdemeester]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/contexts.go 0.0% 50.0% 50.0
pkg/apis/pipeline/v1alpha1/pipelinerun_defaults.go 75.0% 100.0% 25.0
pkg/apis/pipeline/v1alpha1/taskrun_defaults.go 83.3% 100.0% 16.7
pkg/reconciler/v1alpha1/pipelinerun/pipelinerun.go 100.0% 82.4% -17.6
pkg/reconciler/v1alpha1/taskrun/taskrun.go 71.0% 70.7% -0.3

@abayer
Copy link
Contributor

abayer commented Aug 6, 2019

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 6, 2019
@vdemeester
Copy link
Member Author

/test pull-tekton-pipeline-integration-tests

@tekton-robot tekton-robot merged commit cff3043 into tektoncd:master Aug 6, 2019
@vdemeester vdemeester deleted the fix-possible-nil-timeout branch August 6, 2019 15:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants