Integration tests for changing feature flags #5999

Closed
lbernick opened this issue Jan 17, 2023 · 11 comments · Fixed by #6017

@lbernick
Member

When a cluster operator changes the value of a feature flag, or when they upgrade to a new Pipeline version that modifies the default value for a feature flag, this has the potential to affect running PipelineRuns. We don't have any tests for this as far as I know.

One example of how this can cause problems is #5991, where a running PipelineRun fails to complete if embedded-status is modified during its execution (thanks @JeromeJu for spotting this problem!). There are other feature flags (disable-affinity-assistant, enable-api-fields, etc.) that have the potential to interfere with running PipelineRuns, but we haven't tested these interactions.

I'm not sure how problematic this is. I expect these changes happen infrequently, but operators should be able to smoothly upgrade Tekton versions, and clusters that run a large volume of PipelineRuns will likely run into this problem. We are also planning to change the value of "disable-affinity-assistant" in the v1 release.

I'm also not sure of the best way to test for this. Modifying feature flags during unit tests is challenging, because our unit tests test one reconcile loop at a time. Inserting a flag change into unit tests means baking in assumptions about when during execution a flag flip would happen, and makes tests much more complicated. However, doing the same during integration tests also has challenges, because we run all integration tests at the same time on the same cluster. Maybe we could write some new integration tests that run on their own cluster?
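
For concreteness, a rough sketch of what one such dedicated-cluster test could look like, assuming a client-go clientset wired up by the harness; the steps marked as harness-specific are hypothetical, not existing helpers:

```go
package test

import (
	"context"
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// flipFeatureFlag edits the feature-flags ConfigMap that the Tekton
// controller watches, changing one key in place.
func flipFeatureFlag(ctx context.Context, c kubernetes.Interface, key, value string) error {
	cm, err := c.CoreV1().ConfigMaps("tekton-pipelines").Get(ctx, "feature-flags", metav1.GetOptions{})
	if err != nil {
		return err
	}
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	cm.Data[key] = value
	_, err = c.CoreV1().ConfigMaps("tekton-pipelines").Update(ctx, cm, metav1.UpdateOptions{})
	return err
}

func TestEmbeddedStatusFlipMidRun(t *testing.T) {
	ctx := context.Background()
	var kubeClient kubernetes.Interface // assumed: initialized by the test harness

	// 1. Create a PipelineRun slow enough to still be running (harness-specific,
	//    e.g. a Task that sleeps), then wait until it is in progress.

	// 2. Flip the flag while the run is in flight.
	if err := flipFeatureFlag(ctx, kubeClient, "embedded-status", "minimal"); err != nil {
		t.Fatal(err)
	}

	// 3. Assert the run still completes successfully (the failure mode in #5991
	//    was a run that never completed after the flip).
}
```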

JeromeJu added a commit to JeromeJu/pipeline that referenced this issue Jan 17, 2023
… Switch

This commit fixes the updates for reconciling PipelineRunStatus when switching
the EmbeddedStatus feature flag. It resets the status.runs and status.taskruns
to nil with minimal EmbeddedStatus and status.childReferences to nil with
full EmbeddedStatus.

Prior to this change, the childReferences, runs, and taskruns in pipelineRunStatus
would not have been reset to nil, which could have led to a validation error when the
feature flag was switched. The following issue could help prevent such bugs in the future:
Integration tests for changing feature flags tektoncd#5999
tekton-robot pushed a commit that referenced this issue Jan 18, 2023
… Switch
@dibyom
Member

dibyom commented Jan 19, 2023

> I'm not sure how problematic this is. I expect these changes happen infrequently, but operators should be able to smoothly upgrade Tekton versions, and clusters that run a large volume of PipelineRuns will likely run into this problem. We are also planning to change the value of "disable-affinity-assistant" in the v1 release.

I agree that these changes will be infrequent. So one option might be to declare that changing values mid-flight results in undefined behavior.
If we do decide to provide defined behavior here, we could "freeze" the config used for a particular run at the beginning of the run.

@afrittoli
Member

> I'm not sure how problematic this is. I expect these changes happen infrequently, but operators should be able to smoothly upgrade Tekton versions, and clusters that run a large volume of PipelineRuns will likely run into this problem. We are also planning to change the value of "disable-affinity-assistant" in the v1 release.

> I agree that these changes will be infrequent. So one option might be to declare that changing values mid-flight results in undefined behavior. If we do decide to provide defined behavior here, we could "freeze" the config used for a particular run at the beginning of the run.

The only way I can imagine we could freeze the config would be to store it in the status of the particular run, but I'm not sure we should go down this path. Perhaps declaring "undefined behaviour" like @dibyom suggested is a good way to handle this.
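
For illustration, a minimal self-contained sketch of the "freeze" idea under discussion; FeatureFlags, RunStatus, and FrozenFlags are hypothetical stand-ins, not Tekton's real types:

```go
package main

import "fmt"

// FeatureFlags and RunStatus are hypothetical; Tekton's PipelineRunStatus
// has no FrozenFlags field today.
type FeatureFlags struct {
	EmbeddedStatus string
}

type RunStatus struct {
	FrozenFlags *FeatureFlags // snapshot taken on the first reconcile
}

// flagsFor returns the flags a run should use: snapshot the live values once,
// then ignore any later ConfigMap edits for the lifetime of the run.
func flagsFor(status *RunStatus, live FeatureFlags) FeatureFlags {
	if status.FrozenFlags == nil {
		frozen := live
		status.FrozenFlags = &frozen
	}
	return *status.FrozenFlags
}

func main() {
	st := &RunStatus{}
	fmt.Println(flagsFor(st, FeatureFlags{EmbeddedStatus: "full"}))    // freezes "full"
	fmt.Println(flagsFor(st, FeatureFlags{EmbeddedStatus: "minimal"})) // still "full"
}
```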

@lbernick
Member Author

SGTM, I'll open a PR.

@lbernick
Member Author

@afrittoli @dibyom I realized we now do actually store feature flags in taskrun/pipelinerun status via the provenance field. Is it worth updating the codebase to use the values of feature flags stored in the provenance field, if present, rather than always using the values from the configmap? @chuangw6 curious to hear your thoughts as well.
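
A sketch of the lookup order this would imply: prefer the snapshot in status.provenance and fall back to the live, ConfigMap-backed values from the context. The exact field names (Status.Provenance.FeatureFlags) are my reading of the v1beta1 API and should be treated as an assumption:

```go
package pipelinerun

import (
	"context"

	"github.com/tektoncd/pipeline/pkg/apis/config"
	v1beta1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// flagsForRun prefers the flags recorded when the run started, so a mid-run
// ConfigMap edit cannot change this run's behavior.
func flagsForRun(ctx context.Context, pr *v1beta1.PipelineRun) *config.FeatureFlags {
	if p := pr.Status.Provenance; p != nil && p.FeatureFlags != nil {
		// Snapshot taken at the start of the run: stable for its lifetime.
		return p.FeatureFlags
	}
	// No snapshot yet (e.g. first reconcile): use the current values.
	return config.FromContextOrDefaults(ctx).FeatureFlags
}
```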

@chuangw6
Member

Thanks @lbernick for bringing this up. Good point.
Generally, I think the feature flags stored in the TR/PR status.provenance field should be a good solution for the issue you mentioned.

However, the point at which we store the feature flags into status is the same as when we store the taskSpec/pipelineSpec into status (line 173, prepare), which is before the execution (line 188, reconcile).

I am wondering if there is a race condition where the feature flag version is X at the point we store feature flags into status, but the feature flags are changed to version Y before the execution.

This makes me wonder whether storeTaskSpecAndMergeMeta and storePipelineSpecAndMergeMeta are the right places to store the feature flags into status. cc @chitrangpatel for thoughts.
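
A self-contained toy simulation of that race, with all types hypothetical: the value recorded during prepare can diverge from the value the later execution actually reads:

```go
package main

import "fmt"

// run stands in for a TaskRun/PipelineRun; provenanceFlags is the value
// recorded into status.provenance during prepare.
type run struct {
	provenanceFlags string
}

func prepare(r *run, live string) { r.provenanceFlags = live } // ~line 173

func execute(live string) string { return live } // ~line 188: reads the live flags

func main() {
	live := "embedded-status=full" // version X
	r := &run{}
	prepare(r, live) // stores X into status.provenance

	live = "embedded-status=minimal" // operator flips the flag: now version Y

	used := execute(live) // execution reads the live value, i.e. Y
	fmt.Printf("recorded %q in provenance, but executed with %q\n", r.provenanceFlags, used)
}
```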

@lbernick
Member Author

lbernick commented May 30, 2023

> I am wondering if there is a race condition where the feature flag version is X at the point we store feature flags into status, but the feature flags are changed to version Y before the execution.

This seems like a reason in favor of using the feature flags from provenance, not a reason against. Shouldn't we make sure the provenance reflects the flags that were actually used to execute the pipelinerun?

@chuangw6
Member

> This seems like a reason in favor of using the feature flags from provenance, not a reason against.

+1

> Shouldn't we make sure the provenance reflects the flags that were actually used to execute the pipelinerun?

Yes, we definitely need to make sure the data there reflects the actual flags being used. I will revise this with @chitrangpatel and get back here with updates.

@chitrangpatel
Contributor

chitrangpatel commented Jun 7, 2023

I've been giving this a lot of thought and tried to prototype things a bit. The challenge with feature flags is as follows:

  • Validation happens in two places: the webhook and the reconciler.
  • If the feature flags are saved in the status of the taskrun/pipelinerun, then when we validate a stand-alone pipelineSpec/taskSpec (particularly at the webhook stage), this taskrun/pipelinerun status field is not accessible, so we cannot rely on it. The status idea would work if we were only validating at the reconciler level.
  • This means that we need to rely on the context. Now, the challenge here is that the webhook and controller watch for updates to the config map, and the context is updated when changes are detected (I think. Please correct me if I'm wrong). This means that we cannot rely on the context directly either (see the sketch below).
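
A sketch of the constraint described in the list above: both validation sites read flags from the context, and the webhook has no run status to consult. config.FromContextOrDefaults is the accessor Tekton uses as far as I know; the function shapes here are illustrative, not the real webhook/reconciler signatures:

```go
package validation

import (
	"context"

	"github.com/tektoncd/pipeline/pkg/apis/config"
	v1beta1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// Admission time: only the submitted spec and the request context exist;
// there is no TaskRun status yet, so frozen/provenance flags are unavailable.
func validateAtAdmission(ctx context.Context, ts *v1beta1.TaskSpec) error {
	flags := config.FromContextOrDefaults(ctx).FeatureFlags
	_, _ = flags, ts // e.g. gate alpha-only fields on flags.EnableAPIFields
	return nil
}

// Reconcile time: the run (and its status) is available, so a snapshot stored
// there could be consulted -- but the webhook above could not have used it,
// so the two sites can disagree if the ConfigMap changed in between.
func validateAtReconcile(ctx context.Context, tr *v1beta1.TaskRun) error {
	flags := config.FromContextOrDefaults(ctx).FeatureFlags
	_, _ = flags, tr
	return nil
}
```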

@lbernick
Member Author

lbernick commented Jun 8, 2023

@chitrangpatel thanks for the update, that's a great point. Just one clarification: Are the feature flags in provenance meant to reflect only the values used for the taskrun/pipelinerun, and not for the task/pipeline? And would it be worth creating a separate issue to figure out how feature flags in provenance can better reflect the execution of the pipelinerun, or do you think this issue is sufficient?

@chitrangpatel
Contributor

The feature flags in the provenance are meant to reflect the values for the entire build process, assuming they weren't mutated midway (i.e., it assumes that both the pipeline and the pipelinerun were validated with the same set of feature flags). Clearly we cannot guarantee that, since if the feature flags change midway we don't currently record it or fail the task/pipeline.

I think we can open a separate issue to figure out how feature flags in provenance can better reflect the execution of the pipelinerun. Let me do that now.

@chitrangpatel
Contributor

#6797 continues this discussion from the provenance perspective.
