Integration tests for changing feature flags #5999

Closed
lbernick opened this issue Jan 17, 2023 · 11 comments · Fixed by #6017

@lbernick
Member

When a cluster operator changes the value of a feature flag, or when they upgrade to a new Pipeline version that modifies the default value for a feature flag, this has the potential to affect running PipelineRuns. We don't have any tests for this as far as I know.

One example of how this can cause problems is #5991, where a running PipelineRun fails to complete if embedded-status is modified during its execution (thanks @JeromeJu for spotting this problem!). There are other feature flags (disable-affinity-assistant, enable-api-fields, etc.) that have the potential to interfere with running PipelineRuns, but we haven't tested these interactions.

I'm not sure how problematic this is. I expect these changes happen infrequently, but operators should be able to smoothly upgrade Tekton versions, and clusters that run a large volume of PipelineRuns will likely run into this problem. We are also planning to change the value of "disable-affinity-assistant" in the v1 release.

I'm also not sure of the best way to test for this. Modifying feature flags during unit tests is challenging, because our unit tests test one reconcile loop at a time. Inserting a flag change into unit tests means baking in assumptions about when during execution a flag flip would happen, and makes tests much more complicated. However, doing the same during integration tests also has challenges, because we run all integration tests at the same time on the same cluster. Maybe we could write some new integration tests that run on their own cluster?
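
For concreteness, a rough sketch of what one such dedicated-cluster test could look like, assuming a client-go clientset wired up by the harness; the steps marked as harness-specific are hypothetical, not existing helpers:

```go
package test

import (
	"context"
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// flipFeatureFlag edits the feature-flags ConfigMap that the Tekton
// controller watches, changing one key in place.
func flipFeatureFlag(ctx context.Context, c kubernetes.Interface, key, value string) error {
	cm, err := c.CoreV1().ConfigMaps("tekton-pipelines").Get(ctx, "feature-flags", metav1.GetOptions{})
	if err != nil {
		return err
	}
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	cm.Data[key] = value
	_, err = c.CoreV1().ConfigMaps("tekton-pipelines").Update(ctx, cm, metav1.UpdateOptions{})
	return err
}

func TestEmbeddedStatusFlipMidRun(t *testing.T) {
	ctx := context.Background()
	var kubeClient kubernetes.Interface // assumed: initialized by the test harness

	// 1. Create a PipelineRun slow enough to still be running (harness-specific,
	//    e.g. a Task that sleeps), then wait until it is in progress.

	// 2. Flip the flag while the run is in flight.
	if err := flipFeatureFlag(ctx, kubeClient, "embedded-status", "minimal"); err != nil {
		t.Fatal(err)
	}

	// 3. Assert the run still completes successfully (the failure mode in #5991
	//    was a run that never completed after the flip).
}
```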

JeromeJu added a commit to JeromeJu/pipeline that referenced this issue Jan 17, 2023
… Switch

This commit fixes the updates for reconciling PipelineRunStatus when switching
the EmbeddedStatus feature flag. It resets the status.runs and status.taskruns
to nil with minimal EmbeddedStatus and status.childReferences to nil with
full EmbeddedStatus.

Prior to this change, the childReferences, runs, and taskruns in pipelineRunStatus
would not have been reset to nil, which could have led to a validation error when the
feature flag was switched. The following issue could help prevent such bugs in the future:
Integration tests for changing feature flags tektoncd#5999
tekton-robot pushed a commit that referenced this issue Jan 18, 2023
… Switch
@dibyom
Member

dibyom commented Jan 19, 2023

> I'm not sure how problematic this is. I expect these changes happen infrequently, but operators should be able to smoothly upgrade Tekton versions, and clusters that run a large volume of PipelineRuns will likely run into this problem. We are also planning to change the value of "disable-affinity-assistant" in the v1 release.

I agree that these changes will be infrequent. So one option might be to declare that changing values mid-flight results in undefined behavior.
If we do decide to provide defined behavior here, we could "freeze" the config used for a particular run at the beginning of the run.

@afrittoli
Member

> I'm not sure how problematic this is. I expect these changes happen infrequently, but operators should be able to smoothly upgrade Tekton versions, and clusters that run a large volume of PipelineRuns will likely run into this problem. We are also planning to change the value of "disable-affinity-assistant" in the v1 release.

> I agree that these changes will be infrequent. So one option might be to declare that changing values mid-flight results in undefined behavior. If we do decide to provide defined behavior here, we could "freeze" the config used for a particular run at the beginning of the run.

The only way I can imagine we could freeze the config would be to store it in the status of the particular run, but I'm not sure we should go down this path. Perhaps declaring "undefined behaviour" like @dibyom suggested is a good way to handle this.
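
For illustration, a minimal self-contained sketch of the "freeze" idea under discussion; FeatureFlags, RunStatus, and FrozenFlags are hypothetical stand-ins, not Tekton's real types:

```go
package main

import "fmt"

// FeatureFlags and RunStatus are hypothetical; Tekton's PipelineRunStatus
// has no FrozenFlags field today.
type FeatureFlags struct {
	EmbeddedStatus string
}

type RunStatus struct {
	FrozenFlags *FeatureFlags // snapshot taken on the first reconcile
}

// flagsFor returns the flags a run should use: snapshot the live values once,
// then ignore any later ConfigMap edits for the lifetime of the run.
func flagsFor(status *RunStatus, live FeatureFlags) FeatureFlags {
	if status.FrozenFlags == nil {
		frozen := live
		status.FrozenFlags = &frozen
	}
	return *status.FrozenFlags
}

func main() {
	st := &RunStatus{}
	fmt.Println(flagsFor(st, FeatureFlags{EmbeddedStatus: "full"}))    // freezes "full"
	fmt.Println(flagsFor(st, FeatureFlags{EmbeddedStatus: "minimal"})) // still "full"
}
```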

@lbernick
Member Author

SGTM, I'll open a PR.

@lbernick
Member Author

@afrittoli @dibyom I realized we now do actually store feature flags in taskrun/pipelinerun status via the provenance field. Is it worth updating the codebase to use the values of feature flags stored in the provenance field, if present, rather than always using the values from the configmap? @chuangw6 curious to hear your thoughts as well.
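
A sketch of the lookup order this would imply: prefer the snapshot in status.provenance and fall back to the live, ConfigMap-backed values from the context. The exact field names (Status.Provenance.FeatureFlags) are my reading of the v1beta1 API and should be treated as an assumption:

```go
package pipelinerun

import (
	"context"

	"github.com/tektoncd/pipeline/pkg/apis/config"
	v1beta1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// flagsForRun prefers the flags recorded when the run started, so a mid-run
// ConfigMap edit cannot change this run's behavior.
func flagsForRun(ctx context.Context, pr *v1beta1.PipelineRun) *config.FeatureFlags {
	if p := pr.Status.Provenance; p != nil && p.FeatureFlags != nil {
		// Snapshot taken at the start of the run: stable for its lifetime.
		return p.FeatureFlags
	}
	// No snapshot yet (e.g. first reconcile): use the current values.
	return config.FromContextOrDefaults(ctx).FeatureFlags
}
```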

@chuangw6
Member

Thanks @lbernick for bringing this up. Good point.
Generally, I think the feature flags stored in the TR/PR status.provenance field should be a good solution for the issue you mentioned.

However, the point at which we store the feature flags into status is the same as when we store the taskSpec/pipelineSpec into status (line 173, prepare), which is before the execution (line 188, reconcile).

I am wondering if there is a race condition where the feature flag version is X at the point we store feature flags into status, but the feature flags are changed to version Y before the execution.

This makes me wonder whether storeTaskSpecAndMergeMeta and storePipelineSpecAndMergeMeta are the right places to store the feature flags into status. cc @chitrangpatel for thoughts.
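
A self-contained toy simulation of that race, with all types hypothetical: the value recorded during prepare can diverge from the value the later execution actually reads:

```go
package main

import "fmt"

// run stands in for a TaskRun/PipelineRun; provenanceFlags is the value
// recorded into status.provenance during prepare.
type run struct {
	provenanceFlags string
}

func prepare(r *run, live string) { r.provenanceFlags = live } // ~line 173

func execute(live string) string { return live } // ~line 188: reads the live flags

func main() {
	live := "embedded-status=full" // version X
	r := &run{}
	prepare(r, live) // stores X into status.provenance

	live = "embedded-status=minimal" // operator flips the flag: now version Y

	used := execute(live) // execution reads the live value, i.e. Y
	fmt.Printf("recorded %q in provenance, but executed with %q\n", r.provenanceFlags, used)
}
```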

@lbernick
Member Author

lbernick commented May 30, 2023

> I am wondering if there is a race condition where the feature flag version is X at the point we store feature flags into status, but the feature flags are changed to version Y before the execution.

This seems like a reason in favor of using the feature flags from provenance, not a reason against. Shouldn't we make sure the provenance reflects the flags that were actually used to execute the pipelinerun?

@chuangw6
Member

> This seems like a reason in favor of using the feature flags from provenance, not a reason against.

+1

> Shouldn't we make sure the provenance reflects the flags that were actually used to execute the pipelinerun?

Yes, we definitely need to make sure the data there reflects the actual flags being used. I will revise this with @chitrangpatel and get back here with updates.

@chitrangpatel
Contributor

chitrangpatel commented Jun 7, 2023

I've been giving this a lot of thought and tried to prototype things a bit. The challenge with feature flags is as follows:

  • Validation happens in two places: the webhook and the reconciler.
  • If the feature flags are saved in the status of the taskrun/pipelinerun, then when we validate a stand-alone pipelineSpec/taskSpec (particularly at the webhook stage), this taskrun/pipelinerun status field is not accessible, so we cannot rely on it. The status idea would work if we were only validating at the reconciler level.
  • This means that we need to rely on the context. Now, the challenge here is that the webhook and controller watch for updates to the config map, and the context is updated when changes are detected (I think. Please correct me if I'm wrong). This means that we cannot rely on the context directly either (see the sketch below).
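
A sketch of the constraint described in the list above: both validation sites read flags from the context, and the webhook has no run status to consult. config.FromContextOrDefaults is the accessor Tekton uses as far as I know; the function shapes here are illustrative, not the real webhook/reconciler signatures:

```go
package validation

import (
	"context"

	"github.com/tektoncd/pipeline/pkg/apis/config"
	v1beta1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// Admission time: only the submitted spec and the request context exist;
// there is no TaskRun status yet, so frozen/provenance flags are unavailable.
func validateAtAdmission(ctx context.Context, ts *v1beta1.TaskSpec) error {
	flags := config.FromContextOrDefaults(ctx).FeatureFlags
	_, _ = flags, ts // e.g. gate alpha-only fields on flags.EnableAPIFields
	return nil
}

// Reconcile time: the run (and its status) is available, so a snapshot stored
// there could be consulted -- but the webhook above could not have used it,
// so the two sites can disagree if the ConfigMap changed in between.
func validateAtReconcile(ctx context.Context, tr *v1beta1.TaskRun) error {
	flags := config.FromContextOrDefaults(ctx).FeatureFlags
	_, _ = flags, tr
	return nil
}
```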

@lbernick
Member Author

lbernick commented Jun 8, 2023

@chitrangpatel thanks for the update, that's a great point. Just one clarification: Are the feature flags in provenance meant to reflect only the values used for the taskrun/pipelinerun, and not for the task/pipeline? And would it be worth creating a separate issue to figure out how feature flags in provenance can better reflect the execution of the pipelinerun, or do you think this issue is sufficient?

@chitrangpatel
Contributor

The feature flags in the provenance are meant to reflect the values for the entire build process, assuming they weren't mutated midway (i.e., it assumes that both the pipeline and the pipelinerun were validated with the same set of feature flags). Clearly we cannot guarantee that, since if the feature flags change midway we don't currently record it or fail the task/pipeline.

I think we can open a separate issue to figure out how feature flags in provenance can better reflect the execution of the pipelinerun. Let me do that now.

@chitrangpatel
Contributor

#6797 continues this discussion from the provenance perspective.
