Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEP-0050 - ignore task failures #342

Merged
merged 1 commit into from
Mar 4, 2021

Conversation

pritidesai
Copy link
Member

TEP-0040 is proposing a TEP to ignore step errors. We have identified a need for two separate proposals during the multiple rounds of discussion in TEP-0040:

(1) ignore errors at the step level (TEP-0040)
(2) ignore failures at the task level

Proposing a TEP for (2) to ignore failures at the task level and allow pipeline authors to unblock execution after a single task failure.

@tekton-robot tekton-robot requested review from nikhil-thomas and a user February 5, 2021 19:20
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 5, 2021
Copy link
Contributor

@bobcatfish bobcatfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/assign


* As a pipeline author, I would like to design a pipeline where a task running
[unit tests](https://github.com/tektoncd/catalog/tree/master/task/golang-test/0.1) might fail,
but can continue running integration tests, so that my pipeline can identify failures in both the tests.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to explain a bit more here? one way this could be addressed today is to run the unit tests and integration tests in parallel; that way the unit tests can fail but the integration tests will still complete

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pritidesai i was thinking about this more and i have a couple ideas of how we can tweak the use cases so that running the tests in parallel won't solve the problem (maybe this is what you already had in mind)

  • A "warning only" use case - e.g. a pipeline wants to run some tests, and in parallel it wants to run something option, e.g. maybe some really intense linting, and the pipeline doesn't need to fail if the linting fails
    |             |
    v             v
linting       unit tests
                  |
                  v    
              integration tests

There's another very similar use case, where you have the same pipeline as above, but you DO want linting to fail the pipeline, you just don't want it to interrupt the unit tests -> integration tests branch <-- for that case i wonder if pipelines in pipelines is more appropriate? not sure!

Copy link
Member Author

@pritidesai pritidesai Feb 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @bobcatfish for looking into it 😻

Yes, pipeline in pipeline can be an appropriate solution to not interrupt the combo of unit tests -> integration tests. I will add both the points (running in parallel and pipeline in pipeline) in the TEP.

But, I was thinking of a use case in which I don't want linting failure or unit test failure to interrupt deploying an application to a staging cluster so that the new features in the app can be shared for early feedback.

    |               |                     |
    v               v                     v
linting       unit tests      integration tests
                                          |
                                          v    
                                      deploying to staging cluster

One more use case I was thinking of having one single GitHub repo with multiple runtimes, and building/deploying an image for each runtime, and I don't want failure in one runtime interrupt the other runtime. But we can argue that this use case can fit well with pipeline in pipeline approach.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's another very similar use case, where you have the same pipeline as above, but you DO want linting to fail the pipeline, you just don't want it to interrupt the unit tests -> integration tests branch <-- for that case i wonder if pipelines in pipelines is more appropriate? not sure!

@bobcatfish @pritidesai could this be handled at the pipeline level too —aka some configuration of the pipeline that says "don't interup other branches" execution ?
I see two different problem we are trying to tackle here:

  1. Do not fail the pipeline if a task fail (and continue the execution, …)
  2. Do fail the pipeline if a task fail but do not interrupt other branches (or even the current one, also this would seem weird to me)

Do we want to tackle those at the same time, in the TEP ? I also tend to agree that something might be better handle by pipeline in pipeline approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do fail the pipeline if a task fail but do not interrupt other branches (or even the current one, also this would seem weird to me)

I agree - I think we're trying to tackle (1) and could leave (2) as separate (maybe pipelines in pipelines or something there is the best solution for (2))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @bobcatfish @vdemeester
yup sure, I just updated the TEP to include only (1)

but can continue running tests, so that my pipeline can report failures from the linting and all the tests.

* As a new Tekton user, I want to migrate existing workflows from the other CI/CD systems that allowed a
similar task unit of failure.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to give some examples? i think a lot of systems that folks are migrating from dont have the same parallelism that Pipelines support, so I'm wondering if that might be the answer to most of these cases

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jenkins (a Jenkinsfile) would be an example 😛

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can include an example of a Jenkinsfile (with a failure allowed) that we'd be migrating? I'm struggling with trying to understand if one of the main motivations around the migration use case is that many CD systems do not allow for parallel execution, so it'd be great to see a counter example

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I can add an example, there are other tools as well, Travis, Argo Workflow, etc

@pritidesai
Copy link
Member Author

/kind tep

@tekton-robot tekton-robot added the kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). label Feb 8, 2021
@jerop
Copy link
Member

jerop commented Feb 10, 2021

/assign

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As commented, I am a bit fuzzy if we are tackling two use cases at the same time or not, and if we should.


* As a pipeline author, I would like to design a pipeline where a task running
[unit tests](https://github.com/tektoncd/catalog/tree/master/task/golang-test/0.1) might fail,
but can continue running integration tests, so that my pipeline can identify failures in both the tests.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's another very similar use case, where you have the same pipeline as above, but you DO want linting to fail the pipeline, you just don't want it to interrupt the unit tests -> integration tests branch <-- for that case i wonder if pipelines in pipelines is more appropriate? not sure!

@bobcatfish @pritidesai could this be handled at the pipeline level too —aka some configuration of the pipeline that says "don't interup other branches" execution ?
I see two different problem we are trying to tackle here:

  1. Do not fail the pipeline if a task fail (and continue the execution, …)
  2. Do fail the pipeline if a task fail but do not interrupt other branches (or even the current one, also this would seem weird to me)

Do we want to tackle those at the same time, in the TEP ? I also tend to agree that something might be better handle by pipeline in pipeline approach.

but can continue running tests, so that my pipeline can report failures from the linting and all the tests.

* As a new Tekton user, I want to migrate existing workflows from the other CI/CD systems that allowed a
similar task unit of failure.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jenkins (a Jenkinsfile) would be an example 😛

@bobcatfish
Copy link
Contributor

From my end, the only thing missing to merge the problem statement is to be clear on which of the 2 cases (or both) that @vdemeester pointed out that we want to tackle and then i think we'll be gtg!

@pritidesai pritidesai force-pushed the ignore-task-failure branch 6 times, most recently from 8f9766f to eaecdea Compare February 19, 2021 22:56
@afrittoli
Copy link
Member

/assign @vdemeester

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 23, 2021
Copy link
Contributor

@bobcatfish bobcatfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

/approve

just waiting for @vdemeester to approve (and maybe @jerop since you're assigned?)

/hold

}
}
}
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!! thanks this example really helps

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 23, 2021
@tekton-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobcatfish

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 23, 2021
Adding a tep to ignore task failures, allowing pipeline authors
to unblock execution after a single failure
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 23, 2021
but the pipeline can continue running integration tests and deploying an application to a staging cluster, so that the
application can be shared with other developers for early feedback.

For example, do not fail the pipeline if `unit tests` fail. Continue deploying to a staging cluster if integration tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pritidesai does do not fail the pipeline mean that not only does execution continue but also the pipelinerun status should not be failure?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or is it just about the execution flow and it doesn't affect the overall status?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @jerop the high level idea here is to ignore/allow pipelineTask failure, continue execution and do not fail the pipelineRun.

@pritidesai does do not fail the pipeline mean that not only does execution continue but also the pipelinerun status should not be failure?

Yes, pipelineRun status should not be failure. I will cover this in the design proposal with potential alternatives. We could set the overall pipelineRun status to success, completed, success with warning, or something more relevant.

In this example, `linting` and `unit tests` are executed in parallel. This specific use case can be
supported by `pipeline in pipeline` approach by creating two separate pipelines,
(1) `linting` -> `report linter output` and (2) `unit tests` -> `integration tests`.
`pipeline in pipeline` approach would be better fit when you want `linting` to fail the pipeline.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the pipeline here referring to the first sub-pipeline or the whole pipeline?

if pipelines in pipelines approach is used, it might be better for report linter output to be a finally task which would then report the linting output and the tests sub-pipeline would execute independently in parallel so linting failure won't affect it

what do you think?

Copy link
Member Author

@pritidesai pritidesai Feb 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the pipeline here referring to the first sub-pipeline or the whole pipeline?

Its referring to the sub pipeline and in turn the whole pipeline since the sub pipeline status would be reflected in the overall pipelineRun status.

what do you think?

Yup, having report linter output under finally won't be affected in terms of the execution by the linting failure but that sub pipeline would still result in failure.

Here the idea is to treat linting as a warning. linting is part of the pipeline but its failure has no impact on the execution of the rest of the pipeline or the overall status.

Edit: And, if we combine two features pipeline in pipeline and ignore/allow failure of the linter sub-pipeline (when defined as a pipelineTask) would not impact overall status as well.

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question, how does this relates to TEP-0007 ?

  • Design a task failure strategy so that the pipeline author can control the behavior of the underlying tasks
    and decide whether to continue executing the rest of the graph in the event of failure.

This goal seems to be aligned with TEP-0007. I guess the only difference is the final status of the Pipeline (succeeded in this TEP, still fail in 0007).

If we go with this approach, is the other TEP required ? and vice-versa ?

@jerop
Copy link
Member

jerop commented Mar 3, 2021

One question, how does this relates to TEP-0007 ?

  • Design a task failure strategy so that the pipeline author can control the behavior of the underlying tasks
    and decide whether to continue executing the rest of the graph in the event of failure.

This goal seems to be aligned with TEP-0007. I guess the only difference is the final status of the Pipeline (succeeded in this TEP, still fail in 0007).

If we go with this approach, is the other TEP required ? and vice-versa ?

yes yes, the two are aligned and that's why I'd put #258 on hold to explore failure strategies more broadly beyond allowing skipping a single task, which led to creating Pipelines in Pipelines to solve for some of the failure scenarios

if you define failures as:

which one did you intend to address here @pritidesai?

regardless, these issues should/will be solved together and there are some thoughts about the overlap in this doc I shared with @pritidesai

@pritidesai
Copy link
Member Author

Thanks @vdemeester and @jerop

One question, how does this relates to TEP-0007 ?

  • Design a task failure strategy so that the pipeline author can control the behavior of the underlying tasks
    and decide whether to continue executing the rest of the graph in the event of failure.

This goal seems to be aligned with TEP-0007. I guess the only difference is the final status of the Pipeline (succeeded in this TEP, still fail in 0007).

TEP-0007 is limited to a skipped task i.e. a task is skipped and not executed. The main goal of this TEP is to ignore failure of a task i.e. a task is executed and declared failure after exhausting all the retries, such task would be considered as a success, skipped tasks remain skipped.

If we go with this approach, is the other TEP required ? and vice-versa ?

Yes both the TEPs are required, as both are addressing the flow control based on the different statuses of a pipelineTask. And at the same time, it is possible to combine features from both the TEPs.

In the following example, skip-this-task will be skipped since when expressions evaluates to false and since it is not executed, ignoreFailure will have no effect. The overall status of the pipeline stays completed.

tasks:
  - name: skip-this-task
    when:
      - input: "foo"
        operator: in
        values: ["bar"]
    ignoreFailure: true

Now, in this example, ignore-failure-of-this-task is executed and exited with failure, ignoreFailure will declare this task as a success and run-after-ignore-failure-of-this-task is executed. The overall status of the pipeline will be success.

tasks:
  - name: ignore-failure-of-this-task
    when:
      - input: "foo"
        operator: in
        values: ["foo"]
    ignoreFailure: true
  - name: run-after-ignore-failure-of-this-task
    runAfter: [ "ignore-failure-of-this-task" ]

Now, with TEP-0007, skip-this-task is skipped because of the when expressions evaluating to false(and ignoreFailure has no impact) but continueAfterSkip allows execution of the run-after-skip-this-task.

tasks:
  - name: skip-this-task
    when:
      - input: "foo"
        operator: in
        values: ["bar"]
    continueAfterSkip: true
    ignoreFailure: true
  - name: run-after-skip-this-task
    runAfter: [ "skip-this-task" ]

@pritidesai
Copy link
Member Author

got an error, then another difference is #258 is about ignoring task skips while this one is about ignoring task failures

which one did you intend to address here @pritidesai?

thanks @jerop, this TEP is focused on ignoring task failures after getting an error.

@vdemeester
Copy link
Member

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 4, 2021
@pritidesai
Copy link
Member Author

thanks for the reviews @bobcatfish @vdemeester @jerop 🙏

/hold cancel

@tekton-robot tekton-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 4, 2021
@tekton-robot tekton-robot merged commit 050ea47 into tektoncd:main Mar 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: Proposed
Development

Successfully merging this pull request may close these issues.

6 participants