Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core Feature] Failure-Node support #1506

Open
kumare3 opened this issue Sep 20, 2021 · 13 comments
Open

[Core Feature] Failure-Node support #1506

kumare3 opened this issue Sep 20, 2021 · 13 comments
Assignees
Labels
backlogged For internal use. Reserved for contributor team workflow. enhancement New feature or request

Comments

@kumare3
Copy link
Contributor

kumare3 commented Sep 20, 2021

Motivation: Why do you think this is important?
Flyte backend supports a Failure-node for every workflow / sub-workflow. This is not currently exposed in flytekit (python or Java)

Goal: What should the final outcome look like, ideally?
Users should be able to define failure nodes for their workflows. An example for the python SDK is as follows

@task
def my_error_handler(...):
   ...

@workflow(on_error=my_error_handler)
def my_wf():
   ...

If my_wf() fails at any point during execution, it'll call my_error_handler() task and will pass some context (error info... etc.) to allow it to handle the error. The expectation is that my_error_handler() would do things like clean up resources, log/send customized notifications... etc. The thing it will NOT let you do is recover from failure... The execution of this workflow will still fail, be marked as failure and upstream callers will still be notified of its failure.

An example of sub-workflows:

@workflow(on_error=my_error_handler)
def my_sub_wf():
    ...

@workflow(failure_policy=WorkflowFailurePolicy.FAIL_AFTER_EXECUTABLE_NODES_COMPLETE)
def my_parent_wf() -> str:
   n1 = my_sub_wf()
   n2 = my_sub_wf()
   n3 = my_sub_wf()

   return n3.out1

In this case, my_parent_wf will continue running even if any of the nodes fails. The overall status of the execution will again be marked as failure but it'll let as many nodes as possible to execute... Whenever my_sub_wf fails, it'll invoke an instance of my_error_handler task to cleanup resources... etc.

Describe alternatives you've considered
NA

[Optional] Propose: Link/Inline OR Additional context
More discussion in
#1012

Related flytekit java issue - #1012

@kumare3 kumare3 added enhancement New feature or request untriaged This issues has not yet been looked at by the Maintainers and removed untriaged This issues has not yet been looked at by the Maintainers labels Sep 20, 2021
@kumare3
Copy link
Contributor Author

kumare3 commented Sep 20, 2021

cc @EngHabu / @cosmicBboy / @kanterov

@EngHabu EngHabu added this to the 0.19.3 - Feb 2021 milestone Feb 2, 2022
@EngHabu EngHabu changed the title [Core Feature]Failure-Node support in flytekit [Core Feature]Failure-Node support Feb 16, 2022
@EngHabu EngHabu changed the title [Core Feature]Failure-Node support [Core Feature] Failure-Node support Feb 16, 2022
@EngHabu EngHabu modified the milestones: 1.0.0 - Phoenix!, 1.0.1 Mar 9, 2022
@EngHabu EngHabu removed this from the 1.0.1 milestone May 4, 2022
@eapolinario eapolinario added this to the 1.3.0-candidate milestone Oct 6, 2022
@kumare3 kumare3 modified the milestones: 1.3.0-candidate, 1.4.0 Nov 15, 2022
@cosmicBboy cosmicBboy removed this from the 1.4.0 milestone Jan 18, 2023
@fvde
Copy link
Contributor

fvde commented Mar 30, 2023

As old as this issue may be, we would absolutely LOVE this and it has been a stable feature on other orchestration engines (such as Kubeflow) for many years.

@dylanwilder
Copy link

@kumare3 is this a flytekit only change or would it require changes to propellor to propagate error state

@EngHabu
Copy link
Contributor

EngHabu commented May 4, 2023

@dylanwilder the changes in the backend are mostly already done.. it's possible they have regressed because of the lack of end to end testing for it (because it's not implemented in flytekit).. would you be able to help with the flytekit side if things?

@dylanwilder
Copy link

Potentially we could pitch in since we'd like to see this, do you have anything outlining what's required?

@EngHabu
Copy link
Contributor

EngHabu commented May 10, 2023

@dylanwilder I think this is really close. Maybe attempt to use it in an example and start the debugging journey from there? Happy to be pulled in once you get flytekit to produce the spec in case you deem it a problem with the backend...

@dylanwilder
Copy link

Thanks will take a look and see!

@kumare3
Copy link
Contributor Author

kumare3 commented May 19, 2023

@eapolinario is probably also looking into this

@gitgraghu
Copy link

wondering if there is a way to support inputs and outputs that are different from the workflow interface in the failure handler..below is an example use case we were trying to implement:

  • A validation task which creates a dataframe and fails the task if there are invalid values.
  • On workflow failure, return this validation dataframe as an output of the workflow.

@EngHabu
Copy link
Contributor

EngHabu commented Oct 24, 2023

Was just brainstorming with @pingsutw now on this... here are my thoughts on UX:

@workflow
def my_wf(a: int) -> str:
   b = my_task(a=a)
   flytekit.current_context().on_failure = clean_up(a=a, b=b)
   return b

@task
def clean_up(err: Error, a: Optional[int], b: Optional[str]) -> str:
   ...
  1. Set failure node within the workflow code instead of the decorator to allow passing inputs (don't know if return b or clean_up(a=a, b=b) as an alternative syntax is too hacky?
  2. clean_up must take Error as the first parameter and can take any number of extra inputs as long as they are all Optional. Propeller will fill them in if they are available or None otherwise... it's the implementor's job to handle those cases correctly within clean_up

wdyt @kumare3 @eapolinario @gitgraghu

@pingsutw
Copy link
Member

Need some discussion about

  1. if a subworkflow and the parent workflow both have failure handlers… do both run?
  2. if i have parent_wf and sub_wf… both have failure handlers. i can see an argument to say - when i run parent_wf, if the error was in the sub_wf, just run the sub_wf failure handler, if the error happened in the parent_wf then run the parent_wf. or both run.

PRs for failure node. (still WIP)

also cc @cosmicBboy @wild-endeavor

@eapolinario eapolinario added the backlogged For internal use. Reserved for contributor team workflow. label Oct 27, 2023
@EngHabu
Copy link
Contributor

EngHabu commented Oct 28, 2023

  1. I believe it should
  2. The way I think about it is that first the subworkflow fails, so it kicks off the failure node handler. After it finishes, the subworkflow itself will report that it failed overall, then the parent node will handle failure as well..

@wild-endeavor
Copy link
Contributor

is this okay to close?

@runllm runllm bot mentioned this issue Jun 26, 2024
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backlogged For internal use. Reserved for contributor team workflow. enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

9 participants