[BUG] Aborted child workflows should better report root cause of abort #408

katrogan · 2020-07-10T21:36:55Z

Describe the bug
Currently if a parent workflow launches a child workflow and parent workflow fails it will auto-abort children workflows. On the parent overview page the node that launched the child workflow correctly reports Some node execution failed, auto-abort. However on the page for the actual child workflow the aborted task shows as Workflow aborted. without any indication why. This is a confusing experience for users debugging or looking at the failed child workflow.

Expected behavior
A clear and concise description of what you expected to happen.

Flyte component

To Reproduce
Steps to reproduce the behavior:

Create a workflow that launches a long-ish running child workflow
Have this parent workflow fail a downstream task
Observe the execution page for the child workflow.

Screenshots
If applicable, add screenshots to help explain your problem.

Environment
Flyte component

Additional context
N/A

The text was updated successfully, but these errors were encountered:

gsimko-lyft · 2020-07-10T21:38:11Z

👍

pmahindrakar-oss · 2021-05-20T10:59:56Z

@katrogan I am unable to reproduce this issue.

For reproing the issue , i have used the following example from flytesnacks control_flow/subworkflows.py

And modified it to fail from the parent workflow after 45 sec . I always launch the workflow with value of a= 3 so that the parent throws an exception. Have added sleep in parent execution to allow enough time for the child workflow containers to be created and also a sleep in child workflow of 90 sec to keep them running and not be completed before the parent.

Let me know if there some other specific example that can repro it.

@task
def t1(a: int) -> typing.NamedTuple("OutputsBC", t1_int_output=int, c=str):
    if a == 3:
      time.sleep(45)
      raise Exception("Sorry, failing the parent workflow")
    time.sleep(90) 
    return a + 2, "world"

@workflow
def my_subwf(a: int = 42) -> (str, str):
    x, y = t1(a=a)
    u, v = t1(a=23)
    return y, v

@workflow
def parent_wf(a: int) -> (int, str, str):
    x, y = t1(a=a).with_overrides(node_name="node-t1-parent")
    u, v = my_subwf(a=10)
    return x, u, v

Here are the screenshots with two cases of the workflow execution
1] Parent workflow aborts due to failure in execution of a child task and child worklow is aborted due to this.

2] Execution is aborted while the parent and child workflows are executing

Also going through the propeller code to dig if there is a case i am missing here

pmahindrakar-oss · 2021-05-20T12:40:18Z

Another data point here
There was an issue where the reason field was not being used at all during handling failed workflow execution.

It was fixed here
https://github.com/flyteorg/flytepropeller/pull/84/files#diff-854663e4d10286e61f445300917772f010c74cbdc602948568084d4ef894259cR97

As part of this PR flyteorg/flytepropeller#84

handleFailingWorkflow And HandleAbortedWorkflow are two different paths

Also
Log message : "Some node execution failed, auto-abort" is part of handleFailingWorkflow
Log message : "Workflow aborted" is part of the HandleAbortedWorkflow

Trying to see how these two converge

katrogan · 2021-05-20T16:17:50Z

do you have screenshots of the child workflow? i think the issue is that error messaging 'Workflow aborted' is non-descriptive and doesn't indicate by whom it was aborted (the failing parent wf in thise case)

pmahindrakar-oss · 2021-05-20T16:58:14Z

@katrogan can you guide me in how to look for the child workflow page.I assumed the screenshots I posted would cover this .Also there is graph view where also I saw the same data .Is there some place else ?

pmahindrakar-oss · 2021-05-24T14:23:18Z

Ok . I was able to repro this with a slightly different example where we use the launchplan within another workflow


@task
def greet(day_of_week: str, number: int, am: bool) -> str:
    if day_of_week == "unknown_day" :
      time.sleep(45)
      raise Exception("Sorry, failing the parent workflow since i dont know how to greet for unknown_day")
    time.sleep(90)
    greeting = "Have a great " + day_of_week + " "
    greeting += "morning" if am else "evening"
    return greeting + "!" * number


@workflow
def go_greet(day_of_week: str, number: int, am: bool = False) -> str:
    return greet(day_of_week=day_of_week, number=number, am=am)


morning_greeting = LaunchPlan.create(
    "morning_greeting",
    go_greet,
    #fixed_inputs={"am": True},
    default_inputs={"number": 1, "am" : True},
)

@workflow
def child_workflow():
    # Let's see if we can convincingly pass a Turing test!
    today = datetime.datetime.today()
    greet(day_of_week="unknown_day", number=3, am=True)

    for n in range(7):
        day = today + datetime.timedelta(days=n)
        weekday = calendar.day_name[day.weekday()]
        if day.weekday() < 5:
            print(morning_greeting(day_of_week=weekday))
        else:
            # We're extra enthusiastic on weekends
            print(morning_greeting(number=3, day_of_week=weekday))

Parent workflow

Child workflow

cc : @kumare3

pmahindrakar-oss · 2021-05-25T13:10:50Z

After debugging further with the current state the abort workflow relies on the DeleteTimestamp to determine if the workflow has been aborted and this cannot distinguish between user abort versus a failing parent task triggering a subworkflow execution abort .

The abort stage which is part of the control loop, it only has access to v1alpha1.FlyteWorkflow object which it compares desired and actual states to make changes.

When the parent Workflow fails , it calls abort with abort handler for the nodeExecutor which inturn call the subWorflow abort handler with the reason for all downstream nodes and ultimately recording this as

event.NodeExecutionEvent{
			Id:         nodeExecutionID,
			Phase:      core.NodeExecution_ABORTED,
			OccurredAt: ptypes.TimestampNow(),
			OutputResult: &event.NodeExecutionEvent_Error{
				Error: &core.ExecutionError{
					Code:    "NodeAborted",
					Message: reason,
				},
			},

with reason "Some node execution failed, auto-abort"

And this is shown in parent overview page

When this event is raised and auto-abort is called, a reason string is constructed to be sent to admin for Terminate request
eg : cascading abort as parent execution id [f38e7e5290f1d443fa5f] aborted, reason [Some node execution failed, auto-abort.]

But this information is not shown on the console. Ideally we should be pulling this in UI console.
Which is also mentioned by @EngHabu on this issue #232

Now on the child workflow page since after the control loop runs later, It sees the all child sub workflow as having a deleteTimeStamp that were terminated by parent and hence send another NodeExecutionEvent but this time with static reason 'Workflow aborted.'

No where the information of error is stored in the InMemory v1alpha1.FlyteWorkflow object which can be accessed through the control loop flow and hence when this executes, the error is stored as "Workflow aborted" in the event.NodeExecutionEvent

pmahindrakar-oss · 2021-05-27T11:25:09Z

Following is the execution data of the child workflow

  "closure": {
        "abort_metadata": {
            "cause": "cascading abort as parent execution id [f38e7e5290f1d443fa5f] aborted, reason [Some node execution failed, auto-abort.]"
        },
        "phase": "ABORTED",

And this the node execution data for the same identifier.

            "closure": {
                "error": {
                    "code": "NodeAborted",
                    "message": "Workflow aborted."
                },
                "phase": "ABORTED",

UI shows the error coming from the node execution.

I see three options to fix this.
1] Flytepropeller sends the NodeExecutionEvent with abort metadata in the error object.

2] Flyteadmin get the current event but find if its an abort and then find the execution data for it and amends the error message to the value coming from the execution's abort metadata.

3] UI should show the execution error message view which shows both the node abort message and execution abort message which will make it clearer.

Cleaner in my opinion would be to show the abort metadata from the execution on the flyteconsole instead of modifying anything in propeller or admin.

@katrogan @kumare3 please suggest what would be appropriate here.

katrogan · 2021-05-27T16:21:07Z

+1 for option 1) i like the idea of not having an external service (admin, console) deal with reconciling state. much simpler and straightforward to have it recorded correctly

katrogan · 2021-05-28T18:41:29Z

after chatting with @kumare3 ignore my suggestion! I think admin as the control plane is better positioned to reconcile the errors message vs abort cause by traversing the parent lineage

pmahindrakar-oss · 2021-05-31T10:14:47Z

hey @katrogan @kumare3 ,

Heres a simple abort usecase, the UI doesn't show the right message which was entered by user from flyteconsole.

heres the o/p from flytectl. So abort string of "Console termination " was entered from flyteconsole but it doesn't show up in the UI , since it show the node execution data from closure "error": { "code": "NodeAborted", "message": "Workflow aborted." },

But if we see the actual execution data , it does show the abort data .(Node execution closure . abort != Execution closure .abort)

In other cases where we have ERROR data populated which is basically failure cases in those cases
Node execution closure . error = = Execution closure .error AND UI correctly pull the execution closure . error.message field and displays it.

Reconciling in admin for the abort case where Node execution closure . abort != Execution closure .abort would be ok , But i think if we can have the UI show both the abort and error data from executions then that would ok too for now. I would leave the reconciliation case up to you guys.

I have setup time for discussing this . Shouldn't take much time

 ---------------------- --------------------------------------------------- ----------------------------------------------------------------------------------------------------------------- 
| NAME (79)            | ABORT DATA                                        | ERROR DATA                                                                                                      |
 ---------------------- --------------------------------------------------- ----------------------------------------------------------------------------------------------------------------- 
| ffaz4xni             | cascading abort as parent execution id            |                                                                                                                 |
|                      | [f1ac227769dfc48c481f] aborted, reason [Some node |                                                                                                                 |
|                      | execution failed, auto-abort.]                    |                                                                                                                 |
 ---------------------- --------------------------------------------------- ----------------------------------------------------------------------------------------------------------------- 
| f43be7092143744efb5e | Console termination                               |                                                                                                                 |
 ---------------------- --------------------------------------------------- ----------------------------------------------------------------------------------------------------------------- 
| fcdae8acc683240de84c |                                                   | [4/4] currentAttempt done. Last Error: USER::Traceback (most recent call last):                                 |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/flytekit/bin/entrypoint.py", line 101, in _dispatch_execute       |
|                      |                                                   |     outputs = task_def.dispatch_execute(ctx, idl_input_literals)                                                |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/flytekit/core/base_task.py", line 402, in dispatch_execute        |
|                      |                                                   |     native_inputs = TypeEngine.literal_map_to_kwargs(exec_ctx, input_literal_map, self.python_interface.inputs) |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/flytekit/core/type_engine.py", line 306, in literal_map_to_kwargs |
|                      |                                                   |     return {k: TypeEngine.to_python_value(ctx, lm.literals[k], v) for k, v in python_types.items()}             |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/flytekit/core/type_engine.py", line 306, in <dictcomp>            |
|                      |                                                   |     return {k: TypeEngine.to_python_value(ctx, lm.literals[k], v) for k, v in python_types.items()}             |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/flytekit/core/type_engine.py", line 284, in to_python_value       |
|                      |                                                   |     return transformer.to_python_value(ctx, lv, expected_python_type)                                           |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/flytekit/core/type_engine.py", line 186, in to_python_value       |
|                      |                                                   |     dc = expected_python_type.from_json(_json_format.MessageToJson(lv.scalar.generic))                          |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/dataclasses_json/api.py", line 76, in from_json                   |
|                      |                                                   |     return cls.from_dict(kvs, infer_missing=infer_missing)                                                      |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/dataclasses_json/api.py", line 83, in from_dict                   |
|                      |                                                   |     return _decode_dataclass(cls, kvs, infer_missing)                                                           |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/dataclasses_json/core.py", line 159, in _decode_dataclass         |
|                      |                                                   |     field_value = kvs[field.name]                                                                               |
|                      |                                                   | KeyError: 'y                                                                                                    |
 ---------------------- --------------------------------------------------- ----------------------------------------------------------------------------------------------------------------- 
| fdeb5b4f6771a47368bc |                                                   | [1/1] currentAttempt done. Last Error: USER::Traceback (most recent call last):                                 |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/flytekit/bin/entrypoint.py", line 101, in _dispatch_execute       |
|                      |                                                   |     outputs = task_def.dispatch_execute(ctx, idl_input_literals)                                                |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/flytekit/core/base_task.py", line 411, in dispatch_execute        |
|                      |                                                   |     raise e                                                                                                     |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/flytekit/core/base_task.py", line 408, in dispatch_execute        |
|                      |                                                   |     native_outputs = self.execute(**native_inputs)                                                              |
|                      |                                                   |   File "/opt/venv/lib/python3.8/site-packages/flytekit/core/python_function_task.py", line 130, in execute      |
|                      |                                                   |     return self._task_function(**kwargs)                                                                        |
|                      |                                                   |   File "/root/core/flyte_basics/lp.py", line 74, in greet                                                       |
|                      |                                                   |     raise Exception("Sorry, failing the parent workflow since i dont know how to greet for unknown_day")        |
|                      |                                                   | Exception: Sorry, failing the parent workflow since i dont know how to greet for unknown_da                     |

MR to dump the error and abort data from flytectl is here flyteorg/flytectl#79

pmahindrakar-oss · 2021-06-04T11:55:40Z

@jsonporter , Please review this issue. Essentially, the UI is currently not displaying the abortMetadata within an execution closure

https://github.com/flyteorg/flyteidl/blob/08a958c085c939101893f40fb1582e92b31c9a40/protos/flyteidl/admin/execution.proto#L116

It only show if the error field is populated
https://github.com/flyteorg/flyteidl/blob/08a958c085c939101893f40fb1582e92b31c9a40/protos/flyteidl/admin/execution.proto#L110

In my last update i have shown how i have implemented this in flytectl . The UI should do a conditional check based on which data is available from the oneof and use that to show the abort or error data.

I am assigning this to you .Please let me know if you have any questions regarding this .

* Added hotfix for end2end test Signed-off-by: Yuvraj <[email protected]>

* requirements update Signed-off-by: Haytham Abuelfutuh <[email protected]> * Set resources differently for SANDBOX vs prod Signed-off-by: Haytham Abuelfutuh <[email protected]> * bump Signed-off-by: Haytham Abuelfutuh <[email protected]> * use lower resources for sandbox Signed-off-by: Haytham Abuelfutuh <[email protected]> * bump Signed-off-by: Haytham Abuelfutuh <[email protected]> * bump Signed-off-by: Haytham Abuelfutuh <[email protected]> * register without serialize Signed-off-by: Haytham Abuelfutuh <[email protected]> * register without serialize Signed-off-by: Haytham Abuelfutuh <[email protected]> * bump Signed-off-by: Haytham Abuelfutuh <[email protected]> * bump Signed-off-by: Haytham Abuelfutuh <[email protected]> * Update requirements Signed-off-by: Haytham Abuelfutuh <[email protected]> * wip Signed-off-by: Haytham Abuelfutuh <[email protected]> * Update eda requirements Signed-off-by: Haytham Abuelfutuh <[email protected]> * Cleanup Signed-off-by: Haytham Abuelfutuh <[email protected]> * format Signed-off-by: Haytham Abuelfutuh <[email protected]>

* Added hotfix for end2end test Signed-off-by: Yuvraj <[email protected]>

* Support envs when creating execution Signed-off-by: Hongxin Liang <[email protected]> * Update doc Signed-off-by: Hongxin Liang <[email protected]> --------- Signed-off-by: Hongxin Liang <[email protected]>

katrogan added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Jul 10, 2020

pmahindrakar-oss added ui Admin console user interface and removed untriaged This issues has not yet been looked at by the Maintainers labels Jun 4, 2021

pmahindrakar-oss assigned jsonporter Jun 4, 2021

kumare3 unassigned jsonporter Jul 7, 2021

EngHabu mentioned this issue Aug 31, 2021

[BUG] When a parent wf fails and aborts sub-workflows, they show as "User Initiated Abort" #232

Closed

20 tasks

Pianist038801 mentioned this issue Aug 31, 2021

improvement: show proper error message for aborted workflows flyteorg/flyteconsole#195

Merged

8 tasks

Pianist038801 self-assigned this Sep 1, 2021

jsonporter closed this as completed in flyteorg/flyteconsole#195 Sep 1, 2021

eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 6, 2022

Hotfix docker image cache key in end2end test (flyteorg#408)

fcddb74

* Added hotfix for end2end test Signed-off-by: Yuvraj <[email protected]>

eapolinario pushed a commit to eapolinario/flyte that referenced this issue Aug 9, 2023

Hotfix docker image cache key in end2end test (flyteorg#408)

b28526a

* Added hotfix for end2end test Signed-off-by: Yuvraj <[email protected]>

runllm bot mentioned this issue May 8, 2024

[BUG] Subworkflow status is not reported to the parent workflow #5335

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Aborted child workflows should better report root cause of abort #408

[BUG] Aborted child workflows should better report root cause of abort #408

katrogan commented Jul 10, 2020

gsimko-lyft commented Jul 10, 2020

pmahindrakar-oss commented May 20, 2021

pmahindrakar-oss commented May 20, 2021

katrogan commented May 20, 2021 •

edited

Loading

pmahindrakar-oss commented May 20, 2021

pmahindrakar-oss commented May 24, 2021 •

edited

Loading

pmahindrakar-oss commented May 25, 2021

pmahindrakar-oss commented May 27, 2021

katrogan commented May 27, 2021

katrogan commented May 28, 2021

pmahindrakar-oss commented May 31, 2021 •

edited

Loading

pmahindrakar-oss commented Jun 4, 2021

[BUG] Aborted child workflows should better report root cause of abort #408

[BUG] Aborted child workflows should better report root cause of abort #408

Comments

katrogan commented Jul 10, 2020

gsimko-lyft commented Jul 10, 2020

pmahindrakar-oss commented May 20, 2021

pmahindrakar-oss commented May 20, 2021

katrogan commented May 20, 2021 • edited Loading

pmahindrakar-oss commented May 20, 2021

pmahindrakar-oss commented May 24, 2021 • edited Loading

Parent workflow

Child workflow

pmahindrakar-oss commented May 25, 2021

pmahindrakar-oss commented May 27, 2021

katrogan commented May 27, 2021

katrogan commented May 28, 2021

pmahindrakar-oss commented May 31, 2021 • edited Loading

pmahindrakar-oss commented Jun 4, 2021

katrogan commented May 20, 2021 •

edited

Loading

pmahindrakar-oss commented May 24, 2021 •

edited

Loading

pmahindrakar-oss commented May 31, 2021 •

edited

Loading