[BUG] Aborted child workflows should better report root cause of abort #408
Comments
@katrogan I am unable to reproduce this issue. To try to reproduce it, I used the flytesnacks example control_flow/subworkflows.py and modified it so that the parent workflow fails after 45 seconds. I always launch the workflow with a = 3 so that the parent throws an exception. I added a sleep in the parent execution to allow enough time for the child workflow containers to be created, and a 90-second sleep in the child workflow so that it keeps running and does not complete before the parent. Let me know if there is some other specific example that can reproduce it.
Here are the screenshots for two cases of the workflow execution:
2] Execution is aborted while the parent and child workflows are executing.
I am also going through the propeller code to check whether there is a case I am missing here. |
Another data point: it was fixed as part of this PR, flyteorg/flytepropeller#84. handleFailingWorkflow and HandleAbortedWorkflow are two different paths. I am also trying to see how these two converge. |
Do you have screenshots of the child workflow? I think the issue is that the error message 'Workflow aborted' is non-descriptive and doesn't indicate by whom it was aborted (the failing parent wf in this case). |
@katrogan can you guide me on how to find the child workflow page? I assumed the screenshots I posted would cover this. There is also a graph view where I saw the same data. Is there some place else? |
OK. I was able to repro this with a slightly different example where we use the launch plan within another workflow.
Parent workflow
Child workflow
cc: @kumare3 |
After debugging further: in the current state, the abort workflow relies on the DeleteTimestamp to determine whether the workflow has been aborted, and this cannot distinguish between a user abort and a failing parent task triggering a subworkflow execution abort. The abort stage, which is part of the control loop, only has access to the v1alpha1.FlyteWorkflow object, whose desired and actual states it compares to make changes. When the parent workflow fails, it calls abort with the abort handler for the nodeExecutor, which in turn calls the subworkflow abort handler with the reason for all downstream nodes, ultimately recording this as
with reason "Some node execution failed, auto-abort". This is shown on the parent overview page. When this event is raised and auto-abort is called, a reason string is constructed to be sent to admin for the Terminate request, but this information is not shown on the console. Ideally we should be pulling this into the UI console. On the child workflow page, when the control loop runs later, it sees that all child subworkflows terminated by the parent have a deleteTimeStamp, and hence sends another NodeExecutionEvent, but this time with the static reason 'Workflow aborted.' Nowhere is the error information stored in the in-memory v1alpha1.FlyteWorkflow object, where the control loop flow could access it, and hence when this executes, the error is stored as "Workflow aborted" in the event.NodeExecutionEvent. |
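The behaviour described in this comment can be modelled with a small sketch. Everything in it is hypothetical (the `Workflow` dataclass and the `parent_auto_abort`, `user_abort`, and `control_loop_abort` helpers are illustration names, not flytepropeller code); only the reason strings are taken from this thread. It assumes the deletion timestamp is the sole abort signal persisted on the workflow object, which is why the later control-loop pass can only emit a static message.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Reason strings quoted in this thread.
AUTO_ABORT_REASON = "Some node execution failed, auto-abort"
STATIC_ABORT_REASON = "Workflow aborted."


@dataclass
class Workflow:
    """Hypothetical stand-in for the in-memory workflow object."""
    name: str
    # Assumption: this timestamp is the only abort signal that is stored.
    deletion_timestamp: Optional[datetime] = None


def parent_auto_abort(child: Workflow) -> str:
    """Parent failure path: the reason is known here but is not stored on the child."""
    child.deletion_timestamp = datetime.now(timezone.utc)
    return AUTO_ABORT_REASON


def user_abort(child: Workflow, cause: str) -> str:
    """User abort path: the cause is known here but is not stored on the child either."""
    child.deletion_timestamp = datetime.now(timezone.utc)
    return cause


def control_loop_abort(wf: Workflow) -> Optional[str]:
    """Later control-loop pass: it sees only the timestamp, so the reason is static."""
    if wf.deletion_timestamp is None:
        return None
    return STATIC_ABORT_REASON


child_a = Workflow("child-of-failed-parent")
child_b = Workflow("child-aborted-by-user")
parent_reason = parent_auto_abort(child_a)
user_reason = user_abort(child_b, "Console termination")

# Both children end up with the same static reason, whatever the real cause was.
print(parent_reason, "|", control_loop_abort(child_a))
print(user_reason, "|", control_loop_abort(child_b))
```

In this model the two abort paths become indistinguishable once only the timestamp survives, which matches the symptom reported on the child workflow page.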
Following is the execution data of the child workflow
And this is the node execution data for the same identifier.
The UI shows the error coming from the node execution. I see three options to fix this:
2] Flyteadmin gets the current event, detects that it is an abort, looks up the execution data for it, and amends the error message to the value coming from the execution's abort metadata.
3] The UI shows an execution error message view with both the node abort message and the execution abort message, which will make it clearer.
Cleaner, in my opinion, would be to show the abort metadata from the execution on flyteconsole instead of modifying anything in propeller or admin. @katrogan @kumare3 please suggest what would be appropriate here. |
+1 for option 1). I like the idea of not having an external service (admin, console) deal with reconciling state. Much simpler and more straightforward to have it recorded correctly. |
After chatting with @kumare3, ignore my suggestion! I think admin, as the control plane, is better positioned to reconcile the error message vs. the abort cause by traversing the parent lineage. |
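One way to picture "traversing the parent lineage" is the sketch below. It is not flyteadmin code: the `Execution` record, its field names, and `describe_abort` are hypothetical, and it assumes each execution stores a reference to its parent plus either an error message or an abort cause.

```python
from dataclasses import dataclass
from typing import Dict, Optional

GENERIC_ABORT = "Workflow aborted."


@dataclass
class Execution:
    """Hypothetical execution record; field names are assumptions."""
    exec_id: str
    parent_id: Optional[str] = None
    abort_cause: Optional[str] = None
    error_message: Optional[str] = None


def describe_abort(executions: Dict[str, Execution], exec_id: str) -> str:
    """Walk up the parent chain until a descriptive cause is found."""
    seen = set()
    current = executions.get(exec_id)
    while current is not None and current.exec_id not in seen:
        seen.add(current.exec_id)  # guard against cyclic parent references
        if current.error_message:
            if current.exec_id == exec_id:
                return current.error_message
            return (
                f"Aborted because parent execution {current.exec_id} "
                f"failed: {current.error_message}"
            )
        if current.abort_cause and current.abort_cause != GENERIC_ABORT:
            return current.abort_cause
        current = executions.get(current.parent_id) if current.parent_id else None
    return GENERIC_ABORT


executions = {
    "parent": Execution("parent", error_message="task failed"),
    "child": Execution("child", parent_id="parent", abort_cause=GENERIC_ABORT),
    "solo": Execution("solo", abort_cause="Console termination"),
}
print(describe_abort(executions, "child"))
print(describe_abort(executions, "solo"))
```

The design choice here is that the child record is never rewritten; the more descriptive message is derived at read time from whatever the ancestors recorded.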
Here's a simple abort use case: the UI doesn't show the message that was entered by the user from flyteconsole. Here's the output from flytectl. The abort string "Console termination" was entered from flyteconsole, but it doesn't show up in the UI, since the UI shows the node execution data from the closure. But if we look at the actual execution data, it does show the abort data (Node execution closure.abort != Execution closure.abort). The other cases are those where we have ERROR data populated, which are basically the failure cases. Reconciling in admin for the abort case where Node execution closure.abort != Execution closure.abort would be OK, but I think if we can have the UI show both the abort and error data from executions, that would be OK too for now. I would leave the reconciliation case up to you guys. I have set up time to discuss this. It shouldn't take much time.
MR to dump the error and abort data from flytectl is here flyteorg/flytectl#79 |
@jsonporter, please review this issue. Essentially, the UI is currently not displaying the abortMetadata within an execution closure; it only shows a message if the error field is populated. In my last update I showed how I implemented this in flytectl. The UI should do a conditional check based on which data is available from the oneof and use that to show the abort or error data. I am assigning this to you. Please let me know if you have any questions regarding this. |
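The conditional check on the oneof could look roughly like the sketch below. It uses simplified, hypothetical stand-ins (`ExecutionClosure`, `AbortMetadata`, `ExecutionError`, and `status_message` are illustration names, and the `cause`/`principal` fields are assumptions), not the generated Flyte IDL classes or the flytectl implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AbortMetadata:
    """Simplified stand-in; field names are assumptions."""
    cause: str
    principal: str = ""


@dataclass
class ExecutionError:
    message: str


@dataclass
class ExecutionClosure:
    # Mimics a oneof: at most one of these fields is expected to be set.
    error: Optional[ExecutionError] = None
    abort_metadata: Optional[AbortMetadata] = None


def status_message(closure: ExecutionClosure) -> Optional[str]:
    """Pick the message to display based on which result field is populated."""
    if closure.abort_metadata is not None:
        who = closure.abort_metadata.principal
        by = f" by {who}" if who else ""
        return f"Aborted{by}: {closure.abort_metadata.cause}"
    if closure.error is not None:
        return f"Error: {closure.error.message}"
    return None  # nothing to show, e.g. still running or succeeded


aborted = ExecutionClosure(abort_metadata=AbortMetadata("Console termination"))
failed = ExecutionClosure(error=ExecutionError("task failed"))
print(status_message(aborted))
print(status_message(failed))
```

With a check like this, an aborted execution displays its abort cause instead of an empty error panel.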
Describe the bug
Currently, if a parent workflow launches a child workflow and the parent workflow fails, it will auto-abort its child workflows. On the parent overview page, the node that launched the child workflow correctly reports
Some node execution failed, auto-abort.
However, on the page for the actual child workflow, the aborted task shows as
Workflow aborted.
without any indication why. This is a confusing experience for users debugging or looking at the failed child workflow.
Expected behavior
A clear and concise description of what you expected to happen.
Flyte component
To Reproduce
Steps to reproduce the behavior:
Screenshots
If applicable, add screenshots to help explain your problem.
Environment
Flyte component
Additional context
N/A