[BUG] Potential race condition in Flyte Propeller #3582
Closed · 2 tasks done
pablocasares opened this issue on Apr 11, 2023 · 3 comments · Fixed by flyteorg/flyteadmin#551, flyteorg/flytepropeller#553 or flyteorg/flytepropeller#574
Labels: bug (Something isn't working)
Comments
pablocasares added the bug (Something isn't working) and untriaged (This issue has not yet been looked at by the Maintainers) labels on Apr 11, 2023
Thank you for opening your first issue here! 🛠
Thank you for the issue. We will take a look ASAP.
eapolinario removed the untriaged label on Apr 14, 2023
Hey @kumare3 & @EngHabu! This week we upgraded to Flyte 1.6.1, which includes this patch, and ran some tests, but the error is still there. It is not exactly the same error, but the behavior is the same. StackTrace
This was referenced Jun 6, 2023
Describe the bug
I am monitoring multiple workflows containing subworkflows running in parallel. I'm using Propeller v1.1.70.
Some of the executions fail with this error:
CausedByError: Failed to propagate Abort for workflow. Error: 0: [SystemError] system error, caused by: rpc error: code = PermissionDenied desc = Cannot abort an already terminate workflow execution
One of the subworkflows is intended to fail under certain conditions. When this subworkflow fails, Propeller tries to abort the rest of the running subworkflows. Sometimes they are aborted properly, but other times Propeller receives the PermissionDenied error above from Flyte Admin.
This looks like a race condition in Propeller: it tries to abort a workflow that is already in a terminated status. When Propeller checks the status of the remaining subworkflows they are "running", but by the time the abort is called they have already moved to a terminated status. When this happened, the finish times of the failing subworkflow (the one trying to abort the rest) and the successful one differed by only 3 ms. So I think the status is reported as running when Propeller checks it, even though the subworkflow has actually Succeeded by the time the abort call is executed. These lines may be relevant to the issue: https://github.com/flyteorg/flytepropeller/blob/master/pkg/controller/nodes/task/handler.go#L795-L825
(currentPhase might change by the time p.Abort is called)
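To illustrate the suspected check-then-act window, here is a minimal, deterministic sketch. It is not Propeller code: `FakeAdmin`, `Phase`, `ErrAlreadyTerminated`, and `checkThenAbort` are hypothetical names invented for this example, and the sentinel error only stands in for the gRPC PermissionDenied response quoted above. The `between` callback simulates a subworkflow finishing after its phase was read but before the abort is sent.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Phase is a hypothetical, simplified execution phase.
type Phase int

const (
	Running Phase = iota
	Succeeded
	Aborted
)

// ErrAlreadyTerminated stands in for the PermissionDenied error in the issue.
var ErrAlreadyTerminated = errors.New("cannot abort an already terminated workflow execution")

// FakeAdmin is a hypothetical in-memory owner of the execution state.
type FakeAdmin struct {
	mu    sync.Mutex
	phase Phase
}

// GetPhase returns the phase as observed at the time of the call.
func (a *FakeAdmin) GetPhase() Phase {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.phase
}

// Finish marks a running execution as succeeded.
func (a *FakeAdmin) Finish() {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.phase == Running {
		a.phase = Succeeded
	}
}

// Abort rejects the request if the execution is no longer running.
func (a *FakeAdmin) Abort() error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.phase != Running {
		return ErrAlreadyTerminated
	}
	a.phase = Aborted
	return nil
}

// checkThenAbort reads the phase first and acts on it later.
// between runs in the gap, simulating a concurrent completion.
func checkThenAbort(a *FakeAdmin, between func()) error {
	if a.GetPhase() != Running {
		return nil // nothing to abort
	}
	between()
	return a.Abort() // the earlier observation may be stale by now
}

func main() {
	quiet := &FakeAdmin{}
	fmt.Println(checkThenAbort(quiet, func() {})) // no interleaving: abort succeeds

	racy := &FakeAdmin{}
	fmt.Println(checkThenAbort(racy, racy.Finish)) // completion lands in the gap: abort is rejected
}
```

Each method is individually safe under the mutex, yet the sequence "read phase, then abort" is not atomic, which is the behavior the 3 ms gap in the report suggests.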
Please check the attached screenshots to see how different executions of the same code produce different results.
Eventually, the parent workflow (the one containing the subworkflows) fails with this error:
RuntimeExecutionError: max number of system retry attempts [51/50] exhausted.
This error increases the number of calls made to FlyteAdmin and also increases the metric associated with the PermissionDenied error.
Please do not hesitate to ask for further information if needed.
Expected behavior
FlytePropeller should not retry aborting a node that is already in a terminated status, and the parent workflow should update that node's status to the terminated status (sometimes the node is shown as running in the parent even though it is shown as succeeded when you open the subworkflow).
Additional context to reproduce
No response
Screenshots
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?