-
Notifications
You must be signed in to change notification settings - Fork 671
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Subworkflow recover mode is broken #2840
Comments
cc @hamersaw |
This seems to be happening because we use the node ID from the node execution context to retrieve the node information during recovery here. This function doesn't prepend the node ID with the parent, so in the case of our subworkflow the node IDs are "n0" and "n1" rather than "n0/n0" and "n0/n1" as is they should be. In this example the "n1" node in "my_subwf" is successfully found (but incorrectly) as node "n1" or the top level "parent_wf" from the previous execution. It's unclear if this problem occurs for dynamic tasks as well. Looking deeper and will submit a fix shortly. |
nice find! thank you @hamersaw for looking into this! |
OK, this is actually pretty complex - and can confirm dynamic tasks are broken too, but maybe worse. As explained above when we attempt to recover the node execution we use the NodeID from the NodeExecutionMetadata. This means that nodes with parents will not include the parent info, which is not the same as how the node ID is set during eventing. The result is that recovery of the subworkflow node "n1" searches for "n1" in the top-level workflow execution id rather than "n0-0-n1" like it should. This fix is relatively easy, but unfortunately doesn't cover all cases. The real problem happens in dynamic tasks, because the parent info may include a retry (which in subworkflows will always be 0). If the dynamic task parent task fails (incrementing the retry to 1) then as subnodes execute the fully qualified ID will be prepended with "n0-1" for the parent node id and retry attempt. So in this example rather than "n0-0-n1" it would be "n0-1-n1". If we attempt to recover this dynamic task, we will search for "n0-0" because it doesn't know if the parent was retried or not (this may require an additional lookup in flyteadmin? and gets way more complex). TL;DR quick fix is relatively easy - will not result in incorrect behavior, but in corner cases will recompute previously completed tasks. The correct fix is probably a bit more complex. cc @katrogan thoughts? |
Describe the bug
When recovering this nested_parent_wf, the second t1 was incorrectly recovered with the subworkflow outputs, not the task outputs:
We should expect the outputs to be
c
andt1_int_output
which corresponds to the t1 task signature
and not the subworkflow signature: https://github.com/flyteorg/flytesnacks/blob/1b86bdd84cb55ad65c9c97256d91da0c1a3e2610/cookbook/core/control_flow/subworkflows.py#L67 as observed in the buggy recovered t1
Expected behavior
Subworkflow recovery should work correctly.
Additional context to reproduce
No response
Screenshots
No response
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?
The text was updated successfully, but these errors were encountered: