[BUG] Subworkflow recover mode is broken #2840

Closed · Fixed by flyteorg/flytepropeller#481

katrogan opened this issue Sep 1, 2022 · 5 comments

Labels: bug (Something isn't working)

katrogan (Contributor) commented Sep 1, 2022

Describe the bug

When recovering this nested_parent_wf, the second t1 was incorrectly recovered with the subworkflow outputs, not the task outputs:

[Screenshot: the recovered t1 node showing the subworkflow's outputs instead of the task's outputs]

We should expect the outputs to be c and t1_int_output:

[Screenshot: the expected node outputs, c and t1_int_output]

These correspond to the t1 task signature, not the subworkflow signature (https://github.com/flyteorg/flytesnacks/blob/1b86bdd84cb55ad65c9c97256d91da0c1a3e2610/cookbook/core/control_flow/subworkflows.py#L67) that shows up in the buggy recovered t1.
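For reference, a condensed sketch of the relevant workflow structure (an approximate reconstruction of the flytesnacks example, not the exact file contents):

```python
import typing

from flytekit import task, workflow


@task
def t1(a: int) -> typing.NamedTuple("OutputsBC", t1_int_output=int, c=str):
    # Task outputs: t1_int_output and c -- these are what the recovered node should show.
    return a + 2, "world"


@workflow
def my_subwf(a: int = 42) -> typing.Tuple[str, str]:
    # Subworkflow outputs: two strings -- NOT what the recovered t1 node should show.
    x, y = t1(a=a)
    u, v = t1(a=x)  # the "second t1" referenced above
    return y, v


@workflow
def nested_parent_wf(a: int = 42) -> typing.Tuple[str, str]:
    # Runs my_subwf as a subworkflow of the parent workflow.
    x, y = my_subwf(a=a)
    return x, y
```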

Expected behavior

Subworkflow recovery should work correctly.

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
katrogan added the "bug" and "untriaged" labels Sep 1, 2022
katrogan commented Sep 1, 2022

cc @hamersaw

kumare3 commented Sep 3, 2022

cc @hamersaw

@katrogan this is a correctness issue. Can we get on this?

hamersaw self-assigned this Sep 6, 2022
hamersaw removed the "untriaged" label Sep 6, 2022
hamersaw added this to the 1.2.0 milestone Sep 6, 2022
hamersaw commented Sep 6, 2022

This seems to be happening because we use the node ID from the node execution context to retrieve the node information during recovery here. This function doesn't prepend the parent's ID to the node ID, so in the case of our subworkflow the node IDs are "n0" and "n1" rather than "n0/n0" and "n0/n1" as they should be. In this example the "n1" node in "my_subwf" is successfully (but incorrectly) matched to node "n1" of the top-level "parent_wf" from the previous execution.

It's unclear if this problem occurs for dynamic tasks as well. I'm looking deeper and will submit a fix shortly.
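To make the mismatch concrete, here is a purely illustrative sketch (not the actual flytepropeller code), assuming a hypothetical qualified_node_id helper:

```python
def qualified_node_id(parent_prefix: str, node_id: str) -> str:
    # Illustrative only: eventing records subworkflow nodes under a
    # parent-qualified ID, while the recovery lookup currently uses the bare ID.
    return f"{parent_prefix}/{node_id}" if parent_prefix else node_id


# Recovery currently searches with the bare IDs "n0" and "n1", so the
# subworkflow's "n1" matches node "n1" of the top-level parent_wf instead.
assert qualified_node_id("n0", "n0") == "n0/n0"
assert qualified_node_id("n0", "n1") == "n0/n1"
```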

katrogan commented Sep 6, 2022

nice find! thank you @hamersaw for looking into this!

hamersaw commented Sep 7, 2022

OK, this is actually pretty complex - and I can confirm dynamic tasks are broken too, possibly worse. As explained above, when we attempt to recover a node execution we use the NodeID from the NodeExecutionMetadata. This means that nodes with parents will not include the parent info, which is not how the node ID is set during eventing. The result is that recovery of the subworkflow node "n1" searches for "n1" in the previous top-level workflow execution rather than for "n0-0-n1" as it should. This fix is relatively easy, but unfortunately it doesn't cover all cases.

The real problem happens in dynamic tasks, because the parent info may include a retry attempt (which in subworkflows will always be 0). If the dynamic (parent) task fails, incrementing its retry attempt to 1, then as the subnodes execute their fully qualified IDs are prefixed with "n0-1", i.e. the parent node ID plus the retry attempt. So in this example rather than "n0-0-n1" it would be "n0-1-n1". If we attempt to recover this dynamic task, we will search for "n0-0" because the recovering execution doesn't know whether the parent was retried in the previous execution (determining this may require an additional lookup in flyteadmin, and gets considerably more complex).
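Purely as an illustration (again, not the actual propeller code), the retry attempt is what makes the dynamic-task case ambiguous:

```python
def child_node_id(parent_node_id: str, parent_retry_attempt: int, node_id: str) -> str:
    # Illustrative only: subnodes are qualified with the parent node ID plus
    # the parent's retry attempt, e.g. "n0-1-n1" after one parent retry.
    return f"{parent_node_id}-{parent_retry_attempt}-{node_id}"


# Subworkflow case: the parent retry attempt is always 0, so the lookup is deterministic.
assert child_node_id("n0", 0, "n1") == "n0-0-n1"

# Dynamic task case: if the parent task failed once, its subnodes ran under
# attempt 1, but a naive recovery still searches under attempt 0 and misses them.
assert child_node_id("n0", 1, "n1") == "n0-1-n1"
```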

TL;DR the quick fix is relatively easy and will not result in incorrect behavior, but in corner cases it will recompute previously completed tasks. The correct fix is probably a bit more complex. cc @katrogan thoughts?
