Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Flyte Map Tasks should only retry subtasks #1276

Closed
migueltol22 opened this issue Jul 22, 2021 · 2 comments · Fixed by flyteorg/flyteplugins#236
Closed

[BUG] Flyte Map Tasks should only retry subtasks #1276

migueltol22 opened this issue Jul 22, 2021 · 2 comments · Fixed by flyteorg/flyteplugins#236
Assignees
Labels
bug Something isn't working

Comments

@migueltol22
Copy link
Contributor

Describe the bug
Currently map tasks will retry the entire map task(all sub tasks) if a single subtask fails and count the retry toward the entire map task. This isn't ideal and should only be retried for that specific subtask that failed and be counted toward a single subtask retry.

We can technically get around this by setting retries to a very large number and with caching at some point it would make progress to completion but this is not ideal.

Expected behavior

I would expect map tasks to only retry failed sub tasks even if caching is not specified and to have retries count at the sub task level.

For example if retries are set to 2 and I have 10 subtasks, each subtask can fail twice before the entire map task would be counted as Failed. If any subtask fails more than the retry limit set, I believe it is reasonable to fail the entire map task and stop any other sub tasks currently running.

[Optional] Additional context
To Reproduce
Steps to reproduce the behavior:
1.
2.

Screenshots
If applicable, add screenshots to help explain your problem.

@migueltol22 migueltol22 added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Jul 22, 2021
@georgesnelling
Copy link
Contributor

@EngHabu @katrogan: Is this propeller or admin or both?

@kumare3 kumare3 removed the untriaged This issues has not yet been looked at by the Maintainers label Aug 9, 2021
@EngHabu
Copy link
Contributor

EngHabu commented Aug 9, 2021

There is precedence for us passing the "retry count" down to the downstream system (e.g. Qubole) to do the retries there... I think the right thing here is to follow that and do the retrying in the K8s Array Plugin. One way to do this is:

  1. The ArrayStatuses we track in the plugin should be expanded by the factor of retry count...
  2. We should append the attempt number to the Pod name.
  3. We should track logs... etc. separately for separate retries...
  4. The most delicate change is to construct an output writer with the retry attempt and then modifying the output assembler to know to pick the outputs of all successful attempts...

P.S. We should not use the native Pod retries because there is no way to avoid clobbering the output directory and noway to separate out the logs... besides, they only restart failing container not the entire pod...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants