[BUG] Flyte Map Tasks should only retry subtasks #1276

migueltol22 · 2021-07-22T21:06:01Z

Describe the bug
Currently, if a single subtask fails, a map task retries the entire map task (all subtasks) and counts the retry against the whole map task. This isn't ideal: only the specific subtask that failed should be retried, and the retry should count against that subtask alone.

We can technically work around this by setting retries to a very large number; with caching enabled, the map task would eventually make progress to completion, but this is not ideal.
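To make the cost of the workaround concrete, here is a toy model in plain Python. It is not Flyte code: `flaky` and `run_whole_task_retry` are hypothetical names, and the model assumes every uncached subtask is re-run on each attempt of the map task.

```python
from typing import Callable, Dict, List, Tuple


def flaky(value: int, failures: int) -> Callable[[], int]:
    """Hypothetical subtask: fails `failures` times, then returns `value`."""
    remaining = [failures]

    def run() -> int:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError(f"subtask {value} failed")
        return value

    return run


def run_whole_task_retry(
    subtasks: List[Callable[[], int]], retries: int, use_cache: bool
) -> Tuple[bool, int, int]:
    """Toy model of whole-map-task retries.

    Returns (succeeded, attempts of the map task, total subtask executions).
    """
    cache: Dict[int, int] = {}
    executions = 0
    for attempt in range(1, retries + 2):
        all_ok = True
        for index, subtask in enumerate(subtasks):
            if use_cache and index in cache:
                continue  # cached output is reused, so the subtask is not re-run
            executions += 1
            try:
                result = subtask()
            except RuntimeError:
                all_ok = False  # one failure costs a retry of the whole map task
                continue
            if use_cache:
                cache[index] = result
        if all_ok:
            return True, attempt, executions
    return False, retries + 1, executions
```

In this model, caching reduces the number of repeated subtask executions, but each failure still consumes a retry of the whole map task, which is why the retry count has to be set so high.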

Expected behavior

I would expect map tasks to retry only the failed subtasks, even if caching is not specified, and to count retries at the subtask level.

For example, if retries are set to 2 and I have 10 subtasks, each subtask can fail twice before the entire map task is counted as Failed. If any subtask fails more times than the retry limit allows, I believe it is reasonable to fail the entire map task and stop any other subtasks currently running.
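The expected semantics can be sketched as a small simulation. This is plain Python, not Flyte code: `MapResult`, `fails_then_succeeds`, and `run_with_subtask_retries` are hypothetical names, and subtasks run sequentially here only to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class MapResult:
    succeeded: bool
    attempts: List[int] = field(default_factory=list)  # attempts used per finished subtask
    failed_index: Optional[int] = None  # subtask that exhausted its retries, if any


def fails_then_succeeds(failures: int) -> Callable[[], None]:
    """Hypothetical subtask that raises `failures` times before succeeding."""
    remaining = [failures]

    def run() -> None:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("subtask failed")

    return run


def run_with_subtask_retries(
    subtasks: List[Callable[[], None]], retries: int
) -> MapResult:
    """Give each subtask its own budget of `retries` retries (retries + 1 attempts)."""
    attempts: List[int] = []
    for index, subtask in enumerate(subtasks):
        for attempt in range(1, retries + 2):
            try:
                subtask()
            except RuntimeError:
                continue  # only this subtask is retried
            attempts.append(attempt)
            break
        else:
            # Retry budget exhausted: fail the map task and stop the remaining subtasks.
            return MapResult(False, attempts, index)
    return MapResult(True, attempts)
```

With `retries=2` and 10 subtasks, a subtask that fails twice still lets the map task succeed, while a subtask that fails three times fails the whole map task.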

georgesnelling · 2021-08-06T21:26:12Z

@EngHabu @katrogan: Is this propeller or admin or both?

EngHabu · 2021-08-09T23:33:57Z

There is precedent for us passing the "retry count" down to the downstream system (e.g. Qubole) to do the retries there... I think the right thing here is to follow that and do the retrying in the K8s Array Plugin. One way to do this is:

The ArrayStatuses we track in the plugin should be expanded by a factor of the retry count...
We should append the attempt number to the Pod name.
We should track logs... etc. separately for separate retries...
The most delicate change is to construct an output writer with the retry attempt and then modify the output assembler so that it knows to pick the outputs of all successful attempts...
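The bookkeeping these steps imply can be sketched roughly as follows. This is an illustration in Python, not the plugin's actual code: `Phase`, `new_statuses`, `pod_name`, `output_prefix`, and `assemble_outputs` are all hypothetical names, and the pod-name and output-path formats are assumptions.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Phase(Enum):
    QUEUED = "queued"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


StatusKey = Tuple[int, int]  # (subtask index, attempt number)


def new_statuses(size: int, retries: int) -> Dict[StatusKey, Phase]:
    """One status slot per (subtask, attempt): size * (retries + 1) entries."""
    return {(i, a): Phase.QUEUED for i in range(size) for a in range(retries + 1)}


def pod_name(execution_id: str, index: int, attempt: int) -> str:
    """Assumed naming scheme: the attempt number is appended to the pod name."""
    return f"{execution_id}-{index}-{attempt}"


def output_prefix(base: str, index: int, attempt: int) -> str:
    """Assumed layout: each attempt writes to its own prefix, so retries never clobber."""
    return f"{base}/{index}/{attempt}"


def assemble_outputs(
    statuses: Dict[StatusKey, Phase], size: int, base: str
) -> List[str]:
    """Pick, for every subtask, the output prefix of its successful attempt."""
    prefixes: List[str] = []
    for index in range(size):
        winners = sorted(
            attempt
            for (i, attempt), phase in statuses.items()
            if i == index and phase is Phase.SUCCEEDED
        )
        if not winners:
            raise ValueError(f"subtask {index} has no successful attempt")
        prefixes.append(output_prefix(base, index, winners[0]))
    return prefixes
```

Keying everything by `(index, attempt)` is what keeps pods, logs, and outputs of different retries separate; the assembler then only needs to know which attempt of each subtask succeeded.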

P.S. We should not use the native Pod retries because there is no way to avoid clobbering the output directory and no way to separate out the logs... Besides, they only restart the failing container, not the entire pod...

migueltol22 added the bug and untriaged labels Jul 22, 2021

kumare3 removed the untriaged label Aug 9, 2021

katrogan mentioned this issue Aug 11, 2021

[K8s-Array ] Retries in array tasks retry the entire array task instead of the subtasks #190

Closed

EngHabu added the bugSquash-Cascade label Sep 1, 2021

EngHabu added this to the 0.18.0 milestone Sep 1, 2021

EngHabu assigned migueltol22 Sep 1, 2021

migueltol22 mentioned this issue Sep 15, 2021

K8s-Array - Retry at the subtask level instead of overall job flyteorg/flyteplugins#210

Closed

8 tasks

eapolinario removed this from the 0.18.0 milestone Oct 6, 2021

kumare3 removed the bugSquash-Cascade label Oct 14, 2021

kumare3 added this to the 0.18.2 milestone Nov 10, 2021

EngHabu assigned eapolinario and EngHabu and unassigned eapolinario Nov 17, 2021

EngHabu modified the milestones: 0.18.2, 0.19.0 - Eagle Nov 24, 2021

EngHabu modified the milestones: 0.19.0 - Eagle, 1.0.0 - Phoenix! Dec 8, 2021

EngHabu modified the milestones: 1.0.0 - Phoenix!, 0.19.0 - Eagle Jan 5, 2022

EngHabu added this to the 0.19.1 - Jan 2021 milestone Jan 5, 2022

EngHabu assigned hamersaw Jan 19, 2022

hamersaw modified the milestones: 0.19.2 - Jan 2021, 0.19.3 - Feb 2021 Jan 26, 2022

hamersaw mentioned this issue Jan 28, 2022

Retry map task subtasks flyteorg/flyteplugins#236

Merged

8 tasks

EngHabu unassigned EngHabu and migueltol22 Feb 2, 2022

EngHabu closed this as completed in flyteorg/flyteplugins#236 Feb 3, 2022
