This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Retry map task subtasks #236

Merged
merged 9 commits into from
Feb 3, 2022

Conversation

Contributor

@hamersaw hamersaw commented Jan 28, 2022

TL;DR

Currently, k8s array tasks (map tasks) retry the entire collection of subtasks when one of them fails. This PR enables retries over individual subtasks.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

A previous PR added tracking of retry attempts for each subtask. Because the current implementation retries all subtasks if one fails, that tracking wasn't particularly useful: every subtask always had the same attempt number. With this PR we enable subtasks to be retried individually, using the retry attempts array to generate unique pod names and log links for each subtask retry.
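A minimal sketch of how a per-subtask retry attempt could feed into pod naming. The attempt-0 branch mirrors the excerpt quoted in the review (minus the DNS-1123 conversion helper, omitted here to stay self-contained); the `-<attempt>` suffix for later attempts and the function name `subtaskPodName` are assumptions for illustration, not the plugin's actual format.

```go
package main

import (
	"fmt"
	"strconv"
)

// subtaskPodName builds a pod name for one attempt of one subtask.
// Attempt 0 keeps the "<parent>-<index>" form so that pods created
// before the change are still matched by name.
// ASSUMPTION: the "-<attempt>" suffix for retries is hypothetical.
func subtaskPodName(parentName string, index int, retryAttempt uint64) string {
	indexStr := strconv.Itoa(index)
	if retryAttempt == 0 {
		return fmt.Sprintf("%v-%v", parentName, indexStr)
	}
	return fmt.Sprintf("%v-%v-%v", parentName, indexStr, retryAttempt)
}

func main() {
	fmt.Println(subtaskPodName("maptask", 3, 0)) // maptask-3
	fmt.Println(subtaskPodName("maptask", 3, 2)) // maptask-3-2
}
```

Keeping the first attempt's name unchanged is what lets already-running map tasks keep tracking their existing pods after an upgrade.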

When a subtask fails, the current implementation reports the failure (as retryable) to the parent map task. With this PR, we instead retry the subtask by incrementing its retry attempt value and transitioning it to the "Undefined" phase, as if it had never been executed. The next evaluation recognizes this and launches another attempt.
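The transition described above can be sketched roughly as follows. The `arrayState` type, the phase names, and `retryFailedSubtasks` are simplified stand-ins invented for this example; the plugin's real state uses its own types.

```go
package main

import "fmt"

// Phase is a simplified, hypothetical stand-in for the plugin's subtask phase.
type Phase string

const (
	PhaseUndefined        Phase = "Undefined"
	PhaseRunning          Phase = "Running"
	PhaseSuccess          Phase = "Success"
	PhaseRetryableFailure Phase = "RetryableFailure"
)

// arrayState tracks one phase and one retry attempt counter per subtask.
type arrayState struct {
	phases        []Phase
	retryAttempts []uint64
}

// retryFailedSubtasks resets every retryably failed subtask to Undefined
// and bumps its attempt counter, leaving all other subtasks untouched.
// The next evaluation would then launch a fresh attempt for those subtasks.
func (s *arrayState) retryFailedSubtasks() {
	for i, p := range s.phases {
		if p == PhaseRetryableFailure {
			s.retryAttempts[i]++
			s.phases[i] = PhaseUndefined
		}
	}
}

func main() {
	s := &arrayState{
		phases:        []Phase{PhaseSuccess, PhaseRetryableFailure, PhaseRunning},
		retryAttempts: []uint64{0, 0, 0},
	}
	s.retryFailedSubtasks()
	fmt.Println(s.phases, s.retryAttempts) // [Success Undefined Running] [0 1 0]
}
```

Only the failed subtask changes state; the succeeded and running subtasks keep their phase and attempt number.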

We use the max attempts defined on the parent map task to determine the number of times a subtask may be executed. Because we track subtask retries internally (rather than within the parent map task), a subtask that exceeds the maximum number of retries is reported to the parent map task as a permanent failure. This ensures that the entire collection of subtasks is not retried again (with each subtask potentially retried several more times). Other retryable failures reported to the map task will still fall through.
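As a rough illustration of the attempt limit, the sketch below assumes that attempts are numbered from 0 and that the max attempts value counts total executions; both are assumptions about the semantics, and `decide` and the outcome names are hypothetical.

```go
package main

import "fmt"

// outcome is a hypothetical label for what happens after a subtask failure.
type outcome string

const (
	retrySubtask     outcome = "retry"
	permanentFailure outcome = "permanent-failure"
)

// decide returns what to do after attempt number retryAttempt has failed.
// ASSUMPTION: attempts start at 0 and maxAttempts is the total number of
// executions allowed, so the last permitted attempt is maxAttempts-1.
func decide(retryAttempt, maxAttempts uint64) outcome {
	if retryAttempt+1 < maxAttempts {
		return retrySubtask
	}
	// Reporting a permanent failure keeps the parent map task from
	// retrying the whole collection of subtasks again.
	return permanentFailure
}

func main() {
	for attempt := uint64(0); attempt < 3; attempt++ {
		fmt.Printf("attempt %d failed -> %s\n", attempt, decide(attempt, 3))
	}
}
```

With a limit of 3, failures of attempts 0 and 1 lead to a retry and a failure of attempt 2 is permanent. A `<=` comparison in place of `<` would allow one execution too many.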

Tracking Issue

fixes flyteorg/flyte#1276

Follow-up issue

  • [BUG] Flyte K8s Array jobs do not take into account Interruptible failures flyte#1533: This solution does not fix any issues with interruptible nodes within k8s array tasks; those will have to be addressed separately. Discerning what is an interrupted failure is very difficult: the current approach handles all "system" failures as interrupted and enables a different, configurable number of retries. We might need to track the failure type as well?


codecov bot commented Jan 28, 2022

Codecov Report

Merging #236 (38e293a) into master (0637c34) will increase coverage by 0.48%.
The diff coverage is 91.83%.


@@            Coverage Diff             @@
##           master     #236      +/-   ##
==========================================
+ Coverage   62.45%   62.94%   +0.48%     
==========================================
  Files         142      142              
  Lines        8845     8862      +17     
==========================================
+ Hits         5524     5578      +54     
+ Misses       2818     2785      -33     
+ Partials      503      499       -4     
Flag       Coverage Δ
unittests  62.60% <91.48%> (+0.55%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                                   Coverage Δ
go/tasks/plugins/array/k8s/executor.go           38.73% <0.00%> (ø)
go/tasks/plugins/array/k8s/task.go               54.91% <87.50%> (-4.41%) ⬇️
go/tasks/plugins/array/k8s/monitor.go            71.26% <88.88%> (+7.28%) ⬆️
go/tasks/plugins/array/core/state.go             71.09% <100.00%> (+19.92%) ⬆️
go/tasks/plugins/array/k8s/launcher.go           45.71% <100.00%> (+7.00%) ⬆️
go/tasks/plugins/k8s/pod/plugin.go               88.88% <0.00%> (-0.14%) ⬇️
go/tasks/pluginmachinery/flytek8s/pod_helper.go  77.66% <0.00%> (+1.25%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@hamersaw hamersaw requested review from EngHabu and kumare3 January 28, 2022 18:41
Contributor

@EngHabu EngHabu left a comment

I think it looks good! let's just make it backward compatible and do some testing in our demo environment...

go/tasks/plugins/array/k8s/launcher.go (outdated, resolved)
@hamersaw hamersaw requested a review from EngHabu February 1, 2022 17:12
EngHabu previously approved these changes Feb 1, 2022
Contributor

@EngHabu EngHabu left a comment

Looking great!

Comment on lines +42 to +44
if retryAttempt == 0 {
	return utils.ConvertToDNS1123SubdomainCompatibleString(fmt.Sprintf("%v-%v", parentName, indexStr))
}
Contributor

Cool!

Contributor

EngHabu commented Feb 1, 2022

Can we add some more unit tests to up the coverage for this patch?

Signed-off-by: Daniel Rammer <[email protected]>
Contributor Author

hamersaw commented Feb 1, 2022

> Can we add some more unit tests to up the coverage for this patch?

Done.

@hamersaw hamersaw requested a review from EngHabu February 1, 2022 20:51
@EngHabu EngHabu merged commit 14ed6a8 into master Feb 3, 2022
eapolinario pushed a commit that referenced this pull request Sep 6, 2023
* handling phase transitions and retry attempts to retry only failed subtasks

Signed-off-by: Daniel Rammer <[email protected]>

* fixed tests and linter

Signed-off-by: Daniel Rammer <[email protected]>

* added subtask retry attempt to log link id

Signed-off-by: Daniel Rammer <[email protected]>

* fixed allowing 1 more retry than the maximum number of attempts

Signed-off-by: Daniel Rammer <[email protected]>

* fixed lint issues

Signed-off-by: Daniel Rammer <[email protected]>

* updating podName generation to ensure backwards compatibility.

Signed-off-by: Daniel Rammer <[email protected]>

* fixed lint

Signed-off-by: Daniel Rammer <[email protected]>

* using existing retryAttempt number when transitioning running tasks to use the subtask retry attempts array

Signed-off-by: Daniel Rammer <[email protected]>

* added unit tests

Signed-off-by: Daniel Rammer <[email protected]>
Development

Successfully merging this pull request may close these issues.

[BUG] Flyte Map Tasks should only retry subtasks