-
Notifications
You must be signed in to change notification settings - Fork 53
Conversation
…btasks Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #236 +/- ##
==========================================
+ Coverage 62.45% 62.94% +0.48%
==========================================
Files 142 142
Lines 8845 8862 +17
==========================================
+ Hits 5524 5578 +54
+ Misses 2818 2785 -33
+ Partials 503 499 -4
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it looks good! let's just make it backward compatible and do some testing in our demo environment...
Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
…sing subtasks retry attempt array Signed-off-by: Daniel Rammer <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great!
if retryAttempt == 0 { | ||
return utils.ConvertToDNS1123SubdomainCompatibleString(fmt.Sprintf("%v-%v", parentName, indexStr)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool!
Can we add some more unit tests to up the coverage for this patch? |
Signed-off-by: Daniel Rammer <[email protected]>
Done. |
* handling phase transitions and retry attempts to retry only failed subtasks Signed-off-by: Daniel Rammer <[email protected]> * fixed tests and linter Signed-off-by: Daniel Rammer <[email protected]> * added subtask retry attempt to log link id Signed-off-by: Daniel Rammer <[email protected]> * fixed allowing 1 more retry than the maximum number of attempts Signed-off-by: Daniel Rammer <[email protected]> * fixed lint issues Signed-off-by: Daniel Rammer <[email protected]> * updating podName generation to ensure backwards compatibility. Signed-off-by: Daniel Rammer <[email protected]> * fixed lint Signed-off-by: Daniel Rammer <[email protected]> * using existing retryAttempt number when transition running tasks to using subtasks retry attempt array Signed-off-by: Daniel Rammer <[email protected]> * added unit tests Signed-off-by: Daniel Rammer <[email protected]>
TL;DR
Currently, k8s array tasks (map tasks) retry the entire collection of subtasks when one of them fails. This PR enables retries over individual subtasks.
Type
Are all requirements met?
Complete description
A previous PR added tracking of retry attempts for each subtask. As the current implementation retries all subtasks if one fails, this wasn't particularly useful because they were all the same. With this PR we enable subtasks to be retried individually, using this retry attempts array to inform unique pod names and logs for each subtask retry.
When a subtask fails, the current implementation reports the failure (as retryable) to the parent map task. Now we attempt to retry by incrementing the retry attempt value and transitioning the subtask to the "Undefined" phase as if it had never been executed. This is recognized during the next evaluation and another attempt is executed.
We use the max attempts defined on the parent map task to determine the number of times a subtask may be executed. As we track subtask retries internally (rather than within the parent map task), if a subtask exceeds the maximum number of retries we report a permanent failure to the parent map task. This ensure that the entire collection of subtasks is not retried again (with each subtask potentially retried multiple more times). Other retryable failures reported to the map task will still fallthrough.
Tracking Issue
fixes flyteorg/flyte#1276
Follow-up issue