Transition to Queue if the JobCondition is empty #387
Conversation
Signed-off-by: Kevin Su <[email protected]>
Codecov Report
Patch coverage:
Additional details and impacted files

@@            Coverage Diff             @@
##           master     #387      +/-   ##
==========================================
+ Coverage   63.00%   64.18%   +1.17%
==========================================
  Files         154      156       +2
  Lines       13084    10643    -2441
==========================================
- Hits         8244     6831    -1413
+ Misses       4222     3191    -1031
- Partials      618      621       +3
@fg91 @yubofredwang mind taking a look?
@pingsutw without this fix, if the kf operator is not deployed, the same failure will occur, right? Namely, there will be no conditions because no operator picks up the CR. However, with this change, if the operator is not deployed, the task will be stuck forever in the Queued state.
I can confirm that our propeller logs are full of lines such as:

E0816 12:20:37.986535 1 workers.go:102] error syncing 'development/acf2v4xvgh6f84qncfnx': failed at Node[n3]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [pytorch]: found no current condition. Conditions: []

That being said, our pytorch tasks don't fail because of this, as shown in your first screenshot. Is my understanding correct that:
@fg91 correct. Which version of the PyTorch operator were you using? It seems like the error only happens when using pytorch-operator 1.5+.
The PR itself looks good. However, can we use StartTime as an indication of whether the operator is installed?
Yes, we use 1.5+ 👍
app.status is None when the task is started.
I was thinking maybe we can leverage
I think that's a good idea, and we should make the timeout configurable. wdyt @hamersaw
Looks perfect! Thanks for the quick implementation!
@@ -231,6 +231,9 @@ func (pytorchOperatorResourceHandler) GetTaskPhase(_ context.Context, pluginCont
 		return pluginsCore.PhaseInfoUndefined, err
 	}

+	if app.Status.StartTime == nil && app.CreationTimestamp.Add(common.GetConfig().Timeout.Duration).Before(time.Now()) {
+		return pluginsCore.PhaseInfoUndefined, fmt.Errorf("kubeflow operator hasn't updated the pytorch coustum resource since creation time %v", app.CreationTimestamp)
"coustum" -> "custom"
same in tensorflow / mpi
TL;DR
The plugin manager keeps trying to recreate the custom resource, since ExtractCurrentCondition returns an error when the condition list is empty.
Create CR -> condition is empty -> return error -> retry -> retry limit exceeded -> fail.
I've seen this error many times; it seems like it only happens when using the training operator 1.5.0+.
Type
Are all requirements met?
Complete description
Tracking Issue
NA
Follow-up issue
NA