Skip to content
This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Transition to Queue if the JobCondition is empty #387

Merged
merged 12 commits into from
Sep 1, 2023
Merged

Conversation

pingsutw
Copy link
Member

TL;DR

image

The plugin manager keeps trying to recreate CRD since ExtractCurrentCondition returns an error when the condition is empty.

Create CRD -> Condiction is empty -> return error -> retry -> retry limit exceeds -> fail.

I've seen this error many times, it seems like it only happens when using the training operator 1.5.0+.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

image

Tracking Issue

NA

Follow-up issue

NA

Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
@codecov
Copy link

codecov bot commented Aug 12, 2023

Codecov Report

Patch coverage: 23.68% and project coverage change: +1.17% 🎉

Comparison is base (03f5b5c) 63.00% compared to head (3a6de3b) 64.18%.
Report is 2 commits behind head on master.

❗ Current head 3a6de3b differs from pull request most recent head d5e4b59. Consider uploading reports for the commit d5e4b59 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #387      +/-   ##
==========================================
+ Coverage   63.00%   64.18%   +1.17%     
==========================================
  Files         154      156       +2     
  Lines       13084    10643    -2441     
==========================================
- Hits         8244     6831    -1413     
+ Misses       4222     3191    -1031     
- Partials      618      621       +3     
Flag Coverage Δ
unittests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
go/tasks/plugins/k8s/kfoperators/common/config.go 0.00% <0.00%> (ø)
...o/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go 78.80% <0.00%> (+0.11%) ⬆️
...s/plugins/k8s/kfoperators/tensorflow/tensorflow.go 76.71% <0.00%> (+0.98%) ⬆️
...sks/plugins/k8s/kfoperators/common/config_flags.go 17.39% <17.39%> (ø)
go/tasks/plugins/k8s/kfoperators/mpi/mpi.go 73.97% <33.33%> (+1.40%) ⬆️
.../plugins/k8s/kfoperators/common/common_operator.go 68.45% <100.00%> (+4.13%) ⬆️

... and 131 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Kevin Su <[email protected]>
@hamersaw
Copy link
Contributor

@fg91 @yubofredwang mind taking a look?

@hamersaw
Copy link
Contributor

@pingsutw without this fix, if the kf operator is not deployed the same failure will occur right? namely there will be no conditions because no operator picks up the CR. However, with this change, if the operator is not deployed the task will be forever stuck in the QUEUED phase right?

@fg91
Copy link
Member

fg91 commented Aug 16, 2023

I can confirm that our propeller logs are full of lines such as:

E0816 12:20:37.986535       1 workers.go:102] error syncing 'development/acf2v4xvgh6f84qncfnx': failed at Node[n3]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [pytorch]: found no current condition. Conditions: []

That being said, our pytorch tasks don't fail because of this as shown in your first screen shot.

Is my understanding correct that:

  1. If the operator is installed, the "empty conditions" logs are temporary until the task starts and once there are conditions, the current error message (or in the future the "queued" job status) goes away?
  2. If the operator is not installed, currently propeller keeps logging "no current conditions found" an in the future will keep logging "queued"?

@pingsutw
Copy link
Member Author

@fg91 correct. Which version of Pytorch operator you was using? seems like the error only happens when using pytorch-operator 1.5+

@yubofredwang
Copy link
Contributor

yubofredwang commented Aug 17, 2023

The PR itself looks good. However, can we use StartTime as an indication of whether operator is installed?
Like if the StartTime is set, then that means the operator is not missing. We are only waiting for conditions to be updated.

@fg91
Copy link
Member

fg91 commented Aug 17, 2023

@fg91 correct. Which version of Pytorch operator you was using? seems like the error only happens when using pytorch-operator 1.5+

Yes, we use 1.5+ 👍

@pingsutw
Copy link
Member Author

The PR itself looks good. However, can we use StartTime as an indication of whether operator is installed?

app.status is None when the task is started.

@yubofredwang
Copy link
Contributor

yubofredwang commented Aug 21, 2023

The PR itself looks good. However, can we use StartTime as an indication of whether operator is installed?

app.status is None when the task is started.

I was thinking maybe we can leverage objectMeta.CreationTimestamp and set a timeout for the creation of the TFJob resource? Something like:

if StartTime is null &&  time.now() - objectMeta.CreationTimestamp > 1min:
     raise Error

@pingsutw
Copy link
Member Author

I think that's a good idea, and we should make timeout configurable. wdyt @hamersaw

Signed-off-by: Kevin Su <[email protected]>
@pingsutw
Copy link
Member Author

updated it.
image

@fg91
Copy link
Member

fg91 commented Aug 22, 2023

updated it. image

Nice descriptive error message 👌
(Only nit-pick: most of - at least our - users won't know what CR means, let's maybe not abbreviate)

@yubofredwang
Copy link
Contributor

Looks perfect! Thanks for the quick implementation!

Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
@@ -231,6 +231,9 @@ func (pytorchOperatorResourceHandler) GetTaskPhase(_ context.Context, pluginCont
return pluginsCore.PhaseInfoUndefined, err
}

if app.Status.StartTime == nil && app.CreationTimestamp.Add(common.GetConfig().Timeout.Duration).Before(time.Now()) {
return pluginsCore.PhaseInfoUndefined, fmt.Errorf("kubeflow operator hasn't updated the pytorch coustum resource since creation time %v", app.CreationTimestamp)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"coustum" -> "custom"

same in tensorflow / mpi

go/tasks/plugins/k8s/kfoperators/common/config.go Outdated Show resolved Hide resolved
Signed-off-by: Kevin Su <[email protected]>
@pingsutw pingsutw merged commit 552f145 into master Sep 1, 2023
eapolinario pushed a commit that referenced this pull request Sep 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants