Transition to Queue if the JobCondition is empty #387
Conversation
Signed-off-by: Kevin Su <[email protected]>
Codecov Report
Patch coverage:
Additional details and impacted files

@@            Coverage Diff             @@
##           master     #387      +/-   ##
==========================================
+ Coverage   63.00%   64.18%   +1.17%
==========================================
  Files         154      156       +2
  Lines       13084    10643    -2441
==========================================
- Hits         8244     6831    -1413
+ Misses       4222     3191    -1031
- Partials      618      621       +3
@fg91 @yubofredwang mind taking a look?
@pingsutw without this fix, if the kf operator is not deployed, the same failure will occur, right? Namely, there will be no conditions because no operator picks up the CR. However, with this change, if the operator is not deployed, the task will be stuck forever in the Queued state.
I can confirm that our propeller logs are full of lines such as:

E0816 12:20:37.986535 1 workers.go:102] error syncing 'development/acf2v4xvgh6f84qncfnx': failed at Node[n3]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [pytorch]: found no current condition. Conditions: []

That being said, our pytorch tasks don't fail because of this, as shown in your first screenshot. Is my understanding correct that:
@fg91 correct. Which version of the PyTorch operator were you using? It seems like the error only happens when using pytorch-operator 1.5+.
The PR itself looks good. However, can we use StartTime as an indication of whether the operator is installed?
Yes, we use 1.5+ 👍
app.status is None when the task is started.
I was thinking maybe we can leverage
I think that's a good idea, and we should make the timeout configurable. wdyt @hamersaw
Looks perfect! Thanks for the quick implementation!
@@ -231,6 +231,9 @@ func (pytorchOperatorResourceHandler) GetTaskPhase(_ context.Context, pluginCont
 		return pluginsCore.PhaseInfoUndefined, err
 	}

+	if app.Status.StartTime == nil && app.CreationTimestamp.Add(common.GetConfig().Timeout.Duration).Before(time.Now()) {
+		return pluginsCore.PhaseInfoUndefined, fmt.Errorf("kubeflow operator hasn't updated the pytorch coustum resource since creation time %v", app.CreationTimestamp)
"coustum" -> "custom"
same in tensorflow / mpi
TL;DR
The plugin manager keeps trying to recreate the custom resource, since ExtractCurrentCondition returns an error when the condition list is empty.
Create CR -> condition is empty -> return error -> retry -> retry limit exceeded -> fail.
I've seen this error many times; it seems like it only happens when using the training operator 1.5.0+.
Type
Are all requirements met?
Complete description
Tracking Issue
NA
Follow-up issue
NA