
[BUG] Pod is not deleted if task failed due to "Back-off pulling image.." and might auto-restart in the background #3239

Closed
flixr opened this issue Jan 17, 2023 · 1 comment
flixr (Contributor) commented Jan 17, 2023

Describe the bug

I just had a failed task (due to an image pull back-off, since some pull credentials were missing in that namespace), but the Pod was not removed.
Once I added the pull credentials we relaunched the task, but now there were two Pods running:
the "old" one, which was marked as failed in Flyte, and the newly relaunched one.

I had to manually remove the Pod that initially failed but was started again by k8s once the pull secret was there.
Unless a cluster admin looks into this, normal Flyte users have no way of noticing it or doing anything about it.
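A minimal sketch of the manual cleanup step described above. It assumes flytepropeller labels task pods with an `execution-id` label (an assumption about Flyte's pod labeling), and the pod list is stubbed so the example is self-contained; in a real cluster you would list pods via kubectl or the Kubernetes API:

```python
# Sketch of the manual cleanup described above: select the orphaned pods
# that belong to the failed execution so an admin can delete them by hand.
# Assumption: flytepropeller labels task pods with an `execution-id` label;
# the pod list here is stubbed so the example is self-contained.

def pods_to_delete(pods, execution_id):
    """Return names of pods labeled with the failed execution's id."""
    return [p["name"] for p in pods
            if p["labels"].get("execution-id") == execution_id]

pods = [
    {"name": "af89l2cwzfjfnjpbxqfh-trainaae-0",
     "labels": {"execution-id": "af89l2cwzfjfnjpbxqfh"}},
    {"name": "unrelated-pod", "labels": {"execution-id": "other"}},
]
print(pods_to_delete(pods, "af89l2cwzfjfnjpbxqfh"))
# ['af89l2cwzfjfnjpbxqfh-trainaae-0']
```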

Snippet of the flytepropeller log:

{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","node":"trainaae","ns":"cadmatch-training-development","res_ver":"44393559","routine":"worker-0","wf":"cadmatch-training:development:.flytegen.train_aae"},"level":"warning","msg":"Failed to record taskEvent, error [EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = invalid phase change from FAILED to ABORTED for task execution {resource_type:TASK project:\"cadmatch-training\" domain:\"development\" name:\"train_aae\" version:\"2fbd809\"  node_id:\"trainaae\" execution_id:\u003cproject:\"cadmatch-training\" domain:\"development\" name:\"af89l2cwzfjfnjpbxqfh\" \u003e  0 {} [] 0}]]. Trying to record state: ABORTED. Ignoring this error!","ts":"2023-01-17T14:26:57Z"}
{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","node":"trainaae","ns":"cadmatch-training-development","res_ver":"44393628","routine":"worker-1","wf":"cadmatch-training:development:.flytegen.train_aae"},"level":"warning","msg":"Failed to record taskEvent, error [EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = invalid phase change from FAILED to ABORTED for task execution {resource_type:TASK project:\"cadmatch-training\" domain:\"development\" name:\"train_aae\" version:\"2fbd809\"  node_id:\"trainaae\" execution_id:\u003cproject:\"cadmatch-training\" domain:\"development\" name:\"af89l2cwzfjfnjpbxqfh\" \u003e  1 {} [] 0}]]. Trying to record state: ABORTED. Ignoring this error!","ts":"2023-01-17T14:27:02Z"}
{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","node":"trainaae","ns":"cadmatch-training-development","res_ver":"44393690","routine":"worker-0","wf":"cadmatch-training:development:.flytegen.train_aae"},"level":"warning","msg":"Failed to record taskEvent, error [EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = invalid phase change from FAILED to ABORTED for task execution {resource_type:TASK project:\"cadmatch-training\" domain:\"development\" name:\"train_aae\" version:\"2fbd809\"  node_id:\"trainaae\" execution_id:\u003cproject:\"cadmatch-training\" domain:\"development\" name:\"af89l2cwzfjfnjpbxqfh\" \u003e  2 {} [] 0}]]. Trying to record state: ABORTED. Ignoring this error!","ts":"2023-01-17T14:27:08Z"}
{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","node":"trainaae","ns":"cadmatch-training-development","res_ver":"44393750","routine":"worker-1","wf":"cadmatch-training:development:.flytegen.train_aae"},"level":"warning","msg":"Trying to abort a node in state [Failed]","ts":"2023-01-17T14:27:14Z"}
{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","ns":"cadmatch-training-development","routine":"worker-0"},"level":"warning","msg":"Workflow namespace[cadmatch-training-development]/name[af89l2cwzfjfnjpbxqfh] has already been terminated.","ts":"2023-01-17T14:27:48Z"}
{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","ns":"cadmatch-training-development","routine":"worker-1"},"level":"warning","msg":"Workflow namespace[cadmatch-training-development]/name[af89l2cwzfjfnjpbxqfh] has already been terminated.","ts":"2023-01-17T14:28:02Z"}

Expected behavior

If a task fails, the Pod should be stopped or removed so that k8s does not automatically restart it in the background without Flyte noticing.
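The missing cleanup step could be sketched as follows. This is a hypothetical illustration, not flytepropeller's actual code; `delete_pod` stands in for a Kubernetes delete-pod API call and is injected so the sketch stays self-contained:

```python
# Hypothetical sketch (not flytepropeller's actual code) of the expected
# behavior: once a task reaches a terminal phase, delete its pod so kubelet
# cannot keep retrying the image pull and restart it in the background.

TERMINAL_PHASES = {"FAILED", "ABORTED", "SUCCEEDED"}

def finalize_task(phase, pod_name, delete_pod):
    """Delete the task's pod iff the task is terminal; return True if deleted."""
    if phase not in TERMINAL_PHASES:
        return False
    delete_pod(pod_name)  # stand-in for a Kubernetes delete-pod API call
    return True

deleted = []
finalize_task("FAILED", "af89l2cwzfjfnjpbxqfh-trainaae-0", deleted.append)
print(deleted)  # ['af89l2cwzfjfnjpbxqfh-trainaae-0']
```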

Additional context to reproduce

  1. Launch a task with an image for which pull credentials are missing, resulting in a back-off pulling the image.
  2. Wait until the workflow/task is marked as failed in Flyte.
  3. The Pod is still there, but should have been removed.


@flixr flixr added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Jan 17, 2023
@flixr flixr changed the title [BUG] Pod is not deleted if task failed due to "Back-off pulling image.." [BUG] Pod is not deleted if task failed due to "Back-off pulling image.." and might auto-restart in the background Jan 17, 2023

ahlgol commented Jan 17, 2023

I had a very similar issue in a demo cluster today when the container_image was misspelled. The task was marked as failed, but the pod was stuck retrying the pull.

Since the pod still consumed resources, subsequent tasks were stuck as "queued" without any explanation that the resource request limit had been reached (it was visible in the k8s event log, however).
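The resource-accounting effect described above can be illustrated with a toy scheduler check; the capacity and request numbers are made up:

```python
# Toy illustration (made-up numbers) of the observation above: the stuck
# pod's resource request still counts against cluster capacity, so the
# next task cannot be scheduled and sits "queued" in Flyte.

def can_schedule(capacity_cpu, running_requests, new_request):
    """True if the new pod's CPU request fits into the remaining capacity."""
    return sum(running_requests) + new_request <= capacity_cpu

# 4-CPU node; the failed-but-undeleted pod still requests 3 CPUs.
print(can_schedule(4.0, [3.0], 2.0))  # False -> new task stays Pending/queued
# After the orphaned pod is deleted, its request is released:
print(can_schedule(4.0, [], 2.0))     # True
```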

@hamersaw hamersaw removed the untriaged This issues has not yet been looked at by the Maintainers label Jan 31, 2023
@hamersaw hamersaw self-assigned this Jan 31, 2023
@hamersaw hamersaw added this to the 1.4.0 milestone Jan 31, 2023
@cosmicBboy cosmicBboy modified the milestones: 1.4.0, 1.5.0 Mar 1, 2023
@cosmicBboy cosmicBboy modified the milestones: 1.5.0, 1.6.0 Apr 20, 2023