
[BUG] Pod is not deleted if task failed due to "Back-off pulling image.." and might auto-restart in the background #3239

Closed
flixr opened this issue Jan 17, 2023 · 1 comment
flixr (Contributor) commented Jan 17, 2023

Describe the bug

I just had a failed task (due to an image pull back-off, since some pull credentials were missing in that namespace), but the Pod was not removed.
Once I added the pull credentials we relaunched the task, but now there were two Pods running:
the "old" one, which was marked as failed in Flyte, and the newly relaunched one.

I had to manually remove the Pod that initially failed but was started again by k8s once the pull secret was there.
Unless a cluster admin looks into this, normal Flyte users have no way of noticing it or doing anything about it.
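A minimal sketch of the manual cleanup step described above. It assumes flytepropeller labels task pods with an `execution-id` label (an assumption about Flyte's pod labeling), and the pod list is stubbed so the example is self-contained; in a real cluster you would list pods via kubectl or the Kubernetes API:

```python
# Sketch of the manual cleanup described above: select the orphaned pods
# that belong to the failed execution so an admin can delete them by hand.
# Assumption: flytepropeller labels task pods with an `execution-id` label;
# the pod list here is stubbed so the example is self-contained.

def pods_to_delete(pods, execution_id):
    """Return names of pods labeled with the failed execution's id."""
    return [p["name"] for p in pods
            if p["labels"].get("execution-id") == execution_id]

pods = [
    {"name": "af89l2cwzfjfnjpbxqfh-trainaae-0",
     "labels": {"execution-id": "af89l2cwzfjfnjpbxqfh"}},
    {"name": "unrelated-pod", "labels": {"execution-id": "other"}},
]
print(pods_to_delete(pods, "af89l2cwzfjfnjpbxqfh"))
# ['af89l2cwzfjfnjpbxqfh-trainaae-0']
```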

Snippet of the flytepropeller log:

{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","node":"trainaae","ns":"cadmatch-training-development","res_ver":"44393559","routine":"worker-0","wf":"cadmatch-training:development:.flytegen.train_aae"},"level":"warning","msg":"Failed to record taskEvent, error [EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = invalid phase change from FAILED to ABORTED for task execution {resource_type:TASK project:\"cadmatch-training\" domain:\"development\" name:\"train_aae\" version:\"2fbd809\"  node_id:\"trainaae\" execution_id:\u003cproject:\"cadmatch-training\" domain:\"development\" name:\"af89l2cwzfjfnjpbxqfh\" \u003e  0 {} [] 0}]]. Trying to record state: ABORTED. Ignoring this error!","ts":"2023-01-17T14:26:57Z"}
{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","node":"trainaae","ns":"cadmatch-training-development","res_ver":"44393628","routine":"worker-1","wf":"cadmatch-training:development:.flytegen.train_aae"},"level":"warning","msg":"Failed to record taskEvent, error [EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = invalid phase change from FAILED to ABORTED for task execution {resource_type:TASK project:\"cadmatch-training\" domain:\"development\" name:\"train_aae\" version:\"2fbd809\"  node_id:\"trainaae\" execution_id:\u003cproject:\"cadmatch-training\" domain:\"development\" name:\"af89l2cwzfjfnjpbxqfh\" \u003e  1 {} [] 0}]]. Trying to record state: ABORTED. Ignoring this error!","ts":"2023-01-17T14:27:02Z"}
{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","node":"trainaae","ns":"cadmatch-training-development","res_ver":"44393690","routine":"worker-0","wf":"cadmatch-training:development:.flytegen.train_aae"},"level":"warning","msg":"Failed to record taskEvent, error [EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = invalid phase change from FAILED to ABORTED for task execution {resource_type:TASK project:\"cadmatch-training\" domain:\"development\" name:\"train_aae\" version:\"2fbd809\"  node_id:\"trainaae\" execution_id:\u003cproject:\"cadmatch-training\" domain:\"development\" name:\"af89l2cwzfjfnjpbxqfh\" \u003e  2 {} [] 0}]]. Trying to record state: ABORTED. Ignoring this error!","ts":"2023-01-17T14:27:08Z"}
{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","node":"trainaae","ns":"cadmatch-training-development","res_ver":"44393750","routine":"worker-1","wf":"cadmatch-training:development:.flytegen.train_aae"},"level":"warning","msg":"Trying to abort a node in state [Failed]","ts":"2023-01-17T14:27:14Z"}
{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","ns":"cadmatch-training-development","routine":"worker-0"},"level":"warning","msg":"Workflow namespace[cadmatch-training-development]/name[af89l2cwzfjfnjpbxqfh] has already been terminated.","ts":"2023-01-17T14:27:48Z"}
{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","ns":"cadmatch-training-development","routine":"worker-1"},"level":"warning","msg":"Workflow namespace[cadmatch-training-development]/name[af89l2cwzfjfnjpbxqfh] has already been terminated.","ts":"2023-01-17T14:28:02Z"}

Expected behavior

If a task fails, the Pod should be stopped or removed so that k8s does not automatically restart it in the background without Flyte noticing.
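The missing cleanup step could be sketched as follows. This is a hypothetical illustration, not flytepropeller's actual code; `delete_pod` stands in for a Kubernetes delete-pod API call and is injected so the sketch stays self-contained:

```python
# Hypothetical sketch (not flytepropeller's actual code) of the expected
# behavior: once a task reaches a terminal phase, delete its pod so kubelet
# cannot keep retrying the image pull and restart it in the background.

TERMINAL_PHASES = {"FAILED", "ABORTED", "SUCCEEDED"}

def finalize_task(phase, pod_name, delete_pod):
    """Delete the task's pod iff the task is terminal; return True if deleted."""
    if phase not in TERMINAL_PHASES:
        return False
    delete_pod(pod_name)  # stand-in for a Kubernetes delete-pod API call
    return True

deleted = []
finalize_task("FAILED", "af89l2cwzfjfnjpbxqfh-trainaae-0", deleted.append)
print(deleted)  # ['af89l2cwzfjfnjpbxqfh-trainaae-0']
```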

Additional context to reproduce

  1. Launch a task with an image for which pull credentials are missing, resulting in a back-off pulling the image.
  2. Wait until the workflow/task is marked as failed in Flyte.
  3. The Pod is still there, but should have been removed.


@flixr flixr added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Jan 17, 2023
@flixr flixr changed the title [BUG] Pod is not deleted if task failed due to "Back-off pulling image.." [BUG] Pod is not deleted if task failed due to "Back-off pulling image.." and might auto-restart in the background Jan 17, 2023

ahlgol commented Jan 17, 2023

I had a very similar issue in a demo cluster today when the container_image was misspelled. The task was marked as failed, but the pod was stuck retrying the pull.

Since the pod still consumed resources, subsequent tasks were stuck as "queued" without any explanation that the resource request limit had been reached (it was visible in the k8s event log, however).
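The resource-accounting effect described above can be illustrated with a toy scheduler check; the capacity and request numbers are made up:

```python
# Toy illustration (made-up numbers) of the observation above: the stuck
# pod's resource request still counts against cluster capacity, so the
# next task cannot be scheduled and sits "queued" in Flyte.

def can_schedule(capacity_cpu, running_requests, new_request):
    """True if the new pod's CPU request fits into the remaining capacity."""
    return sum(running_requests) + new_request <= capacity_cpu

# 4-CPU node; the failed-but-undeleted pod still requests 3 CPUs.
print(can_schedule(4.0, [3.0], 2.0))  # False -> new task stays Pending/queued
# After the orphaned pod is deleted, its request is released:
print(can_schedule(4.0, [], 2.0))     # True
```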

@hamersaw hamersaw removed the untriaged This issues has not yet been looked at by the Maintainers label Jan 31, 2023
@hamersaw hamersaw self-assigned this Jan 31, 2023
@hamersaw hamersaw added this to the 1.4.0 milestone Jan 31, 2023
@cosmicBboy cosmicBboy modified the milestones: 1.4.0, 1.5.0 Mar 1, 2023
@cosmicBboy cosmicBboy modified the milestones: 1.5.0, 1.6.0 Apr 20, 2023