Describe the bug
I just had a task fail due to an image pull back-off (some pull creds were missing in that namespace), but the Pod was not removed.
Once I added the pull creds we relaunched the task, but then there were two Pods running:
the "old" one, which was already marked as failed in Flyte, and the newly relaunched one.
I had to manually remove the Pod that initially failed but was started again in k8s once the pull secret was there.
Unless the cluster admin takes a look, normal Flyte users have no way of noticing this or doing anything about it.
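For reference, this is roughly the manual cleanup that was needed. A minimal sketch using the official kubernetes Python client; the execution-id label selector is an assumption (check the labels FlytePropeller put on your task pods), and the namespace and execution id are simply the ones from this run:

from kubernetes import client, config

# Needs cluster access via kubeconfig, which is exactly the problem:
# normal Flyte users typically don't have this.
config.load_kube_config()
v1 = client.CoreV1Api()

namespace = "cadmatch-training-development"
# Assumption: FlytePropeller labels task pods with the execution id.
stale = v1.list_namespaced_pod(namespace, label_selector="execution-id=af89l2cwzfjfnjpbxqfh")

for pod in stale.items:
    print("deleting stale pod", pod.metadata.name, "phase:", pod.status.phase)
    v1.delete_namespaced_pod(pod.metadata.name, namespace)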
Snip of flytepropeller log:
{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","node":"trainaae","ns":"cadmatch-training-development","res_ver":"44393559","routine":"worker-0","wf":"cadmatch-training:development:.flytegen.train_aae"},"level":"warning","msg":"Failed to record taskEvent, error [EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = invalid phase change from FAILED to ABORTED for task execution {resource_type:TASK project:\"cadmatch-training\" domain:\"development\" name:\"train_aae\" version:\"2fbd809\" node_id:\"trainaae\" execution_id:\u003cproject:\"cadmatch-training\" domain:\"development\" name:\"af89l2cwzfjfnjpbxqfh\"\u003e 0 {} [] 0}]]. Trying to record state: ABORTED. Ignoring this error!","ts":"2023-01-17T14:26:57Z"}
{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","node":"trainaae","ns":"cadmatch-training-development","res_ver":"44393628","routine":"worker-1","wf":"cadmatch-training:development:.flytegen.train_aae"},"level":"warning","msg":"Failed to record taskEvent, error [EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = invalid phase change from FAILED to ABORTED for task execution {resource_type:TASK project:\"cadmatch-training\" domain:\"development\" name:\"train_aae\" version:\"2fbd809\" node_id:\"trainaae\" execution_id:\u003cproject:\"cadmatch-training\" domain:\"development\" name:\"af89l2cwzfjfnjpbxqfh\"\u003e 1 {} [] 0}]]. Trying to record state: ABORTED. Ignoring this error!","ts":"2023-01-17T14:27:02Z"}
{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","node":"trainaae","ns":"cadmatch-training-development","res_ver":"44393690","routine":"worker-0","wf":"cadmatch-training:development:.flytegen.train_aae"},"level":"warning","msg":"Failed to record taskEvent, error [EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = invalid phase change from FAILED to ABORTED for task execution {resource_type:TASK project:\"cadmatch-training\" domain:\"development\" name:\"train_aae\" version:\"2fbd809\" node_id:\"trainaae\" execution_id:\u003cproject:\"cadmatch-training\" domain:\"development\" name:\"af89l2cwzfjfnjpbxqfh\"\u003e 2 {} [] 0}]]. Trying to record state: ABORTED. Ignoring this error!","ts":"2023-01-17T14:27:08Z"}
{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","node":"trainaae","ns":"cadmatch-training-development","res_ver":"44393750","routine":"worker-1","wf":"cadmatch-training:development:.flytegen.train_aae"},"level":"warning","msg":"Trying to abort a node in state [Failed]","ts":"2023-01-17T14:27:14Z"}
{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","ns":"cadmatch-training-development","routine":"worker-0"},"level":"warning","msg":"Workflow namespace[cadmatch-training-development]/name[af89l2cwzfjfnjpbxqfh] has already been terminated.","ts":"2023-01-17T14:27:48Z"}
{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","ns":"cadmatch-training-development","routine":"worker-1"},"level":"warning","msg":"Workflow namespace[cadmatch-training-development]/name[af89l2cwzfjfnjpbxqfh] has already been terminated.","ts":"2023-01-17T14:28:02Z"}
Expected behavior
If a task fails, its Pod should be stopped or removed so that it is not automatically restarted by k8s in the background without Flyte noticing.
Additional context to reproduce
Launch a task with an image for which pull creds are missing, so that the Pod ends up in an image pull back-off (a minimal example follows these steps)
Wait until the workflow/task is marked as failed in Flyte
The Pod is still there, but should have been removed
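A minimal sketch of a task/workflow that reproduces this, assuming flytekit; the image reference is a hypothetical private or nonexistent image for which no pull creds exist in the target namespace:

from flytekit import task, workflow

# Hypothetical image: private (no pull secret in the namespace) or simply nonexistent,
# so the Pod goes into "Back-off pulling image".
@task(container_image="ghcr.io/example/private-trainer:latest")
def train_aae() -> None:
    ...

@workflow
def repro_wf() -> None:
    train_aae()

Registering and launching this remotely (e.g. with pyflyte run --remote) leaves the Pod stuck in ImagePullBackOff; once Flyte marks the task as failed, the Pod is still there.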
Screenshots
No response
Are you sure this issue hasn't been raised already?
Yes
Have you read the Code of Conduct?
Yes
flixr changed the title from [BUG] Pod is not deleted if task failed due to "Back-off pulling image.." to [BUG] Pod is not deleted if task failed due to "Back-off pulling image.." and might auto-restart in the background (Jan 17, 2023)
Had a very similar issue in a demo cluster today when the container_image was misspelled. The task was marked as failed, but the Pod was stuck retrying the pull.
Since the Pod still consumed resources, subsequent tasks were stuck pending as "queued", with no indication that the resource limit had been reached (it was visible in the k8s event log, however).
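For anyone hitting the same thing, a hedged sketch (again with the kubernetes Python client) of how the stuck Pod and the related scheduling warnings can be spotted; nothing here is Flyte-specific, it just lists Pending pods and Warning events in the namespace from the report above:

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
namespace = "cadmatch-training-development"

# Pods stuck in "Back-off pulling image" stay in phase Pending but still hold their resource requests.
for pod in v1.list_namespaced_pod(namespace).items:
    if pod.status.phase == "Pending":
        print("pending pod:", pod.metadata.name)

# Warning events usually explain why newly queued tasks don't get scheduled.
for ev in v1.list_namespaced_event(namespace, field_selector="type=Warning").items:
    print(ev.reason, "-", ev.message)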