[BUG] OOMKilled Pod is not identified as such #4704
Comments
🤖 Hey @EngHabu! I'm Dosu, a friendly bot here to help you out while we wait for a human maintainer to join us. Just so you know, I'm here to assist with bug troubleshooting, answering questions, and helping you contribute to the project. I'm diving into your issue now and will have a more detailed response for you soon! Hang tight.
Simple repro for this
Task closure
Pod info
🤖 Thank you for the detailed bug report. The issue seems to be related to how Flytepropeller handles the termination state of the containers in the pod. Flytepropeller checks for OOM errors only if a container's state is terminated; if it is not, the OOM check is never executed. Here is the relevant code snippet from pod_helper.go:

```go
var containerState v1.ContainerState
if c.LastTerminationState.Terminated != nil {
	containerState = c.LastTerminationState
} else if c.State.Terminated != nil {
	containerState = c.State
}
if containerState.Terminated != nil {
	if strings.Contains(containerState.Terminated.Reason, OOMKilled) {
		code = OOMKilled
	} else if containerState.Terminated.ExitCode == SIGKILL {
		// in some setups, node termination sends SIGKILL to all the containers running on that node. Capturing and
		// tagging that correctly.
		code = Interrupted
	}
	if containerState.Terminated.ExitCode == 0 {
		message += fmt.Sprintf("\r\n[%v] terminated with ExitCode 0.", c.Name)
	} else {
		message += fmt.Sprintf("\r\n[%v] terminated with exit code (%v). Reason [%v]. Message: \n%v.",
			c.Name,
			containerState.Terminated.ExitCode,
			containerState.Terminated.Reason,
			containerState.Terminated.Message)
	}
}
```

Given this code, if the OOM error does not cause the container to terminate but still causes issues, you may need to modify the code to check for OOM errors in other container states as well.

I hope this helps clarify the issue. If you have any further questions, please let me know.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
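For anyone who wants to poke at the termination check quoted above without a cluster, here is a minimal standalone sketch. It is not Flytepropeller's actual code: the `Terminated`, `ContainerState`, and `ContainerStatus` types are simplified local stand-ins for the Kubernetes `v1` types, `classify` is a hypothetical helper name, the value `137` for `SIGKILL` is an assumption, and returning an empty string for "no special code" is also an assumption, since the snippet does not show the default value of `code`.

```go
package main

import (
	"fmt"
	"strings"
)

// Simplified stand-ins for the Kubernetes v1 types used in the quoted
// snippet. Only the fields that the check reads are modelled.
type Terminated struct {
	ExitCode int32
	Reason   string
	Message  string
}

type ContainerState struct {
	Terminated *Terminated
}

type ContainerStatus struct {
	Name                 string
	State                ContainerState
	LastTerminationState ContainerState
}

const (
	OOMKilled   = "OOMKilled"
	Interrupted = "Interrupted"
	// Assumed value: 128 + 9, the usual exit code of a SIGKILLed process.
	SIGKILL = 137
)

// classify mirrors the branch order of the quoted snippet: prefer
// LastTerminationState, fall back to State, and only inspect the reason
// and exit code when a terminated state exists. It returns "" when no
// special code applies (an assumption made for this sketch).
func classify(c ContainerStatus) string {
	var t *Terminated
	if c.LastTerminationState.Terminated != nil {
		t = c.LastTerminationState.Terminated
	} else if c.State.Terminated != nil {
		t = c.State.Terminated
	}
	if t == nil {
		// Container is running or waiting: the OOM check is never reached.
		return ""
	}
	if strings.Contains(t.Reason, OOMKilled) {
		return OOMKilled
	}
	if t.ExitCode == SIGKILL {
		return Interrupted
	}
	return ""
}

func main() {
	statuses := []ContainerStatus{
		{Name: "oom", State: ContainerState{Terminated: &Terminated{ExitCode: 137, Reason: "OOMKilled"}}},
		{Name: "killed", State: ContainerState{Terminated: &Terminated{ExitCode: 137, Reason: "Error"}}},
		{Name: "running"},
	}
	for _, s := range statuses {
		fmt.Printf("%s: %q\n", s.Name, classify(s))
	}
}
```

Feeding the container statuses from the failed pod into a function like this is one way to see which branch the pod actually takes; the pod spec from the report is not reproduced here, so the three statuses in `main` are made-up inputs.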
Describe the bug
Flytepropeller is supposed to detect if a pod failed due to OOM error. However, this particular one didn't.
Pod Spec after failure:
A few notes:
Expected behavior
Correctly mark the task as OOMKilled
Additional context to reproduce
No response
Screenshots
No response
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?