
[BUG] OOMKilled Pod is not identified as such #4704

Closed
EngHabu opened this issue Jan 9, 2024 · 4 comments
Labels: backlogged (for internal use; reserved for contributor team workflow), bug (something isn't working), exo

Comments

EngHabu (Contributor) commented Jan 9, 2024

Describe the bug

Flytepropeller is supposed to detect when a pod fails due to an OOM error. However, this particular failure was not detected.

Pod Spec after failure:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    primary_container_name: a85hcgdcb7vxhrllnls9-n1-0
  creationTimestamp: "2024-01-09T21:29:40Z"
  labels:
    domain: development
    execution-id: a85hcgdcb7vxhrllnls9
    interruptible: "false"
    node-id: n1
    project: autodoc
    shard-key: "11"
    task-name: <task_name>
    workflow-name: <wf_name>
  name: a85hcgdcb7vxhrllnls9-n1-0
  namespace: <project>-development
  ownerReferences:
  - apiVersion: flyte.lyft.com/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: flyteworkflow
    name: a85hcgdcb7vxhrllnls9
    uid: 222eb674-2b22-4f6e-823d-94f4de99a67b
  resourceVersion: "162737957"
  uid: 7de8a153-2352-4ace-8c3b-a69ee13d1233
spec:
  containers:
  - args:
    - pyflyte-fast-execute
    - --additional-distribution
    - s3://<task_path>.tar.gz
    - --dest-dir
    - /root
    - --
    - pyflyte-execute
    - --inputs
    - s3://<metadata bucket>/inputs.pb
    - --output-prefix
    - s3://<metadata bucket>/n1/data/0
    - --raw-output-data-prefix
    - s3://<metadata bucket>
    - --checkpoint-path
    - s3://<metadata bucket>/_flytecheckpoints
    - --prev-checkpoint
    - '""'
    - --resolver
    - flytekit.core.python_auto_container.default_task_resolver
    - --
    - task-module
    - <task_module>
    - task-name
    - <task_name>
    env:
     ...
    image: <image>
    imagePullPolicy: IfNotPresent
    name: a85hcgdcb7vxhrllnls9-n1-0
    resources:
      limits:
        cpu: "1"
        memory: 1000Mi
      requests:
        cpu: "1"
        memory: 1000Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-f2756
      readOnly: true
    - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      name: aws-iam-token
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: <machine>.us-east-2.compute.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  volumes:
  - name: aws-iam-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 86400
          path: token
  - name: kube-api-access-f2756
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-01-09T21:29:40Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-01-09T21:30:29Z"
    reason: PodFailed
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-01-09T21:30:29Z"
    reason: PodFailed
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-01-09T21:29:40Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://32c7f47d5efe144ff5bee42069b05b5ada4d87cd944f0737654864caebaad388
    image: <container image>
    imageID: <container_image>
    lastState: {}
    name: a85hcgdcb7vxhrllnls9-n1-0
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://32c7f47d5efe144ff5bee42069b05b5ada4d87cd944f0737654864caebaad388
        exitCode: 247
        finishedAt: "2024-01-09T21:30:14Z"
        message: |
          tar: Removing leading `/' from member names

          /usr/local/lib/python3.11/site-packages/langchain/__init__.py:34: UserWarning: Importing BasePromptTemplate from langchain root module is no longer supported. Please use langchain.schema.prompt_template.BasePromptTemplate instead.
            warnings.warn(
          /usr/local/lib/python3.11/site-packages/langchain/__init__.py:34: UserWarning: Importing LLMChain from langchain root module is no longer supported. Please use langchain.chains.LLMChain instead.
            warnings.warn(
          /usr/local/lib/python3.11/site-packages/langchain/__init__.py:34: UserWarning: Importing OpenAI from langchain root module is no longer supported. Please use langchain.llms.OpenAI instead.
            warnings.warn(
          Getting s3://<metadata_bucket>.tar.gz to /root/
        reason: OOMKilled
        startedAt: "2024-01-09T21:30:01Z"
  hostIP: 10.0.133.40
  phase: Failed
  podIP: 10.0.128.230
  podIPs:
  - ip: 10.0.128.230
  qosClass: Guaranteed
  startTime: "2024-01-09T21:29:40Z"

A few notes:

  1. The pod ran for merely 13 seconds before it was OOMKilled.
  2. The code that is supposed to catch that (quoted in the comments below) did not.

Expected behavior

Correctly mark the task as OOMKilled

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@EngHabu EngHabu added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Jan 9, 2024

dosubot bot commented Jan 9, 2024

🤖

Hey @EngHabu! I'm Dosu, a friendly bot here to help you out while we wait for a human maintainer to join us. Just so you know, I'm here to assist with bug troubleshooting, answering questions, and helping you contribute to the project. I'm diving into your issue now and will have a more detailed response for you soon! Hang tight.

EngHabu (Author) commented Jan 9, 2024

cc @hamersaw @pmahindrakar-oss

@hamersaw hamersaw added exo backlogged For internal use. Reserved for contributor team workflow. and removed untriaged This issues has not yet been looked at by the Maintainers labels Jan 10, 2024
andrewwdye (Contributor) commented:

Simple repro for this

import subprocess

from flytekit import Resources, task, workflow


@task(requests=Resources(mem="500Mi"))
def subproc_oom():
    try:
        # tail /dev/zero allocates memory without bound and quickly blows past the 500Mi request
        subprocess.run(["tail", "/dev/zero"], check=True)
    except subprocess.CalledProcessError as e:
        print(f"Return code {e.returncode}")
        raise e


@workflow
def wf():
    subproc_oom()
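
Assuming the snippet is saved as oom_repro.py (a hypothetical filename) and a remote Flyte deployment is configured, it can be launched with something like pyflyte run --remote oom_repro.py wf.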

Task closure

      reasons:
      - message: task submitted to K8s
        occurredAt: "2024-01-19T23:08:52.227398005Z"
      - message: Scheduling
        occurredAt: "2024-01-19T23:08:52.319220617Z"
      - message: Successfully assigned andrew-development/fb914714b21564db2842-n0-0
          to ip-10-105-5-28.us-east-2.compute.internal
        occurredAt: "2024-01-19T23:08:52Z"
      - message: '[ContainersNotReady|ContainerCreating]: containers with unready
          status: [fb914714b21564db2842-n0-0]|'
        occurredAt: "2024-01-19T23:08:52Z"

Pod info

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-01-20T00:42:08Z"
    reason: PodCompleted
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-01-20T00:42:16Z"
    reason: PodCompleted
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-01-20T00:42:16Z"
    reason: PodCompleted
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-01-20T00:42:08Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://1387a658b7918d37d8494b7c2114dc20625c60a47e7ba5179810f9213cd021ca
    image: cr.flyte.org/flyteorg/flytekit:py3.10-1.10.0
    imageID: cr.flyte.org/flyteorg/flytekit@sha256:373e685133affa49aae76e2067c650eea94cad7d674213177b5d2cc57e7517b5
    lastState: {}
    name: a468cm9zzsww88vtss89-n0-0
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://1387a658b7918d37d8494b7c2114dc20625c60a47e7ba5179810f9213cd021ca
        exitCode: 0
        finishedAt: "2024-01-20T00:42:16Z"
        reason: OOMKilled
        startedAt: "2024-01-20T00:42:09Z"
  hostIP: 10.105.5.28
  phase: Succeeded


dosubot bot commented Jan 20, 2024

🤖

Thank you for the detailed bug report. The issue seems to be related to how Flytepropeller is handling the termination state of the containers in the pod.

Flytepropeller checks for OOM errors only if a container's state is terminated. If the container's state is not terminated, the OOM error check will not be executed. Here is the relevant code snippet from pod_helper.go:

var containerState v1.ContainerState
if c.LastTerminationState.Terminated != nil {
    containerState = c.LastTerminationState
} else if c.State.Terminated != nil {
    containerState = c.State
}
if containerState.Terminated != nil {
    if strings.Contains(containerState.Terminated.Reason, OOMKilled) {
        code = OOMKilled
    } else if containerState.Terminated.ExitCode == SIGKILL {
        // in some setups, node termination sends SIGKILL to all the containers running on that node. Capturing and
        // tagging that correctly.
        code = Interrupted
    }

    if containerState.Terminated.ExitCode == 0 {
        message += fmt.Sprintf("\r\n[%v] terminated with ExitCode 0.", c.Name)
    } else {
        message += fmt.Sprintf("\r\n[%v] terminated with exit code (%v). Reason [%v]. Message: \n%v.",
            c.Name,
            containerState.Terminated.ExitCode,
            containerState.Terminated.Reason,
            containerState.Terminated.Message)
    }
}

In this code, containerState.Terminated is checked for nil; only when it is non-nil does the code check whether the termination reason contains OOMKilled. If the container's state is not terminated, this check is never executed and the OOM error goes undetected.
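
For illustration, here is a minimal, self-contained sketch of that check path fed with the container status reported in this issue (no lastState, current state terminated with exit code 247 and reason OOMKilled). The classify helper, the oomKilled constant, and the main harness are assumptions made for this sketch; it only mirrors the logic quoted above, not the actual flyteplugins code.

package main

import (
	"fmt"
	"strings"

	v1 "k8s.io/api/core/v1"
)

// oomKilled stands in for the OOMKilled constant referenced in the snippet above.
const oomKilled = "OOMKilled"

// classify mirrors the quoted selection logic: prefer the last termination
// state, fall back to the current state, then match on the termination reason.
func classify(c v1.ContainerStatus) string {
	var containerState v1.ContainerState
	if c.LastTerminationState.Terminated != nil {
		containerState = c.LastTerminationState
	} else if c.State.Terminated != nil {
		containerState = c.State
	}
	if containerState.Terminated != nil && strings.Contains(containerState.Terminated.Reason, oomKilled) {
		return oomKilled
	}
	return "Unknown"
}

func main() {
	// Container status as reported in the failing pod above.
	status := v1.ContainerStatus{
		Name: "a85hcgdcb7vxhrllnls9-n1-0",
		State: v1.ContainerState{
			Terminated: &v1.ContainerStateTerminated{
				ExitCode: 247,
				Reason:   "OOMKilled",
			},
		},
	}
	fmt.Println(classify(status)) // prints "OOMKilled" for this status
}

Run directly, this sketch prints OOMKilled for the status above.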

If the OOM error is not causing the container to terminate but is still causing issues, you may need to modify the code to check for OOM errors in other container states as well.

I hope this helps clarify the issue. If you have any further questions, please let me know.

