
[BUG] OOMKilled Pod is not identified as such #4704

Closed
EngHabu opened this issue Jan 9, 2024 · 4 comments
Labels: backlogged (for internal use; reserved for contributor team workflow), bug (something isn't working), exo

Comments

EngHabu (Contributor) commented Jan 9, 2024

Describe the bug

Flytepropeller is supposed to detect when a pod fails due to an OOM error. However, this particular failure was not detected.

Pod Spec after failure:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    primary_container_name: a85hcgdcb7vxhrllnls9-n1-0
  creationTimestamp: "2024-01-09T21:29:40Z"
  labels:
    domain: development
    execution-id: a85hcgdcb7vxhrllnls9
    interruptible: "false"
    node-id: n1
    project: autodoc
    shard-key: "11"
    task-name: <task_name>
    workflow-name: <wf_name>
  name: a85hcgdcb7vxhrllnls9-n1-0
  namespace: <project>-development
  ownerReferences:
  - apiVersion: flyte.lyft.com/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: flyteworkflow
    name: a85hcgdcb7vxhrllnls9
    uid: 222eb674-2b22-4f6e-823d-94f4de99a67b
  resourceVersion: "162737957"
  uid: 7de8a153-2352-4ace-8c3b-a69ee13d1233
spec:
  containers:
  - args:
    - pyflyte-fast-execute
    - --additional-distribution
    - s3://<task_path>.tar.gz
    - --dest-dir
    - /root
    - --
    - pyflyte-execute
    - --inputs
    - s3://<metadata bucket>/inputs.pb
    - --output-prefix
    - s3://<metadata bucket>/n1/data/0
    - --raw-output-data-prefix
    - s3://<metadata bucket>
    - --checkpoint-path
    - s3://<metadata bucket>/_flytecheckpoints
    - --prev-checkpoint
    - '""'
    - --resolver
    - flytekit.core.python_auto_container.default_task_resolver
    - --
    - task-module
    - <task_module>
    - task-name
    - <task_name>
    env:
     ...
    image: <image>
    imagePullPolicy: IfNotPresent
    name: a85hcgdcb7vxhrllnls9-n1-0
    resources:
      limits:
        cpu: "1"
        memory: 1000Mi
      requests:
        cpu: "1"
        memory: 1000Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-f2756
      readOnly: true
    - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      name: aws-iam-token
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: <machine>.us-east-2.compute.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  volumes:
  - name: aws-iam-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 86400
          path: token
  - name: kube-api-access-f2756
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-01-09T21:29:40Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-01-09T21:30:29Z"
    reason: PodFailed
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-01-09T21:30:29Z"
    reason: PodFailed
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-01-09T21:29:40Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://32c7f47d5efe144ff5bee42069b05b5ada4d87cd944f0737654864caebaad388
    image: <container image>
    imageID: <container_image>
    lastState: {}
    name: a85hcgdcb7vxhrllnls9-n1-0
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://32c7f47d5efe144ff5bee42069b05b5ada4d87cd944f0737654864caebaad388
        exitCode: 247
        finishedAt: "2024-01-09T21:30:14Z"
        message: |
          tar: Removing leading `/' from member names

          /usr/local/lib/python3.11/site-packages/langchain/__init__.py:34: UserWarning: Importing BasePromptTemplate from langchain root module is no longer supported. Please use langchain.schema.prompt_template.BasePromptTemplate instead.
            warnings.warn(
          /usr/local/lib/python3.11/site-packages/langchain/__init__.py:34: UserWarning: Importing LLMChain from langchain root module is no longer supported. Please use langchain.chains.LLMChain instead.
            warnings.warn(
          /usr/local/lib/python3.11/site-packages/langchain/__init__.py:34: UserWarning: Importing OpenAI from langchain root module is no longer supported. Please use langchain.llms.OpenAI instead.
            warnings.warn(
          Getting s3://<metadata_bucket>.tar.gz to /root/
        reason: OOMKilled
        startedAt: "2024-01-09T21:30:01Z"
  hostIP: 10.0.133.40
  phase: Failed
  podIP: 10.0.128.230
  podIPs:
  - ip: 10.0.128.230
  qosClass: Guaranteed
  startTime: "2024-01-09T21:29:40Z"

A few notes:

  1. The pod ran for merely 13 seconds before it was OOMKilled.
  2. The code that is supposed to catch that (quoted in the comments below) did not.

Expected behavior

Correctly mark the task as OOMKilled

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@EngHabu EngHabu added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Jan 9, 2024

dosubot bot commented Jan 9, 2024

🤖

Hey @EngHabu! I'm Dosu, a friendly bot here to help you out while we wait for a human maintainer to join us. Just so you know, I'm here to assist with bug troubleshooting, answering questions, and helping you contribute to the project. I'm diving into your issue now and will have a more detailed response for you soon! Hang tight.

EngHabu (Author) commented Jan 9, 2024

cc @hamersaw @pmahindrakar-oss

@hamersaw hamersaw added exo backlogged For internal use. Reserved for contributor team workflow. and removed untriaged This issues has not yet been looked at by the Maintainers labels Jan 10, 2024
andrewwdye (Contributor) commented:

Simple repro for this

import subprocess

from flytekit import Resources, task, workflow


@task(requests=Resources(mem="500Mi"))
def subproc_oom():
    try:
        # tail /dev/zero allocates memory without bound and quickly blows past the 500Mi request
        subprocess.run(["tail", "/dev/zero"], check=True)
    except subprocess.CalledProcessError as e:
        print(f"Return code {e.returncode}")
        raise e


@workflow
def wf():
    subproc_oom()
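
Assuming the snippet is saved as oom_repro.py (a hypothetical filename) and a remote Flyte deployment is configured, it can be launched with something like pyflyte run --remote oom_repro.py wf.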

Task closure

      reasons:
      - message: task submitted to K8s
        occurredAt: "2024-01-19T23:08:52.227398005Z"
      - message: Scheduling
        occurredAt: "2024-01-19T23:08:52.319220617Z"
      - message: Successfully assigned andrew-development/fb914714b21564db2842-n0-0
          to ip-10-105-5-28.us-east-2.compute.internal
        occurredAt: "2024-01-19T23:08:52Z"
      - message: '[ContainersNotReady|ContainerCreating]: containers with unready
          status: [fb914714b21564db2842-n0-0]|'
        occurredAt: "2024-01-19T23:08:52Z"

Pod info

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-01-20T00:42:08Z"
    reason: PodCompleted
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-01-20T00:42:16Z"
    reason: PodCompleted
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-01-20T00:42:16Z"
    reason: PodCompleted
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-01-20T00:42:08Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://1387a658b7918d37d8494b7c2114dc20625c60a47e7ba5179810f9213cd021ca
    image: cr.flyte.org/flyteorg/flytekit:py3.10-1.10.0
    imageID: cr.flyte.org/flyteorg/flytekit@sha256:373e685133affa49aae76e2067c650eea94cad7d674213177b5d2cc57e7517b5
    lastState: {}
    name: a468cm9zzsww88vtss89-n0-0
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://1387a658b7918d37d8494b7c2114dc20625c60a47e7ba5179810f9213cd021ca
        exitCode: 0
        finishedAt: "2024-01-20T00:42:16Z"
        reason: OOMKilled
        startedAt: "2024-01-20T00:42:09Z"
  hostIP: 10.105.5.28
  phase: Succeeded


dosubot bot commented Jan 20, 2024

🤖

Thank you for the detailed bug report. The issue seems to be related to how Flytepropeller is handling the termination state of the containers in the pod.

Flytepropeller checks for OOM errors only if a container's state is terminated. If the container's state is not terminated, the OOM error check will not be executed. Here is the relevant code snippet from pod_helper.go:

var containerState v1.ContainerState
if c.LastTerminationState.Terminated != nil {
    containerState = c.LastTerminationState
} else if c.State.Terminated != nil {
    containerState = c.State
}
if containerState.Terminated != nil {
    if strings.Contains(containerState.Terminated.Reason, OOMKilled) {
        code = OOMKilled
    } else if containerState.Terminated.ExitCode == SIGKILL {
        // in some setups, node termination sends SIGKILL to all the containers running on that node. Capturing and
        // tagging that correctly.
        code = Interrupted
    }

    if containerState.Terminated.ExitCode == 0 {
        message += fmt.Sprintf("\r\n[%v] terminated with ExitCode 0.", c.Name)
    } else {
        message += fmt.Sprintf("\r\n[%v] terminated with exit code (%v). Reason [%v]. Message: \n%v.",
            c.Name,
            containerState.Terminated.ExitCode,
            containerState.Terminated.Reason,
            containerState.Terminated.Message)
    }
}

In this code, containerState.Terminated is checked for nil; only when it is non-nil does the code check whether the termination reason contains OOMKilled. If the container's state is not terminated, this check is never executed and the OOM error goes undetected.
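
For illustration, here is a minimal, self-contained sketch of that check path fed with the container status reported in this issue (no lastState, current state terminated with exit code 247 and reason OOMKilled). The classify helper, the oomKilled constant, and the main harness are assumptions made for this sketch; it only mirrors the logic quoted above, not the actual flyteplugins code.

package main

import (
	"fmt"
	"strings"

	v1 "k8s.io/api/core/v1"
)

// oomKilled stands in for the OOMKilled constant referenced in the snippet above.
const oomKilled = "OOMKilled"

// classify mirrors the quoted selection logic: prefer the last termination
// state, fall back to the current state, then match on the termination reason.
func classify(c v1.ContainerStatus) string {
	var containerState v1.ContainerState
	if c.LastTerminationState.Terminated != nil {
		containerState = c.LastTerminationState
	} else if c.State.Terminated != nil {
		containerState = c.State
	}
	if containerState.Terminated != nil && strings.Contains(containerState.Terminated.Reason, oomKilled) {
		return oomKilled
	}
	return "Unknown"
}

func main() {
	// Container status as reported in the failing pod above.
	status := v1.ContainerStatus{
		Name: "a85hcgdcb7vxhrllnls9-n1-0",
		State: v1.ContainerState{
			Terminated: &v1.ContainerStateTerminated{
				ExitCode: 247,
				Reason:   "OOMKilled",
			},
		},
	}
	fmt.Println(classify(status)) // prints "OOMKilled" for this status
}

Run directly, this sketch prints OOMKilled for the status above.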

If the OOM error is not causing the container to terminate but is still causing issues, you may need to modify the code to check for OOM errors in other container states as well.

I hope this helps clarify the issue. If you have any further questions, please let me know.

