GPU Workloads failing to run on the BottleRocket Nodes #4309

Open
uni-raghavendra opened this issue Nov 21, 2024 · 9 comments
Labels
status/needs-triage (Pending triage or re-evaluation), type/bug (Something isn't working)

Comments

@uni-raghavendra

Image I'm using:

OS Image: Bottlerocket OS 1.27.1 (aws-k8s-1.30-nvidia)
Kernel version: 6.1.115
Container runtime: containerd://1.7.22+bottlerocket
Kubelet version: v1.30.4-eks-16b398d

What I expected to happen:

We run GPU workloads on an AWS EKS cluster. The pods run fine the first time they are scheduled on a Bottlerocket OS node. When we restart a pod it stays on the same node, and we expect it to come back up and work fine.

What actually happened:

Instead, the workload fails with the error below:

W1120 16:31:10.566164 1 pinned_memory_manager.cc:273] "Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version"
I1120 16:31:10.566217 1 cuda_memory_manager.cc:117] "CUDA memory pool disabled"
E1120 16:31:10.566309 1 server.cc:241] "CudaDriverHelper has not been initialized."
I1120 16:31:10.766788 1 model_lifecycle.cc:472] "loading: summarization:400"
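
The first message is the key one: "CUDA driver version is insufficient for CUDA runtime version" typically means the CUDA runtime inside the container is newer than what the node's driver supports. A quick way to compare the two from the failing pod is a check along these lines (the pod name is a placeholder):

# nvidia-smi is normally injected into GPU containers by the NVIDIA container
# toolkit; its header shows the node's driver version and the highest CUDA
# runtime that driver supports.
kubectl exec <failing-pod> -- nvidia-smi
# See which CUDA toolkit directories the image itself ships.
kubectl exec <failing-pod> -- ls /usr/local/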

How to reproduce the problem:

You can reproduce this with any of the llama-related or sherpa_onnx workloads.
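
For a reproduction that does not depend on the llama/sherpa_onnx images, a minimal GPU smoke-test pod along the lines below should show whether the restart behaviour is workload-specific (the image tag is an assumption, not taken from the setup above):

# Hypothetical smoke test: a pod that only runs nvidia-smi on one GPU.
cat <<'EOF' > cuda-smoke-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # assumed public tag; any CUDA 12.2 image should do
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"
EOF
kubectl apply -f cuda-smoke-test.yaml
kubectl logs cuda-smoke-test                     # once the pod completes
# Delete and re-apply so it lands on the same (only) GPU node, mirroring the restart:
kubectl delete pod cuda-smoke-test
kubectl apply -f cuda-smoke-test.yaml
kubectl logs cuda-smoke-test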

@uni-raghavendra uni-raghavendra added status/needs-triage Pending triage or re-evaluation type/bug Something isn't working labels Nov 21, 2024
@uni-raghavendra
Author

Can I get an update on the CUDA memory issue?

@yeazelm
Contributor

yeazelm commented Nov 25, 2024

Hello @uni-raghavendra, thanks for cutting this issue. Can you confirm which version of CUDA you are using and which features you are trying to use? Bottlerocket includes the R535 branch of the Tesla drivers, and the error "CUDA driver version is insufficient for CUDA runtime version" normally indicates that you are using a version of CUDA that needs a different driver; but since it works once, it might be something else. Can you also confirm whether this worked in previous versions of Bottlerocket, or is this a new workload?
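
A rough way to confirm that pairing (names below are placeholders): the R535 branch pairs with CUDA 12.2, so a runtime newer than that inside the image would produce exactly this error.

# Driver version the pod actually sees (served by the Bottlerocket host).
kubectl exec <gpu-pod> -- nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Many NVIDIA base images export CUDA_VERSION, which shows the runtime the image ships.
kubectl exec <gpu-pod> -- env | grep -i cuda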

@uni-raghavendra
Author

We are trying to switch our workloads from Amazon Linux to Bottlerocket OS for the first time.

We are using CUDA 12.2, and the current Bottlerocket release supports it. The first time the pod comes up it runs successfully without any issues; the problem only appears when we restart the pod. The node is then unable to allocate CUDA memory for the restarted pod, and we only have one workload on that node.
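
Since the first run succeeds, a couple of checks after a failed restart might help narrow down whether the GPU is still being handed to the new container (node/pod names below are placeholders):

# Is the GPU still advertised and allocatable on the node after the old pod exits?
kubectl describe node <gpu-node> | grep -A2 'nvidia.com/gpu'
# The pod's events show whether a GPU was actually allocated to the restarted container.
kubectl describe pod <restarted-pod>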

@ytsssun
Contributor

ytsssun commented Nov 28, 2024

Hi @uni-raghavendra, do you mind providing a bit more information so we have a minimal reproducible setup? Can you share the instance type and the pod spec? I can help with further troubleshooting.

@uni-raghavendra
Author

Hey @ytsssun,

Here is the spec, but you will need to mount some models; a dummy model is fine. Currently we fetch them from our own S3 bucket, so try placing some files into /models. The first time it comes up it works fine, and from the second time onwards it goes into an error state with the CUDA error. You will need a g5 instance type for this. Let me know if you need anything more and we can sync.

gpu.txt

@uni-raghavendra
Author

@ytsssun, did you find anything on the CUDA memory allocation?

@ytsssun
Contributor

ytsssun commented Dec 4, 2024

Hi @uni-raghavendra, I did give this a spin. However, I was not able to reproduce the issue. Here is my setup:

  1. I deployed the node group by running eksctl create nodegroup -f cluster.yaml with the spec below:
# cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: my-cluster
  region: us-west-2
  version: '1.30'

iam:
  withOIDC: true

nodeGroups:
  - name: my-cluster-ng-g5-bottlerocket
    instanceType: g5.2xlarge
    minSize: 0
    desiredCapacity: 1
    maxSize: 3
    availabilityZones: ["us-west-2a"]
    amiFamily: Bottlerocket
    volumeSize: 400
    privateNetworking: true
  2. Deployed the pod below with kubectl apply -f k8s/triton-pod.yaml:
# Directory structure:
# .
# ├── k8s
# │   └── triton-pod.yaml
# └── model_repository
#     └── dummy
#         ├── 1
#         │   └── model.py
#         └── config.pbtxt

---
# k8s/triton-pod.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dummy-model-files
data:
  "model.py": |
    import triton_python_backend_utils as pb_utils
    import numpy as np
    
    class TritonPythonModel:
        def execute(self, requests):
            responses = []
            for request in requests:
                # Always return "hello"
                output = np.array(["hello"], dtype=np.object_)
                responses.append(pb_utils.InferenceResponse([
                    pb_utils.Tensor("output", output)
                ]))
            return responses
            
  "config.pbtxt": |
    name: "dummy"
    backend: "python"
    max_batch_size: 0
    
    input [
      {
        name: "input"
        data_type: TYPE_STRING
        dims: [ 1 ]
      }
    ]
    
    output [
      {
        name: "output"
        data_type: TYPE_STRING
        dims: [ 1 ]
      }
    ]
    
    instance_group [
      {
        count: 1
        kind: KIND_GPU
      }
    ]

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-server
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 100%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.10-vllm-python-py3
        command:
          - /bin/sh
          - -c
        args:
          - /opt/tritonserver/bin/tritonserver --model-store=/models --model-control-mode=EXPLICIT --load-model=dummy --allow-gpu-metrics=true --allow-metrics=true
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            cpu: "1"
            memory: 8Gi
        ports:
          - containerPort: 8000
            name: http
        volumeMounts:
          - mountPath: /models/dummy/1/model.py
            name: model-files
            subPath: model.py
          - mountPath: /models/dummy/config.pbtxt
            name: model-files
            subPath: config.pbtxt
      nodeSelector:
        node.kubernetes.io/instance-type: g5.2xlarge
      volumes:
        - name: model-files
          configMap:
            name: dummy-model-files
  3. The pod is running correctly, with CUDA memory allocated:
kubectl logs triton-server 
I1204 04:40:39.762512 1 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x7fb4ca000000' with size 268435456"
I1204 04:40:39.766218 1 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I1204 04:40:39.772456 1 model_lifecycle.cc:472] "loading: dummy:1"
I1204 04:40:41.396654 1 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: dummy_0_0 (GPU device 0)"
I1204 04:40:41.628058 1 model_lifecycle.cc:839] "successfully loaded 'dummy'"
I1204 04:40:41.628166 1 server.cc:604] 
+------------------+------+
  4. Triggered a redeployment:
kubectl rollout restart deployment triton-server
  5. After the restart, the pod is running fine with no issue:
kubectl logs triton-server
I1204 04:45:30.186895 1 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x7fa338000000' with size 268435456"
I1204 04:45:30.189883 1 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I1204 04:45:30.195667 1 model_lifecycle.cc:472] "loading: dummy:1"
I1204 04:45:31.797296 1 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: dummy_0_0 (GPU device 0)"
I1204 04:45:32.029037 1 model_lifecycle.cc:839] "successfully loaded 'dummy'"
...

I1204 04:45:32.096156 1 grpc_server.cc:2558] "Started GRPCInferenceService at 0.0.0.0:8001"
I1204 04:45:32.096411 1 http_server.cc:4713] "Started HTTPService at 0.0.0.0:8000"
I1204 04:45:32.258006 1 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"

I tried this on g5.2xlarge and g5.12xlarge (2 GPUs). Both worked for me.
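
In case it is useful for comparing against the failing workload: the main differences from this repro are the S3-backed model volume and whatever image the real workload uses, so dumping and diffing the two pod specs and the GPU-related environment might narrow it down (pod names below are placeholders):

# Full rendered specs of the working repro pod and the failing workload pod.
kubectl get pod <working-pod> -o yaml > working.yaml
kubectl get pod <failing-pod> -o yaml > failing.yaml
diff -u working.yaml failing.yaml
# GPU-related environment injected into the failing container (e.g. NVIDIA_VISIBLE_DEVICES).
kubectl exec <failing-pod> -- env | grep -iE 'nvidia|cuda'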

@uni-raghavendra
Author

@ytsssun, I have the setup and it doesn't work. If you are available for a call, I could walk you through it.

Please let me know a good time to connect and I can share a Google invite.

@yeazelm
Contributor

yeazelm commented Dec 5, 2024

Thanks for the offer of a call @uni-raghavendra. If you have access to AWS support, you can reach out through them to get a call scheduled; or, if you are on the Kubernetes or CNCF Slack, you can find me there and we can find a slot that works. It would be easier to sort out times in a Slack DM.
