GPU Workloads failing to run on the BottleRocket Nodes #4309

Open
uni-raghavendra opened this issue Nov 21, 2024 · 9 comments
Labels
status/needs-triage (Pending triage or re-evaluation), type/bug (Something isn't working)

Comments

@uni-raghavendra

Image I'm using:

OS Image: Bottlerocket OS 1.27.1 (aws-k8s-1.30-nvidia)
Kernel version: 6.1.115
Container runtime: containerd://1.7.22+bottlerocket
Kubelet version: v1.30.4-eks-16b398d

What I expected to happen:

We run GPU workloads on an AWS EKS cluster. The pods run fine the first time they are scheduled on a Bottlerocket OS node. When we restart a pod it stays on the same node, and we expect it to come back up and work fine.

What actually happened:

Instead, the workload fails with the error below:

W1120 16:31:10.566164 1 pinned_memory_manager.cc:273] "Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version"
I1120 16:31:10.566217 1 cuda_memory_manager.cc:117] "CUDA memory pool disabled"
E1120 16:31:10.566309 1 server.cc:241] "CudaDriverHelper has not been initialized."
I1120 16:31:10.766788 1 model_lifecycle.cc:472] "loading: summarization:400"
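
The first message is the key one: "CUDA driver version is insufficient for CUDA runtime version" typically means the CUDA runtime inside the container is newer than what the node's driver supports. A quick way to compare the two from the failing pod is a check along these lines (the pod name is a placeholder):

# nvidia-smi is normally injected into GPU containers by the NVIDIA container
# toolkit; its header shows the node's driver version and the highest CUDA
# runtime that driver supports.
kubectl exec <failing-pod> -- nvidia-smi
# See which CUDA toolkit directories the image itself ships.
kubectl exec <failing-pod> -- ls /usr/local/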

How to reproduce the problem:

You can reproduce this with any of the llama-related or sherpa_onnx workloads.
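
For a reproduction that does not depend on the llama/sherpa_onnx images, a minimal GPU smoke-test pod along the lines below should show whether the restart behaviour is workload-specific (the image tag is an assumption, not taken from the setup above):

# Hypothetical smoke test: a pod that only runs nvidia-smi on one GPU.
cat <<'EOF' > cuda-smoke-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # assumed public tag; any CUDA 12.2 image should do
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"
EOF
kubectl apply -f cuda-smoke-test.yaml
kubectl logs cuda-smoke-test                     # once the pod completes
# Delete and re-apply so it lands on the same (only) GPU node, mirroring the restart:
kubectl delete pod cuda-smoke-test
kubectl apply -f cuda-smoke-test.yaml
kubectl logs cuda-smoke-test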

@uni-raghavendra uni-raghavendra added status/needs-triage Pending triage or re-evaluation type/bug Something isn't working labels Nov 21, 2024
@uni-raghavendra
Author

Can I get an update on the CUDA memory issue?

@yeazelm
Contributor

yeazelm commented Nov 25, 2024

Hello @uni-raghavendra, thanks for cutting this issue. Can you confirm which version of CUDA you are using and which features you are trying to use? Bottlerocket includes the R535 branch of the Tesla drivers, and the error "CUDA driver version is insufficient for CUDA runtime version" normally indicates that you are using a version of CUDA that needs a different driver; but since it works once, it might be something else. Can you also confirm whether this worked in previous versions of Bottlerocket, or is this a new workload?
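
A rough way to confirm that pairing (names below are placeholders): the R535 branch pairs with CUDA 12.2, so a runtime newer than that inside the image would produce exactly this error.

# Driver version the pod actually sees (served by the Bottlerocket host).
kubectl exec <gpu-pod> -- nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Many NVIDIA base images export CUDA_VERSION, which shows the runtime the image ships.
kubectl exec <gpu-pod> -- env | grep -i cuda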

@uni-raghavendra
Author

We are trying to switch our workloads from Amazon Linux to Bottlerocket OS for the first time.

We are using CUDA 12.2, and the current Bottlerocket release supports it. The first time the pod comes up it runs successfully without any issues; the problem only appears when we restart the pod. The node is then unable to allocate CUDA memory for the restarted pod, and we only have one workload on that node.
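
Since the first run succeeds, a couple of checks after a failed restart might help narrow down whether the GPU is still being handed to the new container (node/pod names below are placeholders):

# Is the GPU still advertised and allocatable on the node after the old pod exits?
kubectl describe node <gpu-node> | grep -A2 'nvidia.com/gpu'
# The pod's events show whether a GPU was actually allocated to the restarted container.
kubectl describe pod <restarted-pod>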

@ytsssun
Contributor

ytsssun commented Nov 28, 2024

Hi @uni-raghavendra, do you mind providing a bit more information so we have a minimal reproducible setup? Can you share the instance type and the pod spec? I can help with further troubleshooting.

@uni-raghavendra
Author

Hey @ytsssun,

Here is the spec, but you will need to mount some models; a dummy model is fine. Currently we fetch them from our own S3 bucket, so try placing some files into /models. The first time it comes up it works fine, and from the second time onwards it goes into an error state with the CUDA error. You will need a g5 instance type for this. Let me know if you need anything more and we can sync.

gpu.txt

@uni-raghavendra
Author

@ytsssun, did you find anything on the CUDA memory allocation?

@ytsssun
Contributor

ytsssun commented Dec 4, 2024

Hi @uni-raghavendra, I did give this a spin. However, I was not able to reproduce the issue. Here is my setup:

  1. I deployed the node group by running eksctl create nodegroup -f cluster.yaml with the spec below:
# cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: my-cluster
  region: us-west-2
  version: '1.30'

iam:
  withOIDC: true

nodeGroups:
  - name: my-cluster-ng-g5-bottlerocket
    instanceType: g5.2xlarge
    minSize: 0
    desiredCapacity: 1
    maxSize: 3
    availabilityZones: ["us-west-2a"]
    amiFamily: Bottlerocket
    volumeSize: 400
    privateNetworking: true
  2. Deployed the pod below with kubectl apply -f k8s/triton-pod.yaml:
# Directory structure:
# .
# ├── k8s
# │   └── triton-pod.yaml
# └── model_repository
#     └── dummy
#         ├── 1
#         │   └── model.py
#         └── config.pbtxt

---
# k8s/triton-pod.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dummy-model-files
data:
  "model.py": |
    import triton_python_backend_utils as pb_utils
    import numpy as np
    
    class TritonPythonModel:
        def execute(self, requests):
            responses = []
            for request in requests:
                # Always return "hello"
                output = np.array(["hello"], dtype=np.object_)
                responses.append(pb_utils.InferenceResponse([
                    pb_utils.Tensor("output", output)
                ]))
            return responses
            
  "config.pbtxt": |
    name: "dummy"
    backend: "python"
    max_batch_size: 0
    
    input [
      {
        name: "input"
        data_type: TYPE_STRING
        dims: [ 1 ]
      }
    ]
    
    output [
      {
        name: "output"
        data_type: TYPE_STRING
        dims: [ 1 ]
      }
    ]
    
    instance_group [
      {
        count: 1
        kind: KIND_GPU
      }
    ]

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-server
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 100%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.10-vllm-python-py3
        command:
          - /bin/sh
          - -c
        args:
          - /opt/tritonserver/bin/tritonserver --model-store=/models --model-control-mode=EXPLICIT --load-model=dummy --allow-gpu-metrics=true --allow-metrics=true
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            cpu: "1"
            memory: 8Gi
        ports:
          - containerPort: 8000
            name: http
        volumeMounts:
          - mountPath: /models/dummy/1/model.py
            name: model-files
            subPath: model.py
          - mountPath: /models/dummy/config.pbtxt
            name: model-files
            subPath: config.pbtxt
      nodeSelector:
        node.kubernetes.io/instance-type: g5.2xlarge
      volumes:
        - name: model-files
          configMap:
            name: dummy-model-files
  3. The pod is running correctly, with CUDA memory allocated:
kubectl logs triton-server 
I1204 04:40:39.762512 1 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x7fb4ca000000' with size 268435456"
I1204 04:40:39.766218 1 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I1204 04:40:39.772456 1 model_lifecycle.cc:472] "loading: dummy:1"
I1204 04:40:41.396654 1 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: dummy_0_0 (GPU device 0)"
I1204 04:40:41.628058 1 model_lifecycle.cc:839] "successfully loaded 'dummy'"
I1204 04:40:41.628166 1 server.cc:604] 
+------------------+------+
  4. Triggered a redeployment:
kubectl rollout restart deployment triton-server
  5. After the restart, the pod is running fine with no issue:
kubectl logs triton-server
I1204 04:45:30.186895 1 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x7fa338000000' with size 268435456"
I1204 04:45:30.189883 1 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I1204 04:45:30.195667 1 model_lifecycle.cc:472] "loading: dummy:1"
I1204 04:45:31.797296 1 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: dummy_0_0 (GPU device 0)"
I1204 04:45:32.029037 1 model_lifecycle.cc:839] "successfully loaded 'dummy'"
...

I1204 04:45:32.096156 1 grpc_server.cc:2558] "Started GRPCInferenceService at 0.0.0.0:8001"
I1204 04:45:32.096411 1 http_server.cc:4713] "Started HTTPService at 0.0.0.0:8000"
I1204 04:45:32.258006 1 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"

I tried this on g5.2xlarge and g5.12xlarge (2 GPUs). Both worked for me.
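
In case it is useful for comparing against the failing workload: the main differences from this repro are the S3-backed model volume and whatever image the real workload uses, so dumping and diffing the two pod specs and the GPU-related environment might narrow it down (pod names below are placeholders):

# Full rendered specs of the working repro pod and the failing workload pod.
kubectl get pod <working-pod> -o yaml > working.yaml
kubectl get pod <failing-pod> -o yaml > failing.yaml
diff -u working.yaml failing.yaml
# GPU-related environment injected into the failing container (e.g. NVIDIA_VISIBLE_DEVICES).
kubectl exec <failing-pod> -- env | grep -iE 'nvidia|cuda'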

@uni-raghavendra
Author

@ytsssun, I have the setup and it doesn't work. If you are available for a call, I could walk you through it.

Please let me know a good time to connect and I can share a Google invite.

@yeazelm
Contributor

yeazelm commented Dec 5, 2024

Thanks for the offer of a call @uni-raghavendra. If you have access to AWS support, you can reach out through them to get a call scheduled; or, if you are on the Kubernetes or CNCF Slack, you can find me there and we can find a slot that works. It would be easier to sort out times in a Slack DM.
