GPU Workloads failing to run on the BottleRocket Nodes #4309
Comments
Can I get an update on the CUDA memory issue?
Hello @uni-raghavendra, thanks for cutting this issue. Can you confirm which version of CUDA you are using and which features are being used? Bottlerocket includes the R535 branch of the Tesla drivers, and the error "CUDA driver version is insufficient for CUDA runtime version" normally indicates that you are using a version of CUDA that needs a different driver. However, since it works once, this might be something else. Can you also confirm whether this was working in previous versions of Bottlerocket, or is this a new workload?
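One way to confirm which driver the pods actually see is to run nvidia-smi inside a GPU-scheduled pod on the Bottlerocket node. The sketch below is only an illustration, not part of the original thread; the pod name and image tag are assumptions.
# driver-check.yaml -- hypothetical; reports the NVIDIA driver version on the node
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-check
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # tag is an assumption
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
Checking the pod logs after it completes shows the driver version (expected to be from the R535 branch) next to the CUDA version the runtime expects.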
Basically we are trying to switch our workloads from Amazon Linux to Bottlerocket OS for the first time. We are using CUDA 12.2, which the current Bottlerocket supports. The first time the pod comes up it runs successfully without any issues; the problem appears when we restart the pod. The node is unable to allocate CUDA memory for the restarted pod, and we only have one workload on that node.
Hi @uni-raghavendra, do you mind providing a bit more information about your setup so I can put together a minimal reproduction? Can you provide the instance type and the pod spec? I can help with further troubleshooting.
Hey @ytsssun, here is the spec, but you need to mount some models; try a dummy model. Currently we fetch them from our own S3 bucket. Try placing some files into /models and bringing the pod up: the first time it will work fine, and from the second time onwards it will go into an error state with the CUDA error. We need a g5 instance type for this. Let me know if you need anything more and we can sync.
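The pod spec referred to here is not reproduced in the thread, so the following is only a hypothetical sketch of what such a workload pod might look like on a g5 node; the image placeholder, names, and the emptyDir volume standing in for the S3-synced models are assumptions, not the reporter's actual manifest.
# gpu-workload.yaml -- hypothetical stand-in for the spec mentioned above
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: g5.2xlarge
  containers:
    - name: inference
      image: <your-inference-image>   # replace with the actual workload image
      volumeMounts:
        - name: models
          mountPath: /models
      resources:
        limits:
          nvidia.com/gpu: 1
  volumes:
    - name: models
      emptyDir: {}   # in the real setup the model files are synced from S3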
@ytsssun, did you find anything on the CUDA memory allocation?
Hi @uni-raghavendra, I did give this a spin. However, I was not able to reproduce the issue. Here is my setup:
# cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: us-west-2
  version: '1.30'
iam:
  withOIDC: true
nodeGroups:
  - name: my-cluster-ng-g5-bottlerocket
    instanceType: g5.2xlarge
    minSize: 0
    desiredCapacity: 1
    maxSize: 3
    availabilityZones: ["us-west-2a"]
    amiFamily: Bottlerocket
    volumeSize: 400
    privateNetworking: true
I tried this on g5.2xlarge and g5.12xlarge (2 GPUs). Both worked for me.
@ytsssun, I have the setup and it doesn't work. Maybe if you are available for a call I could walk you through it. Please let me know what a good time to connect would be; I can share a Google invite.
Thanks for the offer of a call @uni-raghavendra. If you have access to AWS Support you can reach out through them to get a call scheduled, or if you are on the Kubernetes or CNCF Slack you can find me there and we can find a slot that works. It would be easier to sort out times in a Slack DM.
Image I'm using:
OS Image: Bottlerocket OS 1.27.1 (aws-k8s-1.30-nvidia)
Kernel version: 6.1.115
Container runtime: containerd://1.7.22+bottlerocket
Kubelet version: v1.30.4-eks-16b398d
What I expected to happen:
We are running GPU workloads on an AWS EKS Kubernetes cluster. The pods run fine the first time they are scheduled on a Bottlerocket node; when we restart a pod it stays on the same node, and it should eventually come up and work fine.
What actually happened:
Instead, the workload fails with the error below:
W1120 16:31:10.566164 1 pinned_memory_manager.cc:273] "Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version"
I1120 16:31:10.566217 1 cuda_memory_manager.cc:117] "CUDA memory pool disabled"
E1120 16:31:10.566309 1 server.cc:241] "CudaDriverHelper has not been initialized."
I1120 16:31:10.766788 1 model_lifecycle.cc:472] "loading: summarization:400"
How to reproduce the problem:
You can try pulling any of the llama-related or sherpa_onnx workloads.
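The log lines above appear to come from Triton Inference Server (pinned_memory_manager.cc, cuda_memory_manager.cc, and model_lifecycle.cc are Triton source files), so one way to attempt a reproduction is a single-replica Deployment serving a small test model from /models, then deleting the pod so it is recreated on the same node. This is a sketch only; the image tag, names, and model contents are assumptions, not the reporter's setup.
# triton-repro.yaml -- hypothetical reproduction sketch
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-repro
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-repro
  template:
    metadata:
      labels:
        app: triton-repro
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.08-py3   # tag is an assumption
          command: ["tritonserver", "--model-repository=/models"]
          volumeMounts:
            - name: models
              mountPath: /models
          resources:
            limits:
              nvidia.com/gpu: 1
      volumes:
        - name: models
          emptyDir: {}   # populate /models with a small test model before testing
Deleting the pod (for example with kubectl delete pod -l app=triton-repro) makes the ReplicaSet recreate it; on a node with a single GPU workload it lands on the same node, which matches the restart scenario described in this issue.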