nim service deployment fails when "nvidia.com/gpu" toleration is specified #288
Comments
@nmartorell thanks for reporting this issue. From the code, I don't see us adding this toleration automatically; it might be an admission controller adding it based on GPU requests. We will try to reproduce and verify.
I just verified that adding the toleration in the spec works fine. We need to debug the actual issue further.
@nmartorell can you paste the spec that was causing the error? I am not able to reproduce it with the spec you pasted above.
1. Quick Debug Information
2. Issue or feature description
NIM Service fails to start when a toleration with key "nvidia.com/gpu" is specified.
3. Steps to reproduce the issue
Deploy a NIM Service with the following YAML:
4. Information to attach
The NIM Service pod fails to start; the only event in the logs is:
error converting unstructured object to Deployment: unrecognized type: string
The NIM Operator pod logs show the following error:
I'm not sure, but I think the issue is that the NIM Operator automatically adds a toleration with the same key. When I run kubectl edit on a NIM Service pod that starts successfully (i.e. when I don't manually add the toleration to the YAML file), I can see the following toleration automatically added:
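One way to check the duplicate-key hypothesis is to look for toleration keys that appear more than once in a pod's toleration list. The following is a minimal Python sketch, assuming the list has been taken from the `spec.tolerations` field of `kubectl get pod <pod-name> -o json`. The function name `duplicate_toleration_keys` and the sample values are hypothetical, chosen for illustration; they are not the reporter's spec and do not describe how the NIM Operator behaves.

```python
from collections import Counter


def duplicate_toleration_keys(tolerations):
    """Return the sorted toleration keys that occur more than once.

    `tolerations` is a list of dicts shaped like the entries of a pod's
    spec.tolerations. Entries without a "key" field are ignored.
    """
    counts = Counter(t.get("key") for t in tolerations if t.get("key"))
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical sample data, NOT taken from the issue: one toleration as a
# user might write it, plus an identical one added by some other component.
sample = [
    {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"},
    {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"},
    {"key": "node.kubernetes.io/not-ready", "operator": "Exists", "effect": "NoExecute"},
]

print(duplicate_toleration_keys(sample))
```

If the result is empty for the failing pod's tolerations, the duplicate-key explanation becomes less likely and the type-conversion error probably comes from somewhere else in the spec.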