NIM service deployment fails when "nvidia.com/gpu" toleration is specified #288

Open
nmartorell opened this issue Jan 17, 2025 · 3 comments

@nmartorell

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): AL2023 (AMI amazon-eks-node-al2023-x86_64-nvidia-1.30-v20250103)
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): EKS (v1.30)
  • GPU Operator Version: v24.9.1
  • NIM Operator Version: 1.0.1
  • LLM NIM Versions: N/A
  • NeMo Service Versions: N/A

2. Issue or feature description

NIM Service fails to start when a toleration with key "nvidia.com/gpu" is specified.

3. Steps to reproduce the issue

Deploy a NIM Service with the following YAML:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: 1.0.3
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    pvc:
      create: true
      storageClass: efs-sc
      size: 30Gi
      volumeAccessMode: ReadWriteMany
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
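
The spec above deploys fine on its own; the failure appears once a toleration on key "nvidia.com/gpu" is added under spec, for example (mirroring the auto-added toleration shown in section 4 below):

  tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists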

4. Information to attach

The NIM Service pod fails to start; the only event in the logs is:
error converting unstructured object to Deployment: unrecognized type: string

The NIM Operator pod logs show the following error:

github.com/NVIDIA/k8s-nim-operator/internal/controller/platform/standalone.(*NIMServiceReconciler).renderAndSyncResource
	/workspace/internal/controller/platform/standalone/nimservice.go:285
github.com/NVIDIA/k8s-nim-operator/internal/controller/platform/standalone.(*NIMServiceReconciler).reconcileNIMService
	/workspace/internal/controller/platform/standalone/nimservice.go:245
github.com/NVIDIA/k8s-nim-operator/internal/controller/platform/standalone.(*Standalone).Sync
	/workspace/internal/controller/platform/standalone/standalone.go:115
github.com/NVIDIA/k8s-nim-operator/internal/controller.(*NIMServiceReconciler).Reconcile
	/workspace/internal/controller/nimservice_controller.go:158
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:116
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:303
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:224
2025-01-17T20:14:19Z	DEBUG	events	NIMService metallama38binstruct2 failed, msg: error converting unstructured object to Deployment: unrecognized type: string	{"type": "Warning", "object": {"kind":"NIMService","namespace":"nim-service","name":"metallama38binstruct2","uid":"a0fb54b7-e4c5-4caa-a99f-cf29bf09f929","apiVersion":"apps.nvidia.com/v1alpha1","resourceVersion":"78747"}, "reason": "Failed"}
2025-01-17T20:14:19Z	DEBUG	events	NIMService metallama38binstruct2 failed, msg: error converting unstructured object to Deployment: unrecognized type: string	{"type": "Warning", "object": {"kind":"NIMService","namespace":"nim-service","name":"metallama38binstruct2","uid":"a0fb54b7-e4c5-4caa-a99f-cf29bf09f929","apiVersion":"apps.nvidia.com/v1alpha1","resourceVersion":"78747"}, "reason": "ReconcileFailed"}
2025-01-17T20:14:19Z	ERROR	controllers.NIMService	Unable to update status	{"error": "error converting unstructured object to Deployment: unrecognized type: string"}
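
For context, the error string comes from the step where the operator converts its rendered manifest into a typed Deployment. Below is a minimal sketch of that conversion, assuming the operator uses apimachinery's DefaultUnstructuredConverter (typical for controller-runtime based operators); the obj contents and the malformed tolerations value are illustrative, not the operator's actual rendered output:

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

func main() {
	// A rendered manifest in unstructured form. The tolerations entry is
	// deliberately malformed: a plain string where the Deployment schema
	// expects a list of Toleration objects.
	obj := map[string]interface{}{
		"apiVersion": "apps/v1",
		"kind":       "Deployment",
		"spec": map[string]interface{}{
			"template": map[string]interface{}{
				"spec": map[string]interface{}{
					"tolerations": "nvidia.com/gpu", // wrong type: should be a list
				},
			},
		},
	}

	var d appsv1.Deployment
	// FromUnstructured fails when a field's unstructured value does not
	// match the typed schema; the exact message varies by apimachinery version.
	if err := runtime.DefaultUnstructuredConverter.FromUnstructured(obj, &d); err != nil {
		fmt.Println("error converting unstructured object to Deployment:", err)
	}
}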

I'm not sure, but I think the issue is that the NIM Operator automatically adds a toleration with the same key. When I kubectl edit a NIM Service pod that starts successfully (i.e. one where I didn't manually add the toleration to the YAML file), I can see the following toleration was added automatically:

  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
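
For reference, the tolerations actually applied to a running pod can be inspected directly (the pod name here is a placeholder):

$ kubectl get pod <nimservice-pod> -n nim-service -o jsonpath='{.spec.tolerations}'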
@shivamerla
Collaborator

@nmartorell thanks for reporting this issue. From the code, I don't see us adding this toleration automatically; it might be the admission controller adding it based on GPU requests. We will try to reproduce and verify.
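
If an admission plugin is involved, one likely candidate is ExtendedResourceToleration: when enabled on the API server (--enable-admission-plugins=ExtendedResourceToleration), it automatically adds a toleration with operator: Exists and effect: NoSchedule for every extended resource a pod requests, such as nvidia.com/gpu; that matches the auto-added toleration shown above.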

@shivamerla
Collaborator

I just verified that adding the toleration in the spec works fine. I need to debug the actual issue further.

$ kubectl get nimservice meta-llama3-8b-instruct -o yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  annotations:
  creationTimestamp: "2025-01-17T21:34:38Z"
  finalizers:
  - finalizer.nimservice.apps.nvidia.com
  generation: 2
  name: meta-llama3-8b-instruct
  namespace: nim-operator
  resourceVersion: "412206048"
  uid: 09317777-c993-4160-ae1c-015bbf75ff19
spec:
  authSecret: ngc-api-secret
  expose:
    ingress:
      spec: {}
    service:
      port: 8000
      type: ClusterIP
  image:
    pullPolicy: IfNotPresent
    pullSecrets:
    - ngc-secret
    repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
    tag: 1.3.3
  livenessProbe: {}
  metrics:
    serviceMonitor: {}
  readinessProbe: {}
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: "1"
  scale:
    hpa:
      maxReplicas: 0
  startupProbe: {}
  storage:
    nimCache:
      name: meta-llama3-8b-instruct
    pvc: {}
  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
status:
  conditions:
  - lastTransitionTime: "2025-01-17T21:37:09Z"
    message: |
      deployment "meta-llama3-8b-instruct" successfully rolled out
    reason: Ready
    status: "True"
    type: Ready
  - lastTransitionTime: "2025-01-17T21:34:39Z"
    message: ""
    reason: Ready
    status: "False"
    type: Failed
  state: Ready
$ kubectl get deployment meta-llama3-8b-instruct -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
    nvidia.com/last-applied-hash: c730bf8928c66e37233cb76ffb0009aa4601604410ed85e714532eaf542e50a1
    openshift.io/scc: nonroot
  creationTimestamp: "2025-01-17T21:34:39Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: meta-llama3-8b-instruct
    app.kubernetes.io/managed-by: k8s-nim-operator
    app.kubernetes.io/name: meta-llama3-8b-instruct
    app.kubernetes.io/operator-version: ""
    app.kubernetes.io/part-of: nim-service
  name: meta-llama3-8b-instruct
  namespace: nim-operator
  ownerReferences:
  - apiVersion: apps.nvidia.com/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: NIMService
    name: meta-llama3-8b-instruct
    uid: 09317777-c993-4160-ae1c-015bbf75ff19
  resourceVersion: "412215257"
  uid: eeb4bf5c-ada3-4cb9-b060-a57b9b222b8f
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: meta-llama3-8b-instruct
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        openshift.io/scc: nonroot
      creationTimestamp: null
      labels:
        app: meta-llama3-8b-instruct
        app.kubernetes.io/instance: meta-llama3-8b-instruct
        app.kubernetes.io/managed-by: k8s-nim-operator
        app.kubernetes.io/name: meta-llama3-8b-instruct
        app.kubernetes.io/operator-version: ""
        app.kubernetes.io/part-of: nim-service
    spec:
      containers:
      - env:
        - name: NIM_JSONL_LOGGING
          value: "1"
        - name: NIM_LOG_LEVEL
          value: INFO
        - name: NIM_CACHE_PATH
          value: /model-store
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              key: NGC_API_KEY
              name: ngc-api-secret
        - name: OUTLINES_CACHE_DIR
          value: /tmp/outlines
        - name: NIM_SERVER_PORT
          value: "8000"
        image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /v1/health/live
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: meta-llama3-8b-instruct-ctr
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /v1/health/ready
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            nvidia.com/gpu: "1"
        startupProbe:
          failureThreshold: 180
          httpGet:
            path: /v1/health/ready
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 40
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /model-store
          name: model-store
        - mountPath: /dev/shm
          name: dshm
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: ngc-secret
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: meta-llama3-8b-instruct
      serviceAccountName: meta-llama3-8b-instruct
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
      volumes:
      - emptyDir:
          medium: Memory
        name: dshm
      - name: model-store
        persistentVolumeClaim:
          claimName: meta-llama3-8b-instruct-pvc
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2025-01-17T21:34:39Z"
    lastUpdateTime: "2025-01-17T21:34:39Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2025-01-17T21:34:39Z"
    lastUpdateTime: "2025-01-17T21:37:09Z"
    message: ReplicaSet "meta-llama3-8b-instruct-cc65bd5c8" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

@shivamerla
Collaborator

@nmartorell can you paste the spec that was causing the error? I am not able to repro with the spec you pasted above.
