NIM service deployment fails when "nvidia.com/gpu" toleration is specified #288

Open
nmartorell opened this issue Jan 17, 2025 · 3 comments

@nmartorell

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): AL2023 (AMI amazon-eks-node-al2023-x86_64-nvidia-1.30-v20250103)
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): EKS (v1.30)
  • GPU Operator Version: v24.9.1
  • NIM Operator Version: 1.0.1
  • LLM NIM Versions: N/A
  • NeMo Service Versions: N/A

2. Issue or feature description

NIM Service fails to start when a toleration with key "nvidia.com/gpu" is specified.

3. Steps to reproduce the issue

Deploy a NIM Service with the following YAML:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: 1.0.3
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    pvc:
      create: true
      storageClass: efs-sc
      size: 30Gi
      volumeAccessMode: ReadWriteMany
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
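
The spec above deploys fine on its own; the failure appears once a toleration on key "nvidia.com/gpu" is added under spec, for example (mirroring the auto-added toleration shown in section 4 below):

  tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists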

4. Information to attach

The NIM Service pod fails to start; the only event in the logs is:
error converting unstructured object to Deployment: unrecognized type: string

The NIM Operator pod logs show the following error:

github.com/NVIDIA/k8s-nim-operator/internal/controller/platform/standalone.(*NIMServiceReconciler).renderAndSyncResource
	/workspace/internal/controller/platform/standalone/nimservice.go:285
github.com/NVIDIA/k8s-nim-operator/internal/controller/platform/standalone.(*NIMServiceReconciler).reconcileNIMService
	/workspace/internal/controller/platform/standalone/nimservice.go:245
github.com/NVIDIA/k8s-nim-operator/internal/controller/platform/standalone.(*Standalone).Sync
	/workspace/internal/controller/platform/standalone/standalone.go:115
github.com/NVIDIA/k8s-nim-operator/internal/controller.(*NIMServiceReconciler).Reconcile
	/workspace/internal/controller/nimservice_controller.go:158
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:116
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:303
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:224
2025-01-17T20:14:19Z	DEBUG	events	NIMService metallama38binstruct2 failed, msg: error converting unstructured object to Deployment: unrecognized type: string	{"type": "Warning", "object": {"kind":"NIMService","namespace":"nim-service","name":"metallama38binstruct2","uid":"a0fb54b7-e4c5-4caa-a99f-cf29bf09f929","apiVersion":"apps.nvidia.com/v1alpha1","resourceVersion":"78747"}, "reason": "Failed"}
2025-01-17T20:14:19Z	DEBUG	events	NIMService metallama38binstruct2 failed, msg: error converting unstructured object to Deployment: unrecognized type: string	{"type": "Warning", "object": {"kind":"NIMService","namespace":"nim-service","name":"metallama38binstruct2","uid":"a0fb54b7-e4c5-4caa-a99f-cf29bf09f929","apiVersion":"apps.nvidia.com/v1alpha1","resourceVersion":"78747"}, "reason": "ReconcileFailed"}
2025-01-17T20:14:19Z	ERROR	controllers.NIMService	Unable to update status	{"error": "error converting unstructured object to Deployment: unrecognized type: string"}
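
For context, the error string comes from the step where the operator converts its rendered manifest into a typed Deployment. Below is a minimal sketch of that conversion, assuming the operator uses apimachinery's DefaultUnstructuredConverter (typical for controller-runtime based operators); the obj contents and the malformed tolerations value are illustrative, not the operator's actual rendered output:

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

func main() {
	// A rendered manifest in unstructured form. The tolerations entry is
	// deliberately malformed: a plain string where the Deployment schema
	// expects a list of Toleration objects.
	obj := map[string]interface{}{
		"apiVersion": "apps/v1",
		"kind":       "Deployment",
		"spec": map[string]interface{}{
			"template": map[string]interface{}{
				"spec": map[string]interface{}{
					"tolerations": "nvidia.com/gpu", // wrong type: should be a list
				},
			},
		},
	}

	var d appsv1.Deployment
	// FromUnstructured fails when a field's unstructured value does not
	// match the typed schema; the exact message varies by apimachinery version.
	if err := runtime.DefaultUnstructuredConverter.FromUnstructured(obj, &d); err != nil {
		fmt.Println("error converting unstructured object to Deployment:", err)
	}
}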

I'm not sure, but I think the issue is that the NIM Operator automatically adds a toleration with the same key. When I kubectl edit a NIM Service pod that starts successfully (i.e. one where I didn't manually add the toleration to the YAML file), I can see the following toleration was added automatically:

  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
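
For reference, the tolerations actually applied to a running pod can be inspected directly (the pod name here is a placeholder):

$ kubectl get pod <nimservice-pod> -n nim-service -o jsonpath='{.spec.tolerations}'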
@shivamerla
Collaborator

@nmartorell thanks for reporting this issue. From the code, I don't see us adding this toleration automatically; it might be the admission controller adding it based on GPU requests. We will try to reproduce and verify.
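
If an admission plugin is involved, one likely candidate is ExtendedResourceToleration: when enabled on the API server (--enable-admission-plugins=ExtendedResourceToleration), it automatically adds a toleration with operator: Exists and effect: NoSchedule for every extended resource a pod requests, such as nvidia.com/gpu; that matches the auto-added toleration shown above.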

@shivamerla
Collaborator

I just verified that adding the toleration in the spec works fine. I need to debug the actual issue further.

$ kubectl get nimservice meta-llama3-8b-instruct -o yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  annotations:
  creationTimestamp: "2025-01-17T21:34:38Z"
  finalizers:
  - finalizer.nimservice.apps.nvidia.com
  generation: 2
  name: meta-llama3-8b-instruct
  namespace: nim-operator
  resourceVersion: "412206048"
  uid: 09317777-c993-4160-ae1c-015bbf75ff19
spec:
  authSecret: ngc-api-secret
  expose:
    ingress:
      spec: {}
    service:
      port: 8000
      type: ClusterIP
  image:
    pullPolicy: IfNotPresent
    pullSecrets:
    - ngc-secret
    repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
    tag: 1.3.3
  livenessProbe: {}
  metrics:
    serviceMonitor: {}
  readinessProbe: {}
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: "1"
  scale:
    hpa:
      maxReplicas: 0
  startupProbe: {}
  storage:
    nimCache:
      name: meta-llama3-8b-instruct
    pvc: {}
  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
status:
  conditions:
  - lastTransitionTime: "2025-01-17T21:37:09Z"
    message: |
      deployment "meta-llama3-8b-instruct" successfully rolled out
    reason: Ready
    status: "True"
    type: Ready
  - lastTransitionTime: "2025-01-17T21:34:39Z"
    message: ""
    reason: Ready
    status: "False"
    type: Failed
  state: Ready
$ kubectl get deployment meta-llama3-8b-instruct -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
    nvidia.com/last-applied-hash: c730bf8928c66e37233cb76ffb0009aa4601604410ed85e714532eaf542e50a1
    openshift.io/scc: nonroot
  creationTimestamp: "2025-01-17T21:34:39Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: meta-llama3-8b-instruct
    app.kubernetes.io/managed-by: k8s-nim-operator
    app.kubernetes.io/name: meta-llama3-8b-instruct
    app.kubernetes.io/operator-version: ""
    app.kubernetes.io/part-of: nim-service
  name: meta-llama3-8b-instruct
  namespace: nim-operator
  ownerReferences:
  - apiVersion: apps.nvidia.com/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: NIMService
    name: meta-llama3-8b-instruct
    uid: 09317777-c993-4160-ae1c-015bbf75ff19
  resourceVersion: "412215257"
  uid: eeb4bf5c-ada3-4cb9-b060-a57b9b222b8f
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: meta-llama3-8b-instruct
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        openshift.io/scc: nonroot
      creationTimestamp: null
      labels:
        app: meta-llama3-8b-instruct
        app.kubernetes.io/instance: meta-llama3-8b-instruct
        app.kubernetes.io/managed-by: k8s-nim-operator
        app.kubernetes.io/name: meta-llama3-8b-instruct
        app.kubernetes.io/operator-version: ""
        app.kubernetes.io/part-of: nim-service
    spec:
      containers:
      - env:
        - name: NIM_JSONL_LOGGING
          value: "1"
        - name: NIM_LOG_LEVEL
          value: INFO
        - name: NIM_CACHE_PATH
          value: /model-store
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              key: NGC_API_KEY
              name: ngc-api-secret
        - name: OUTLINES_CACHE_DIR
          value: /tmp/outlines
        - name: NIM_SERVER_PORT
          value: "8000"
        image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /v1/health/live
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: meta-llama3-8b-instruct-ctr
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /v1/health/ready
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            nvidia.com/gpu: "1"
        startupProbe:
          failureThreshold: 180
          httpGet:
            path: /v1/health/ready
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 40
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /model-store
          name: model-store
        - mountPath: /dev/shm
          name: dshm
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: ngc-secret
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: meta-llama3-8b-instruct
      serviceAccountName: meta-llama3-8b-instruct
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
      volumes:
      - emptyDir:
          medium: Memory
        name: dshm
      - name: model-store
        persistentVolumeClaim:
          claimName: meta-llama3-8b-instruct-pvc
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2025-01-17T21:34:39Z"
    lastUpdateTime: "2025-01-17T21:34:39Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2025-01-17T21:34:39Z"
    lastUpdateTime: "2025-01-17T21:37:09Z"
    message: ReplicaSet "meta-llama3-8b-instruct-cc65bd5c8" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

@shivamerla
Collaborator

@nmartorell can you paste the spec that was causing the error? I am not able to repro with the spec you pasted above.
