
AWS: TensorFlow Serving: 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. #727

Closed
karlschriek opened this issue May 20, 2019 · 4 comments
Labels
doc-sprint (Issues to work on during the Kubeflow Doc Sprint), kind/bug

Comments

@karlschriek

karlschriek commented May 20, 2019

When following the guide at https://www.kubeflow.org/docs/components/tfserving_new/, I am unable to serve a model using `ks param set ${MODEL_COMPONENT} numGpus 1`. Doing so results in the error `0/1 nodes are available: 1 Insufficient nvidia.com/gpu.`, which presumably means that the nvidia.com/gpu device plugin has not been deployed. I am at a loss as to how exactly this should be done. The documentation on the NVIDIA website is quite scant, and the link provided in the guide for a GPU example (https://github.com/kubeflow/examples/blob/master/object_detection/tf_serving_gpu.md) offers no explanation whatsoever.
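As a quick diagnosis (a sketch, assuming kubectl is already configured against the cluster; the pod name `testmodel` is illustrative), you can check whether any node actually advertises the `nvidia.com/gpu` resource. If the column is empty for every node, the device plugin is not running:

```shell
# List each node with its allocatable nvidia.com/gpu count.
# Nodes without the device plugin show "<none>" or an empty value here.
kubectl get nodes \
  -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

# Show the scheduling events for the failing pod (pod name is hypothetical).
kubectl describe pod testmodel | grep -A 5 Events
```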

As a side note, if I leave out `ks param set ${MODEL_COMPONENT} numGpus 1` (or set numGpus to 0), serving also fails, resulting in:

Error: failed to start container "testmodel": Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"setenv: invalid argument\"": unknown

EDIT

The solution to this is as follows:

  1. When creating the cluster, a nodeGroup with a GPU instance type such as p3.2xlarge must be created. This automatically provisions instances using the "EKS Optimized with GPU" AMI, as described here: https://docs.aws.amazon.com/eks/latest/userguide/gpu-ami.html

For example:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: mycluster
  region: us-east-1
  version: '1.12'
availabilityZones: ["us-east-1a", "us-east-1b"]

nodeGroups:
  - name: cpu-nodegroup
    instanceType: m5.2xlarge
    desiredCapacity: 1
    minSize: 0
    maxSize: 2
    volumeSize: 30
  - name: gpu-nodegroup
    instanceType: p3.2xlarge
    desiredCapacity: 1
    minSize: 0
    maxSize: 10
    volumeSize: 50
    availabilityZones: ["us-east-1a"]
    iam:
      withAddonPolicies:
        autoScaler: true
    labels:
      'k8s.amazonaws.com/accelerator': 'nvidia-tesla-v100'
```
  2. Thereafter, the NVIDIA device plugin DaemonSet must be deployed, as follows:

```shell
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.12/nvidia-device-plugin.yml
```
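Once the DaemonSet is applied, you can verify that the GPUs are visible to the scheduler (a sketch, assuming kubectl points at the cluster; the DaemonSet name below is what the v1.12 plugin manifest creates, but check with `kubectl get daemonsets -n kube-system` if it differs):

```shell
# The device plugin pod should be Running on every GPU node.
kubectl get daemonset nvidia-device-plugin-daemonset -n kube-system

# Each p3.2xlarge node should now report an allocatable nvidia.com/gpu count of 1.
kubectl get nodes \
  -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```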

I think it is really necessary that the guide describe these requirements.

@issue-label-bot

Issue-Label Bot is automatically applying the label kind/bug to this issue, with a confidence of 0.69.

@sarahmaddox sarahmaddox added the doc-sprint Issues to work on during the Kubeflow Doc Sprint label Jun 6, 2019
@sarahmaddox
Contributor

Note that this issue refers to an AWS deployment. The TensorFlow Serving guide should explain the situation in general terms (for clouds other than AWS) and can give an AWS-specific example where useful.

I'm marking this issue for the doc sprint. It will take some testing to ensure the updates are correct.

@sarahmaddox sarahmaddox changed the title TensorFlow Serving: 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. AWS: TensorFlow Serving: 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. Jan 2, 2020
@Jeffwan
Member

Jeffwan commented Feb 23, 2020

AWS installs the nvidia-device-plugin by default as of 0.7. We can close this issue.

/close

@k8s-ci-robot
Contributor

@Jeffwan: Closing this issue.

In response to this:

AWS installs the nvidia-device-plugin by default as of 0.7. We can close this issue.

/close

