
AWS: TensorFlow Serving: 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. #727

Closed
karlschriek opened this issue May 20, 2019 · 4 comments
Labels
doc-sprint (Issues to work on during the Kubeflow Doc Sprint), kind/bug

Comments

@karlschriek

karlschriek commented May 20, 2019

When following the guide at https://www.kubeflow.org/docs/components/tfserving_new/, I am unable to serve a model using `ks param set ${MODEL_COMPONENT} numGpus 1`. Doing so results in the error `0/1 nodes are available: 1 Insufficient nvidia.com/gpu.`, which presumably means that the nvidia.com/gpu device plugin has not been deployed. I am at a loss as to how exactly this should be done. The documentation on the NVIDIA website is quite scant, and the link provided in the guide for a GPU example (https://github.com/kubeflow/examples/blob/master/object_detection/tf_serving_gpu.md) offers no explanation whatsoever.
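As a quick diagnosis (a sketch, assuming kubectl is already configured against the cluster; the pod name `testmodel` is illustrative), you can check whether any node actually advertises the `nvidia.com/gpu` resource. If the column is empty for every node, the device plugin is not running:

```shell
# List each node with its allocatable nvidia.com/gpu count.
# Nodes without the device plugin show "<none>" or an empty value here.
kubectl get nodes \
  -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

# Show the scheduling events for the failing pod (pod name is hypothetical).
kubectl describe pod testmodel | grep -A 5 Events
```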

As a side note, if I leave out `ks param set ${MODEL_COMPONENT} numGpus 1` (or set numGpus to 0), serving also fails, resulting in:

Error: failed to start container "testmodel": Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"setenv: invalid argument\"": unknown

EDIT

The solution to this is as follows:

  1. When creating the cluster, a nodeGroup with a GPU instance type such as p3.2xlarge must be created. This automatically provisions instances using the "EKS Optimized with GPU" AMI, as described here: https://docs.aws.amazon.com/eks/latest/userguide/gpu-ami.html

For example:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: mycluster
  region: us-east-1
  version: '1.12'
availabilityZones: ["us-east-1a", "us-east-1b"]

nodeGroups:
  - name: cpu-nodegroup
    instanceType: m5.2xlarge
    desiredCapacity: 1
    minSize: 0
    maxSize: 2
    volumeSize: 30
  - name: gpu-nodegroup
    instanceType: p3.2xlarge
    desiredCapacity: 1
    minSize: 0
    maxSize: 10
    volumeSize: 50
    availabilityZones: ["us-east-1a"]
    iam:
      withAddonPolicies:
        autoScaler: true
    labels:
      'k8s.amazonaws.com/accelerator': 'nvidia-tesla-v100'
```
  2. Thereafter, the NVIDIA device plugin DaemonSet must be deployed, as follows:

```shell
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.12/nvidia-device-plugin.yml
```
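Once the DaemonSet is applied, you can verify that the GPUs are visible to the scheduler (a sketch, assuming kubectl points at the cluster; the DaemonSet name below is what the v1.12 plugin manifest creates, but check with `kubectl get daemonsets -n kube-system` if it differs):

```shell
# The device plugin pod should be Running on every GPU node.
kubectl get daemonset nvidia-device-plugin-daemonset -n kube-system

# Each p3.2xlarge node should now report an allocatable nvidia.com/gpu count of 1.
kubectl get nodes \
  -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```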

I think it is really necessary that the guide describe these requirements.

@issue-label-bot

Issue-Label Bot is automatically applying the label kind/bug to this issue, with a confidence of 0.69.

@sarahmaddox sarahmaddox added the doc-sprint Issues to work on during the Kubeflow Doc Sprint label Jun 6, 2019
@sarahmaddox
Contributor

Note that this issue refers to an AWS deployment. The TensorFlow Serving guide should explain the situation in general terms (for clouds other than AWS) and can give an AWS-specific example where useful.

I'm marking this issue for the doc sprint. It will take some testing to ensure the updates are correct.

@sarahmaddox sarahmaddox changed the title TensorFlow Serving: 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. AWS: TensorFlow Serving: 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. Jan 2, 2020
@Jeffwan
Member

Jeffwan commented Feb 23, 2020

AWS installs the nvidia-device-plugin by default as of 0.7. We can close this issue.

/close

@k8s-ci-robot
Contributor

@Jeffwan: Closing this issue.

In response to this:

AWS installs the nvidia-device-plugin by default as of 0.7. We can close this issue.

/close

