Delete in-tree support for NVIDIA GPUs. #61498

Merged

rohitagarwal003
Member

This removes the alpha Accelerators feature gate which was deprecated in 1.10 (#57384).
The alternative feature DevicePlugins went beta in 1.10 (#60170).

Fixes #54012

Support for "alpha.kubernetes.io/nvidia-gpu" resource which was deprecated in 1.10 is removed. Please use the resource exposed by DevicePlugins instead ("nvidia.com/gpu").
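
For anyone updating manifests or tooling, the rename in the release note can be sketched in Go as below. This is a minimal illustration only: `migrateGPULimits` is a hypothetical helper, not part of Kubernetes, and the plain `map[string]int64` stands in for a real resource list. The two resource names are the ones quoted in the release note.

```go
package main

import "fmt"

const (
	// Resource name whose support is removed by this PR (from the release note).
	deprecatedGPUResource = "alpha.kubernetes.io/nvidia-gpu"
	// Resource name exposed through DevicePlugins (from the release note).
	devicePluginGPUResource = "nvidia.com/gpu"
)

// migrateGPULimits is a hypothetical helper. It returns a copy of limits in
// which the deprecated GPU key is renamed to the device plugin key. If both
// keys are present, their quantities are summed under the new key.
func migrateGPULimits(limits map[string]int64) map[string]int64 {
	out := make(map[string]int64, len(limits))
	for name, qty := range limits {
		if name == deprecatedGPUResource {
			name = devicePluginGPUResource
		}
		out[name] += qty
	}
	return out
}

func main() {
	old := map[string]int64{"cpu": 2, deprecatedGPUResource: 1}
	fmt.Println(migrateGPULimits(old))
}
```

The input map is left untouched so callers can compare the old and new limits before applying anything.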

@k8s-ci-robot added the release-note, size/XXL, and cncf-cla: yes labels Mar 21, 2018
@k8s-github-robot added the kind/api-change label Mar 21, 2018
@rohitagarwal003
Member Author

/assign @jiayingz @vishh

@rohitagarwal003
Member Author

/area hw-accelerators

@rohitagarwal003 force-pushed the delete-in-tree-gpu branch 2 times, most recently from 9152740 to 5c85338, March 21, 2018 22:40
@rohitagarwal003
Member Author

/assign @bsalamat @derekwaynecarr @liggitt

@jiayingz
Contributor

/lgtm

@k8s-ci-robot added the lgtm label Mar 22, 2018
@@ -17,6 +17,7 @@ limitations under the License.
package e2e_node

import (
"os/exec"
Contributor

The var testPodNamePrefix below is unused; will you remove it?

Member Author

That var was already unused before this PR. I will do unrelated cleanups in another PR; let's keep this PR for just deleting the in-tree support.

@bsalamat
Member

changes in the scheduler code LGTM
/lgtm

@dims
Member

dims commented Mar 27, 2018

/approve

 // IsOvercommitAllowed returns true if the resource is in the default
-// namespace and not blacklisted.
+// namespace and is not hugepages.
 func IsOvercommitAllowed(name core.ResourceName) bool {
Member

this is used in validation. this is used in the kubelet, which is allowed to skew two versions older than the apiserver. can we verify that a 1.10-level kubelet fails in a reasonable way if you specify the alpha resource on a pod with overcommit?

Member Author

If I understand correctly, this is used while validating the pod spec. So, from 1.11 (assuming this PR is merged), the API server will treat alpha.kubernetes.io/nvidia-gpu like any other *kubernetes.io prefixed resource (IsDefaultNamespaceResource()) and will allow pod specs with unequal requests and limits for this resource.

I would hope that when people upgrade their API server to 1.11, they won't submit any new pods using this resource.

But let's say we have a situation with API server running 1.11, kubelet running <1.11 with this alpha feature gate on, and the user submits a pod requesting this resource. In that case, the API server won't check whether requests=limits for this resource. And while assigning resources, kubelet will only look at the limits for this resource (like it does now).
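
The validation behavior described in this comment can be sketched as follows. This is a simplified illustration under stated assumptions, not the Kubernetes source: the lower-case helpers are stand-ins written for this example, the `"kubernetes.io/"` prefix and the `hugepages-` check are assumptions, and plain strings and integers replace the real API types.

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed value of the default-namespace prefix; the thread only says
// "*kubernetes.io prefixed".
const defaultNamespacePrefix = "kubernetes.io/"

// isDefaultNamespaceResource is a simplified stand-in for the
// IsDefaultNamespaceResource() helper mentioned above: unqualified names
// (no "/") and names containing "kubernetes.io/" count as default.
func isDefaultNamespaceResource(name string) bool {
	return !strings.Contains(name, "/") || strings.Contains(name, defaultNamespacePrefix)
}

// isHugePages is a hypothetical check for hugepages resources.
func isHugePages(name string) bool {
	return strings.HasPrefix(name, "hugepages-")
}

// isOvercommitAllowed follows the updated doc comment in the diff: the
// resource is in the default namespace and is not hugepages.
func isOvercommitAllowed(name string) bool {
	return isDefaultNamespaceResource(name) && !isHugePages(name)
}

// validateRequestLimit rejects a request above the limit for every resource,
// and rejects request != limit only when overcommit is not allowed.
func validateRequestLimit(name string, request, limit int64) error {
	if request > limit {
		return fmt.Errorf("%s: request %d exceeds limit %d", name, request, limit)
	}
	if request != limit && !isOvercommitAllowed(name) {
		return fmt.Errorf("%s: request %d must equal limit %d", name, request, limit)
	}
	return nil
}

func main() {
	names := []string{"alpha.kubernetes.io/nvidia-gpu", "nvidia.com/gpu", "hugepages-2Mi"}
	for _, name := range names {
		fmt.Printf("%s: %v\n", name, validateRequestLimit(name, 1, 2))
	}
}
```

With the special case gone, the alpha GPU resource falls under the generic `kubernetes.io/` rule in this sketch, so unequal requests and limits pass validation, while the device plugin resource and hugepages still require request to equal limit.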

Member Author

@rohitagarwal003 Mar 27, 2018

One thing to keep in mind: currently the scheduler will place a pod requesting a *kubernetes.io/ prefixed resource on any node, whether or not that node exposes the resource (I don't know why this is the case). See #50658, Scenario B.

Once we remove the special case for alpha.kubernetes.io/nvidia-gpu, this behavior will apply to alpha.kubernetes.io/nvidia-gpu as well. So, a pod requesting alpha.kubernetes.io/nvidia-gpu could be scheduled to any node.

Note that this is not because of updating IsOvercommitAllowed() but because of removing the special case predicate from the scheduler below.
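
The scheduling behavior described in this comment can be illustrated with a toy fit check. This is a hypothetical simplification written for this discussion, not the scheduler's predicate code: `fitsNode` and the map-based resource lists are assumptions, and the rule it encodes is only the one stated above (requests for `kubernetes.io/` prefixed resources that a node does not advertise are ignored).

```go
package main

import (
	"fmt"
	"strings"
)

// fitsNode is a hypothetical, simplified fit check. It reports whether a
// pod's requests fit a node's allocatable resources, skipping any
// "kubernetes.io/" prefixed resource the node does not advertise, which is
// the behavior described in the comment above.
func fitsNode(requests, allocatable map[string]int64) bool {
	for name, want := range requests {
		have, advertised := allocatable[name]
		if !advertised && strings.Contains(name, "kubernetes.io/") {
			// The request is ignored, so the node is still considered a fit.
			continue
		}
		if want > have {
			return false
		}
	}
	return true
}

func main() {
	cpuOnlyNode := map[string]int64{"cpu": 4}

	// With the special-case predicate removed, the alpha resource is treated
	// like any other kubernetes.io/ resource in this sketch.
	fmt.Println(fitsNode(map[string]int64{"alpha.kubernetes.io/nvidia-gpu": 1}, cpuOnlyNode))

	// A resource outside the kubernetes.io/ domain must be advertised.
	fmt.Println(fitsNode(map[string]int64{"nvidia.com/gpu": 1}, cpuOnlyNode))
}
```

In this sketch the first call reports a fit on a node with no GPUs, while the second does not, which is the gap that #50658 Scenario B describes.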

Contributor

As @mindprince mentioned, the current scheduler behavior is to ignore resource request on non-existing resources in the kubernetes.io/ domain. This would cause the pod to be scheduled on a node without that requested resource. On the node, because Kubelet also runs GeneralPredicate, it would fail the pod during admission if it is running 1.10, which is actually the desired behavior for alpha.kubernetes.io/nvidia-gpu. However, if it is running 1.11, the pod would be started without proper gpu device setup.

@mindprince has initiated the discussion on whether we want to change this scheduler behavior on kubernetes.io/ domain resources in #50658 discussion. For now, I wonder whether we want to fail loudly during validation for resource request on alpha.kubernetes.io/nvidia-gpu to make sure that any users who haven't been aware of the deprecation of Accelerators feature can get the clear signal and move to the device plugin based solution. Then maybe after one or two releases, when #50658 is fully resolved, we can remove this special validation logic. Of course, Accelerators is an alpha feature, so it is debatable whether we want to add this special logic in resource validation code.

Member

> the current scheduler behavior is to ignore resource request on non-existing resources in the kubernetes.io/ domain.

that doesn't seem forward compatible, does it? if a new resource comes along and is requested by a pod, only kubelets that know about and declare they satisfy that resource should be running that pod, right?

> For now, I wonder whether we want to fail loudly during validation for resource request on alpha.kubernetes.io/nvidia-gpu to make sure that any users who haven't been aware of the deprecation of Accelerators feature can get the clear signal and move to the device plugin based solution.

tightening validation brings a host of issues we want to avoid. it is better to let a pod in and it sit unscheduled than to prevent API writes because of stricter validation that can disrupt cleaning up the very resources that are newly considered invalid.

Contributor

Agreed, the desired behavior is to leave the pod pending until the requested resource shows up, which is what #50658 is about. I think @mindprince is working on a change to resolve #50658 Scenario B. I also agree it should be fine to leave the validation part out if both changes are merged in 1.11.

Member Author

@liggitt This is addressed.

@liggitt
Member

liggitt commented Mar 27, 2018

one question about behavior of skewed kubelets with a pod that specifies this resource with overcommit, LGTM otherwise

@k8s-ci-robot added the needs-rebase label Mar 29, 2018
@k8s-ci-robot removed the lgtm and needs-rebase labels Apr 3, 2018
@liggitt
Member

liggitt commented Apr 3, 2018

API changes lgtm
/approve

Contributor

@vishh left a comment

/lgtm
/approve

@k8s-ci-robot added the lgtm label Apr 3, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, dashpole, dims, jiayingz, liggitt, mindprince, vishh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Apr 3, 2018
@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 61498, 62030). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot merged commit 043204b into kubernetes:master Apr 3, 2018
rohitagarwal003 added a commit to rohitagarwal003/test-infra that referenced this pull request Apr 3, 2018
tengqm added a commit to tengqm/website that referenced this pull request May 3, 2018
The in-tree support to GPU is completely removed in release 1.11.
This PR removes the related docs in release-1.11 branch.

xref: kubernetes/kubernetes#61498
mdlinville pushed a commit to tengqm/website that referenced this pull request May 16, 2018
mdlinville pushed a commit to tengqm/website that referenced this pull request May 24, 2018
k8s-ci-robot pushed a commit to kubernetes/website that referenced this pull request May 24, 2018
* Remove docs related to in-tree support to GPU

The in-tree support to GPU is completely removed in release 1.11.
This PR removes the related docs in release-1.11 branch.

xref: kubernetes/kubernetes#61498

* Update content updated by PR to Hugo syntax

Signed-off-by: Misty Stanley-Jones <[email protected]>
k82cn pushed a commit to k82cn/website that referenced this pull request Jun 11, 2018
mdlinville pushed a commit to kubernetes/website that referenced this pull request Jun 20, 2018
mdlinville pushed a commit to kubernetes/website that referenced this pull request Jun 27, 2018
mdlinville pushed a commit to kubernetes/website that referenced this pull request Jun 27, 2018
k8s-ci-robot pushed a commit to kubernetes/website that referenced this pull request Jun 27, 2018
* Separate priority and preemption (#8144)

* Doc about PID pressure condition. (#8211)

* Doc about PID pressure condition.

Signed-off-by: Da K. Ma <[email protected]>

* "so" -> "too"

* Update version selector for 1.11

* StorageObjectInUseProtection is GA (#8291)

* Feature gate: StorageObjectInUseProtection is GA

Update feature gate reference for 1.11

* Trivial commit to re-trigger Netlify

* CRIContainerLogRotation is Beta in 1.11 (#8665)

* Separate priority and preemption (#8144)

* CRIContainerLogRotation is Beta in 1.11

xref: kubernetes/kubernetes#64046

* Bring StorageObjectInUseProtection feature to GA (#8159)

* StorageObjectInUseProtection is GA (#8291)

* Feature gate: StorageObjectInUseProtection is GA

Update feature gate reference for 1.11

* Trivial commit to re-trigger Netlify

* Bring StorageObjectInUseProtection feature to GA

StorageObjectInUseProtection is Beta in K8s 1.10.

It's brought to GA in K8s 1.11.

* Fixed typo and added feature state tags.

* Remove KUBE_API_VERSIONS doc (#8292)

Support for the KUBE_API_VERSIONS environment variable is completely
dropped (no deprecation). This PR removes the related doc in
release-1.11.

xref: kubernetes/kubernetes#63165

* Remove InitialResources from admission controllers (#8293)

The feature (was experimental) is dropped in 1.11.

xref: kubernetes/kubernetes#58784

* Remove docs related to in-tree support to GPU (#8294)

* Remove docs related to in-tree support to GPU

The in-tree support to GPU is completely removed in release 1.11.
This PR removes the related docs in release-1.11 branch.

xref: kubernetes/kubernetes#61498

* Update content updated by PR to Hugo syntax

Signed-off-by: Misty Stanley-Jones <[email protected]>

* Update the doc about extra volume in kubeadm config (#8453)

Signed-off-by: Xianglin Gao <[email protected]>

* Update CRD Subresources for 1.11 (#8519)

* coredns: update notes in administer-cluster/coredns.md (#8697)

CoreDNS is installed by default in 1.11.
Add notes on how to install kube-dns instead.

Update notes about CoreDNS->CoreDNS upgrades as in 1.11
the Corefile is retained.

Add example on upgrading from kube-dns to CoreDNS.

* kubeadm-alpha: CoreDNS related changes (#8727)

Update note about CoreDNS feature gate.

This change also updates a tab as a kubeadm sub-command
will change.

It looks for a new generated file:
generated/kubeadm_alpha_phase_addon_coredns.md
instead of:
generated/kubeadm_alpha_phase_addon_kube-dns.md

* Update cloud controller manager docs to beta 1.11 (#8756)

* Update cloud controller manager docs to beta 1.11

* Use Hugo shortcode for feature state

* kubeadm-upgrade: include new command `kubeadm upgrade diff` (#8617)

Also:
- Include note that this was added in 1.11.
- Modify the note about upgrade guidance.

* independent: update CoreDNS mentions for kubeadm (#8753)

Give CoreDNS instead of kube-dns examples in:
- docs/setup/independent/create-cluster-kubeadm.md
- docs/setup/independent/troubleshooting-kubeadm.md

* update 1.11 --server-print info (#8870)

* update 1.11 --server-print info

* Copyedit

* Mark ExpandPersistentVolumes feature to beta (#8778)

* Update version selector for 1.11

* Mark ExpandPersistentVolumes Beta

xref: kubernetes/kubernetes#64288

* fix shortcode, add placeholder files to fix deploy failures (#8874)

* declare ipvs ga (#8850)

* kubeadm: update info about CoreDNS in kubeadm-init.md (#8728)

Add info to install kube-dns instead of CoreDNS, as CoreDNS
is the default DNS server in 1.11.

Add notes that kubeadm config images can be used to list and pull
the required images in 1.11.

* kubeadm: update implementation-details.md about CoreDNS (#8829)

- Replace examples from kube-dns to CoreDNS
- Add notes about the CoreDNS feature gate status in 1.11
- Add note that the service name for CoreDNS is also
called `kube-dns`

* Update block device support for 1.11 (#8895)

* Update block device support for 1.11

* Copyedits

* Fix typo 'fiber channel' (#8957)

Signed-off-by: Misty Stanley-Jones <[email protected]>

* kubeadm-upgrade: add the 'node [config]' sub-command (#8960)

- Add includes for the generated pages
- Include placeholder generated pages

* kubeadm-init: update the example for the MasterConfiguration (#8958)

- include godocs link for MasterConfiguration
- include example MasterConfiguration
- add note that `kubeadm config print-default` can be used

* kubeadm-config: include new commands (#8862)

Add notes and includes for these new commands in 1.11:
- kubeadm config print-default
- kubeadm config migrate
- kubeadm config images list
- kubeadm config images pull

Include placeholder generated files for the above.

* administer-cluster/coredns: include more changes (#8985)

It was requested that for this page a couple of methods
should be outlined:
- manual installation for CoreDNS explained at the Kubernetes
section of the GitHub project for CoreDNS
- installation and upgrade via kubeadm

Make the above changes and also add a section "About CoreDNS".

This commit also lowercases a section title.

* Update CRD subresources doc for 1.11 (#8918)

* Add docs for volume expansion and online resizing (#8896)

* Add docs for volume expansion going beta

* Copyedit

* Address feedback

* Update exec plugin docs with TLS credentials (#8826)

* Update exec plugin docs with TLS credentials

kubernetes/kubernetes#61803 implements TLS client credential support for
1.11.

* Copyedit

* More copyedits for clarification

* Additional copyedit

* Change token->credential

* NodeRestriction admission prevents kubelet taint removal (#8911)

* dns-custom-namerserver: break down the page into multiple sections (#8900)

* dns-custom-namerserver: break down the page into multiple sections

This page is currently about kube-dns and is a bit outdated.
Introduce the heading `# Customizing kube-dns`.

Introduce a separate section about CoreDNS.

* Copyedits, fix headings for customizing DNS

Hey Lubomir,
I copyedited pretty heavily because this workflow is so much easier for docs and because I'm trying to help improve everything touching kubeadm as much as possible.

But there's one outstanding issue wrt headings and intro content: you can't add a heading 1 to a topic to do what you wanted to do. The page title in the front matter is rendered as a heading 1 and everything else has to start at heading 2. (We still need to doc this better in the docs contributing content, I know.)

Instead, I think we need to rewrite the top-of-page intro content to explain better the relationship between kube-dns and CoreDNS. I'm happy to write something, but I thought I'd push this commit first so you can see what I'm doing.

Hope it's all clear -- ping here or on Slack with any questions ~ Jennifer

* Interim fix for talking about CoreDNS

* Fix CoreDNS details

* PSP readOnly hostPath (#8898)

* Add documentation for crictl (#8880)

* Add documentation for crictl

* Copyedit

Signed-off-by: Misty Stanley-Jones <[email protected]>

* Final copyedit

* VolumeSubpathEnvExpansion alpha feature (#8835)

* Note that Heapster is deprecated (#8827)

* Note that Heapster is deprecated

This notes that Heapster is deprecated, and migrates the relevant
docs to talk about metrics-server or other solutions by default.

* Copyedits and improvements

Signed-off-by: Misty Stanley-Jones <[email protected]>

* Address feedback

* fix shortcode to troubleshoot deploy (#9057)

* update dynamic kubelet config docs for v1.11 (#8766)

* update dynamic kubelet config docs for v1.11

* Substantial copyedit

* Address feedback

* Reference doc for kubeadm (release-1.11) (#9044)

* Reference doc for kubeadm (release-1.11)

* fix shortcode to troubleshoot deploy (#9057)

* Reference doc for kube-components (release-1.11) (#9045)

* Reference doc for kube-components (release-1.11)

* Update cloud-controller-manager.md

* fix shortcode to troubleshoot deploy (#9057)

* Documentation on lowercasing kubeadm init apiserver SANs (#9059)

* Documentation on lowercasing kubeadm init apiserver SANs

* fix shortcode to troubleshoot deploy (#9057)

* Clarification in dynamic Kubelet config doc (#9061)

* Promote sysctls to Beta (#8804)

* Promote sysctls to Beta

* Copyedits

Signed-off-by: Misty Stanley-Jones <[email protected]>

* Review comments

* Address feedback

* More feedback

* kubectl reference docs for 1.11 (#9080)

* Update Kubernetes API 1.11 ref docs (#8977)

* Update v1alpha1 to v1beta1.

* Adjust left nav for 1.11 ref docs.

* Trim list of old ref docs.

* Update Federation API ref docs for 1.11. (#9064)

* Update Federation API ref docs for 1.11.

* Add titles.

* Update definitions.html

* CRD versioning Public Documentation (#8834)

* CRD versioning Public Documentation

* Copyedit

Signed-off-by: Misty Stanley-Jones <[email protected]>

* Address feedback

* More rewrites

* Address feedback

* Update main CRD page in light of versioning

* Reorg CRD docs

* Further reorg

* Tweak title

* CSI documentation update for raw block volume support (#8927)

* CSI documentation update for raw block volume support

* minor edits for "CSI raw block volume support"

Some small grammar and style nits.

* minor CSIBlockVolume edits

* Update kubectl component ref page for 1.11. (#9094)

* Update kubectl component ref page for 1.11.

* Add title. Replace stevepe with username.

* crd versioning doc: fix nits (#9142)

* Update `DynamicKubeletConfig` feature to beta (#9110)

xref: kubernetes/kubernetes#64275

* Documentation for dynamic volume limits based on node type (#8871)

* add cos for storage limits

* Update docs specific for aws and gce

* fix some minor things

* Update storage-limits.md

* Add k8s version to feature-state shortcode

* The Doc update for ScheduleDaemonSetPods (#8842)

Signed-off-by: Da K. Ma <[email protected]>

* Update docs related to PersistentVolumeLabel admission control (#9109)

The said admission controller is disabled by default in 1.11
(kubernetes/kubernetes#64326) and scheduled to be removed in future
release.

* client exec auth: updates for 1.11 (#9154)

* Updates HA kubeadm docs (#9066)

* Updates HA kubeadm docs

Signed-off-by: Chuck Ha <[email protected]>

* kubeadm HA - Add stacked control plane steps

* ssh instructions and some typos in the bash scripts

Signed-off-by: Chuck Ha <[email protected]>

* Fix typos and copypasta errors

* Fix rebase issues

* Integrate more changes

Signed-off-by: Chuck Ha <[email protected]>

* copyedits, layout and formatting fixes

* final copyedits

* Adds a sanity check for load balancer connection

Signed-off-by: Chuck Ha <[email protected]>

* formatting fixes, copyedits

* fix typos, formatting

* Document the Pod Ready++ feature (#9180)

Closes: #9107
Xref: kubernetes/kubernetes#64057

* Mention 'KubeletPluginsWatcher' feature (#9177)

* Mention 'KubeletPluginsWatcher' feature

This feature is oriented more toward developers than users, so simply
mentioning it in the feature gate list should be fine.
In the future, when the design doc is migrated from Google Docs to the
kubernetes/community repo, we can add links to it for users who want to
dig deeper.

Closes: #9108
Xref: kubernetes/kubernetes#63328, kubernetes/kubernetes#64605

* Copyedit

* Amend dynamic volume list docs (#9181)

The dynamic volume list feature has been documented, but the related
feature gate was not there yet.

Closes: #9105

* Document for service account projection (#9182)

This adds docs for the service account projection feature.

Xref: kubernetes/kubernetes#63819, kubernetes/community#1973
Closes: #9102

* Update pod priority and preemption user docs (#9172)

* Update pod priority and preemption user docs

* Copyedit

* Documentation on setting node name with Kubeadm (#8925)

* Documentation on setting node name with Kubeadm

* copyedit

* Add kubeadm upgrade docs for 1.11 (#9089)

* Add kubeadm upgrade docs for 1.11

* Initial docs review feedback

* Add 1-11 to outline

* Fix formatting on tab blocks

* Move file to correct location

* Add `kubeadm upgrade node config` step

* Overzealous ediffing

* copyedit, fix lists and headings

* clarify --force flag for fixing bad state

* Get TOML ready for 1.11 release

* Blog post for 1.11 release (#9254)

* Blog post for 1.11 release

* Update 2018-06-26-kubernetes-1.11-release-announcement.md

* Update 2018-06-26-kubernetes-1.11-release-announcement.md

* Update 2018-06-26-kubernetes-1.11-release-announcement.md
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/hw-accelerators cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.