-
Notifications
You must be signed in to change notification settings - Fork 39.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Delete in-tree support for NVIDIA GPUs. #61498
Delete in-tree support for NVIDIA GPUs. #61498
Conversation
fb88c3d
to
d2b683e
Compare
/area hw-accelerators |
9152740
to
5c85338
Compare
/assign @bsalamat @derekwaynecarr @liggitt |
/lgtm |
@@ -17,6 +17,7 @@ limitations under the License. | |||
package e2e_node | |||
|
|||
import ( | |||
"os/exec" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the below var testPodNamePrefix is useless, will you remove it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That var was not used before as well. I will do unrelated cleanups in another PR, let's keep this PR for just deleting the in-tree support.
changes in the scheduler code LGTM |
/approve |
// IsOvercommitAllowed returns true if the resource is in the default | ||
// namespace and not blacklisted. | ||
// namespace and is not hugepages. | ||
func IsOvercommitAllowed(name core.ResourceName) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is used in validation. this is used in the kubelet, which is allowed to skew two versions older than the apiserver. can we verify that a 1.10-level kubelet fails in a reasonable way if you specify the alpha resource on a pod with overcommit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, this is used while validating the pod spec. So, from 1.11 (assuming this PR is merged), the API server will treat alpha.kubernetes.io/nvidia-gpu
like any other *kubernetes.io
prefixed resource (IsDefaultNamespaceResource()
) and will allow pod specs with unequal requests and limits for this resource.
I would hope that when people upgrade their API server to 1.11, they won't submit any new pods using this resource.
But let's say we have a situation with API server running 1.11, kubelet running <1.11 with this alpha feature gate on, and the user submits a pod requesting this resource. In that case, the API server won't check whether requests=limits for this resource. And while assigning resources, kubelet will only look at the limits for this resource (like it does now).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing to keep in mind would be that currently the scheduler would schedule a pod requesting *kubernetes.io/
prefixed resources to any node whether that node is exposing that resource or not (I don't know why this is the case). #50658 Scenario B
Once we remove the special case for alpha.kubernetes.io/nvidia-gpu
, this behavior will apply to alpha.kubernetes.io/nvidia-gpu
as well. So, a pod requesting alpha.kubernetes.io/nvidia-gpu
could be scheduled to any node.
Note that this is not because of updating IsOvercommitAllowed()
but because of removing the special case predicate from the scheduler below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @mindprince mentioned, the current scheduler behavior is to ignore resource request on non-existing resources in the kubernetes.io/ domain. This would cause the pod to be scheduled on a node without that requested resource. On the node, because Kubelet also runs GeneralPredicate, it would fail the pod during admission if it is running 1.10, which is actually the desired behavior for alpha.kubernetes.io/nvidia-gpu. However, if it is running 1.11, the pod would be started without proper gpu device setup.
@mindprince has initiated the discussion on whether we want to change this scheduler behavior on kubernetes.io/ domain resources in #50658 discussion. For now, I wonder whether we want to fail loudly during validation for resource request on alpha.kubernetes.io/nvidia-gpu to make sure that any users who haven't been aware of the deprecation of Accelerators feature can get the clear signal and move to the device plugin based solution. Then maybe after one or two releases, when #50658 is fully resolved, we can remove this special validation logic. Of course, Accelerators is an alpha feature, so it is debatable whether we want to add this special logic in resource validation code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the current scheduler behavior is to ignore resource request on non-existing resources in the kubernetes.io/ domain.
that doesn't seem forward compatible, does it? if a new resource comes along and is requested by a pod, only kubelets that know about and declare they satisfy that resource should be running that pod, right?
For now, I wonder whether we want to fail loudly during validation for resource request on alpha.kubernetes.io/nvidia-gpu to make sure that any users who haven't been aware of the deprecation of Accelerators feature can get the clear signal and move to the device plugin based solution.
tightening validation brings a host of issues we want to avoid. it is better to let a pod in and it sit unscheduled than to prevent API writes because of stricter validation that can disrupt cleaning up the very resources that are newly considered invalid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree the desired behavior should be leaving the pod pending till the requested resource showing up, which is #50658 is about. I think @mindprince is working on a change to resolve #50658 Scenario B. Agree it should be fine to leave the validation part out if both changes are merged in 1.11.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liggitt This is addressed.
one question about behavior of skewed kubelets with a pod that specifies this resource with overcommit, LGTM otherwise |
This removes the alpha Accelerators feature gate which was deprecated in 1.10. The alternative feature DevicePlugins went beta in 1.10.
5c85338
to
87dda33
Compare
API changes lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bsalamat, dashpole, dims, jiayingz, liggitt, mindprince, vishh The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Automatic merge from submit-queue (batch tested with PRs 61498, 62030). If you want to cherry-pick this change to another branch, please follow the instructions here. |
Support for it was removed in kubernetes/kubernetes#61498
The in-tree support to GPU is completely removed in release 1.11. This PR removes the related docs in release-1.11 branch. xref: kubernetes/kubernetes#61498
The in-tree support to GPU is completely removed in release 1.11. This PR removes the related docs in release-1.11 branch. xref: kubernetes/kubernetes#61498
The in-tree support to GPU is completely removed in release 1.11. This PR removes the related docs in release-1.11 branch. xref: kubernetes/kubernetes#61498
* Remove docs related to in-tree support to GPU The in-tree support to GPU is completely removed in release 1.11. This PR removes the related docs in release-1.11 branch. xref: kubernetes/kubernetes#61498 * Update content updated by PR to Hugo syntax Signed-off-by: Misty Stanley-Jones <[email protected]>
* Remove docs related to in-tree support to GPU The in-tree support to GPU is completely removed in release 1.11. This PR removes the related docs in release-1.11 branch. xref: kubernetes/kubernetes#61498 * Update content updated by PR to Hugo syntax Signed-off-by: Misty Stanley-Jones <[email protected]>
* Remove docs related to in-tree support to GPU The in-tree support to GPU is completely removed in release 1.11. This PR removes the related docs in release-1.11 branch. xref: kubernetes/kubernetes#61498 * Update content updated by PR to Hugo syntax Signed-off-by: Misty Stanley-Jones <[email protected]>
* Remove docs related to in-tree support to GPU The in-tree support to GPU is completely removed in release 1.11. This PR removes the related docs in release-1.11 branch. xref: kubernetes/kubernetes#61498 * Update content updated by PR to Hugo syntax Signed-off-by: Misty Stanley-Jones <[email protected]>
* Remove docs related to in-tree support to GPU The in-tree support to GPU is completely removed in release 1.11. This PR removes the related docs in release-1.11 branch. xref: kubernetes/kubernetes#61498 * Update content updated by PR to Hugo syntax Signed-off-by: Misty Stanley-Jones <[email protected]>
* Seperate priority and preemption (#8144) * Doc about PID pressure condition. (#8211) * Doc about PID pressure condition. Signed-off-by: Da K. Ma <[email protected]> * "so" -> "too" * Update version selector for 1.11 * StorageObjectInUseProtection is GA (#8291) * Feature gate: StorageObjectInUseProtection is GA Update feature gate reference for 1.11 * Trivial commit to re-trigger Netlify * CRIContainerLogRotation is Beta in 1.11 (#8665) * Seperate priority and preemption (#8144) * CRIContainerLogRotation is Beta in 1.11 xref: kubernetes/kubernetes#64046 * Bring StorageObjectInUseProtection feature to GA (#8159) * StorageObjectInUseProtection is GA (#8291) * Feature gate: StorageObjectInUseProtection is GA Update feature gate reference for 1.11 * Trivial commit to re-trigger Netlify * Bring StorageObjectInUseProtection feature to GA StorageObjectInUseProtection is Beta in K8s 1.10. It's brought to GA in K8s 1.11. * Fixed typo and added feature state tags. * Remove KUBE_API_VERSIONS doc (#8292) The support to the KUBER_API_VERSIONS environment variable is completely dropped (no deprecation). This PR removes the related doc in release-1.11. xref: kubernetes/kubernetes#63165 * Remove InitialResources from admission controllers (#8293) The feature (was experimental) is dropped in 1.11. xref: kubernetes/kubernetes#58784 * Remove docs related to in-tree support to GPU (#8294) * Remove docs related to in-tree support to GPU The in-tree support to GPU is completely removed in release 1.11. This PR removes the related docs in release-1.11 branch. xref: kubernetes/kubernetes#61498 * Update content updated by PR to Hugo syntax Signed-off-by: Misty Stanley-Jones <[email protected]> * Update the doc about extra volume in kubeadm config (#8453) Signed-off-by: Xianglin Gao <[email protected]> * Update CRD Subresources for 1.11 (#8519) * coredns: update notes in administer-cluster/coredns.md (#8697) CoreDNS is installed by default in 1.11. Add notes on how to install kube-dns instead. Update notes about CoreDNS->CoreDNS upgrades as in 1.11 the Corefile is retained. Add example on upgrading from kube-dns to CoreDNS. * kubeadm-alpha: CoreDNS related changes (#8727) Update note about CoreDNS feature gate. This change also updates a tab as a kubeadm sub-command will change. It looks for a new generated file: generated/kubeadm_alpha_phase_addon_coredns.md instead of: generated/kubeadm_alpha_phase_addon_kube-dns.md * Update cloud controller manager docs to beta 1.11 (#8756) * Update cloud controller manager docs to beta 1.11 * Use Hugo shortcode for feature state * kubeadm-upgrade: include new command `kubeadm upgrade diff` (#8617) Also: - Include note that this was added in 1.11. - Modify the note about upgrade guidance. * independent: update CoreDNS mentions for kubeadm (#8753) Give CoreDNS instead of kube-dns examples in: - docs/setup/independent/create-cluster-kubeadm.md - docs/setup/independent/troubleshooting-kubeadm.md * update 1.11 --server-print info (#8870) * update 1.11 --server-print info * Copyedit * Mark ExpandPersistentVolumes feature to beta (#8778) * Update version selector for 1.11 * Mark ExpandPersistentVolumes Beta xref: kubernetes/kubernetes#64288 * fix shortcode, add placeholder files to fix deploy failures (#8874) * declare ipvs ga (#8850) * kubeadm: update info about CoreDNS in kubeadm-init.md (#8728) Add info to install kube-dns instead of CoreDNS, as CoreDNS is the default DNS server in 1.11. Add notes that kubeadm config images can be used to list and pull the required images in 1.11. * kubeadm: update implementation-details.md about CoreDNS (#8829) - Replace examples from kube-dns to CoreDNS - Add notes about the CoreDNS feature gate status in 1.11 - Add note that the service name for CoreDNS is also called `kube-dns` * Update block device support for 1.11 (#8895) * Update block device support for 1.11 * Copyedits * Fix typo 'fiber channel' (#8957) Signed-off-by: Misty Stanley-Jones <[email protected]> * kubeadm-upgrade: add the 'node [config]' sub-command (#8960) - Add includes for the generated pages - Include placeholder generated pages * kubeadm-init: update the example for the MasterConfiguration (#8958) - include godocs link for MasterConfiguration - include example MasterConfiguration - add note that `kubeadm config print-default` can be used * kubeadm-config: include new commands (#8862) Add notes and includes for these new commands in 1.11: - kubeadm config print-default - kubeadm config migrate - kubeadm config images list - kubeadm config images pull Include placeholder generated files for the above. * administer-cluster/coredns: include more changes (#8985) It was requested that for this page a couple of methods should be outlined: - manual installation for CoreDNS explained at the Kubernetes section of the GitHub project for CoreDNS - installation and upgrade via kubeadm Make the above changes and also add a section "About CoreDNS". This commit also lowercases a section title. * Update CRD subresources doc for 1.11 (#8918) * Add docs for volume expansion and online resizing (#8896) * Add docs for volume expansion going beta * Copyedit * Address feedback * Update exec plugin docs with TLS credentials (#8826) * Update exec plugin docs with TLS credentials kubernetes/kubernetes#61803 implements TLS client credential support for 1.11. * Copyedit * More copyedits for clarification * Additional copyedit * Change token->credential * NodeRestriction admission prevents kubelet taint removal (#8911) * dns-custom-namerserver: break down the page into mutliple sections (#8900) * dns-custom-namerserver: break down the page into mutliple sections This page is currently about kube-dns and is a bit outdated. Introduce the heading `# Customizing kube-dns`. Introduce a separate section about CoreDNS. * Copyedits, fix headings for customizing DNS Hey Lubomir, I coypedited pretty heavily because this workflow is so much easier for docs and because I'm trying to help improve everything touching kubeadm as much as possible. But there's one outstanding issue wrt headings and intro content: you can't add a heading 1 to a topic to do what you wanted to do. The page title in the front matter is rendered as a heading 1 and everything else has to start at heading 2. (We still need to doc this better in the docs contributing content, I know.) Instead, I think we need to rewrite the top-of-page intro content to explain better the relationship between kube-dns and CoreDNS. I'm happy to write something, but I thought I'd push this commit first so you can see what I'm doing. Hope it's all clear -- ping here or on Slack with any questions ~ Jennifer * Interim fix for talking about CoreDNS * Fix CoreDNS details * PSP readOnly hostPath (#8898) * Add documentation for crictl (#8880) * Add documentation for crictl * Copyedit Signed-off-by: Misty Stanley-Jones <[email protected]> * Final copyedit * VolumeSubpathEnvExpansion alpha feature (#8835) * Note that Heapster is deprecated (#8827) * Note that Heapster is deprecated This notes that Heapster is deprecated, and migrates the relevant docs to talk about metrics-server or other solutions by default. * Copyedits and improvements Signed-off-by: Misty Stanley-Jones <[email protected]> * Address feedback * fix shortcode to troubleshoot deploy (#9057) * update dynamic kubelet config docs for v1.11 (#8766) * update dynamic kubelet config docs for v1.11 * Substantial copyedit * Address feedback * Reference doc for kubeadm (release-1.11) (#9044) * Reference doc for kubeadm (release-1.11) * fix shortcode to troubleshoot deploy (#9057) * Reference doc for kube-components (release-1.11) (#9045) * Reference doc for kube-components (release-1.11) * Update cloud-controller-manager.md * fix shortcode to troubleshoot deploy (#9057) * Documentation on lowercasing kubeadm init apiserver SANs (#9059) * Documentation on lowercasing kubeadm init apiserver SANs * fix shortcode to troubleshoot deploy (#9057) * Clarification in dynamic Kubelet config doc (#9061) * Promote sysctls to Beta (#8804) * Promote sysctls to Beta * Copyedits Signed-off-by: Misty Stanley-Jones <[email protected]> * Review comments * Address feedback * More feedback * kubectl reference docs for 1.11 (#9080) * Update Kubernetes API 1.11 ref docs (#8977) * Update v1alpha1 to v1beta1. * Adjust left nav for 1.11 ref docs. * Trim list of old ref docs. * Update Federation API ref docs for 1.11. (#9064) * Update Federation API ref docs for 1.11. * Add titles. * Update definitions.html * CRD versioning Public Documentation (#8834) * CRD versioning Public Documentation * Copyedit Signed-off-by: Misty Stanley-Jones <[email protected]> * Address feedback * More rewrites * Address feedback * Update main CRD page in light of versioning * Reorg CRD docs * Further reorg * Tweak title * CSI documentation update for raw block volume support (#8927) * CSI documetation update for raw block volume support * minor edits for "CSI raw block volume support" Some small grammar and style nits. * minor CSIBlockVolume edits * Update kubectl component ref page for 1.11. (#9094) * Update kubectl component ref page for 1.11. * Add title. Replace stevepe with username. * crd versioning doc: fix nits (#9142) * Update `DynamicKubeletConfig` feature to beta (#9110) xref: kubernetes/kubernetes#64275 * Documentation for dynamic volume limits based on node type (#8871) * add cos for storage limits * Update docs specific for aws and gce * fix some minor things * Update storage-limits.md * Add k8s version to feature-state shortcode * The Doc update for ScheduleDaemonSetPods (#8842) Signed-off-by: Da K. Ma <[email protected]> * Update docs related to PersistentVolumeLabel admission control (#9109) The said admission controller is disabled by default in 1.11 (kubernetes/kubernetes#64326) and scheduled to be removed in future release. * client exec auth: updates for 1.11 (#9154) * Updates HA kubeadm docs (#9066) * Updates HA kubeadm docs Signed-off-by: Chuck Ha <[email protected]> * kubeadm HA - Add stacked control plane steps * ssh instructions and some typos in the bash scripts Signed-off-by: Chuck Ha <[email protected]> * Fix typos and copypasta errors * Fix rebase issues * Integrate more changes Signed-off-by: Chuck Ha <[email protected]> * copyedits, layout and formatting fixes * final copyedits * Adds a sanity check for load balancer connection Signed-off-by: Chuck Ha <[email protected]> * formatting fixes, copyedits * fix typos, formatting * Document the Pod Ready++ feature (#9180) Closes: #9107 Xref: kubernetes/kubernetes#64057 * Mention 'KubeletPluginsWatcher' feature (#9177) * Mention 'KubeletPluginsWatcher' feature This feature is more developers oriented than users oriented, so simply mention it in the feature gate should be fine. In future, when the design doc is migrated from Google doc to the kubernetes/community repo, we can add links to it for users who want to dig deeper. Closes: #9108 Xref: kubernetes/kubernetes#63328, kubernetes/kubernetes#64605 * Copyedit * Amend dynamic volume list docs (#9181) The dynamic volume list feature has been documented but the feature gate related was not there yet. Closes: #9105 * Document for service account projection (#9182) This adds docs for the service account projection feature. Xref: kubernetes/kubernetes#63819, kubernetes/community#1973 Closes: #9102 * Update pod priority and preemption user docs (#9172) * Update pod priority and preemption user docs * Copyedit * Documentation on setting node name with Kubeadm (#8925) * Documentation on setting node name with Kubeadm * copyedit * Add kubeadm upgrade docs for 1.11 (#9089) * Add kubeadm upgrade docs for 1.11 * Initial docs review feedback * Add 1-11 to outline * Fix formatting on tab blocks * Move file to correct location * Add `kubeadm upgrade node config` step * Overzealous ediffing * copyedit, fix lists and headings * clarify --force flag for fixing bad state * Get TOML ready for 1.11 release * Blog post for 1.11 release (#9254) * Blog post for 1.11 release * Update 2018-06-26-kubernetes-1.11-release-announcement.md * Update 2018-06-26-kubernetes-1.11-release-announcement.md * Update 2018-06-26-kubernetes-1.11-release-announcement.md
This removes the alpha Accelerators feature gate which was deprecated in 1.10 (#57384).
The alternative feature DevicePlugins went beta in 1.10 (#60170).
Fixes #54012