[DRAFT] Nvidia Settings API changes #4125

monirul · 2024-08-03T02:16:31Z

Issue number:

Closes #

Description of changes:
This PR introduces a new settings API for NVIDIA GPUs on the Kubernetes NVIDIA variants.

The new settings are:

| Bottlerocket setting | Impact | Value |
| --- | --- | --- |
| `settings.nvidia-container-runtime.visible-devices-as-volume-mounts` | Sets the `accept-nvidia-visible-devices-as-volume-mounts` value of the NVIDIA container toolkit on k8s variants | `true` \| `false` (default: `true`) |
| `settings.nvidia-container-runtime.visible-devices-envvar-when-unprivileged` | Sets the `accept-nvidia-visible-devices-envvar-when-unprivileged` setting of the NVIDIA container runtime on k8s variants | `true` \| `false` (default: `false`) |
| `settings.kubernetes.device-plugins.nvidia.pass-device-specs` | Sets the `pass-device-specs` setting of the device plugin, which passes the list of DeviceSpecs to the kubelet on Allocate | `true` \| `false` (default: `true`) |
| `settings.kubernetes.device-plugins.nvidia.device-id-strategy` | Sets the `device-id-strategy` setting of the device plugin, which specifies how GPUs are identified and selected for workloads running in a Kubernetes cluster | `uuid` \| `index` (default: `index`) |
| `settings.kubernetes.device-plugins.nvidia.device-list-strategy` | Sets the `device-list-strategy` setting of the NVIDIA Kubernetes device plugin, which configures how GPUs are listed and allocated to pods in a Kubernetes cluster | `envvar` \| `volume-mounts` (default: `volume-mounts`) |
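
To make the accepted values and defaults in the table concrete, here is a minimal Rust sketch of how the three device plugin settings could be modeled and validated. The type names (`DeviceIdStrategy`, `DeviceListStrategy`, `NvidiaDevicePluginSettings`) are hypothetical and are not taken from this PR's settings models; only the setting names, allowed values, and defaults come from the table.

```rust
use std::str::FromStr;

/// Hypothetical model of `device-id-strategy` (`uuid` | `index`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeviceIdStrategy {
    Uuid,
    Index,
}

impl FromStr for DeviceIdStrategy {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "uuid" => Ok(Self::Uuid),
            "index" => Ok(Self::Index),
            other => Err(format!("invalid device-id-strategy: {other}")),
        }
    }
}

/// Hypothetical model of `device-list-strategy` (`envvar` | `volume-mounts`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeviceListStrategy {
    Envvar,
    VolumeMounts,
}

impl FromStr for DeviceListStrategy {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "envvar" => Ok(Self::Envvar),
            "volume-mounts" => Ok(Self::VolumeMounts),
            other => Err(format!("invalid device-list-strategy: {other}")),
        }
    }
}

/// Hypothetical container for `settings.kubernetes.device-plugins.nvidia.*`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NvidiaDevicePluginSettings {
    pub pass_device_specs: bool,
    pub device_id_strategy: DeviceIdStrategy,
    pub device_list_strategy: DeviceListStrategy,
}

impl Default for NvidiaDevicePluginSettings {
    /// Defaults as listed in the table in the PR description.
    fn default() -> Self {
        Self {
            pass_device_specs: true,
            device_id_strategy: DeviceIdStrategy::Index,
            device_list_strategy: DeviceListStrategy::VolumeMounts,
        }
    }
}
```

Parsing through `FromStr` rejects any value outside the documented set, so a typo such as `volume-mount` would surface as an error rather than being passed through to the device plugin.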

Testing done:
Yes.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

bcressey · 2024-08-05T22:03:04Z

Release.toml

@@ -1,4 +1,4 @@
-version = "1.21.0"
+version = "1.21.1"


Typically we would not add new settings to the API in a point release, so these changes should target 1.22. But don't do the release version bump in a feature PR; it's not really related to your feature and creates some churn.

Twoliter.toml

sources/shared-defaults/nvidia-k8s-device-plugin.toml

sources/settings-migrations/v1.21.1/nvidia-k8s-device-plugin-settings/src/main.rs

sources/settings-migrations/v1.21.1/nvidia-k8s-device-plugin-metadata/src/main.rs

sources/settings-migrations/v1.21.1/nvidia-container-runtime-settings/src/main.rs

sources/settings-migrations/v1.21.1/nvidia-k8s-device-plugin-settings/src/main.rs

packages/settings-plugins/settings-plugins.spec

sources/settings-plugins/aws-k8s-nvidia/Cargo.toml

bcressey

Looks good apart from the parts that need to be reverted or cleaned up.

It'd be good to test a non-nvidia aws-k8s variant to confirm that the device plugin settings aren't recognized, which would indicate that the feature flag wasn't used at build time.

Release.toml

bcressey · 2024-08-08T21:43:01Z

packages/settings-plugins/settings-plugins.spec

+  -p settings-plugin-aws-k8s-nvidia \
+  %{nil}
+
+


remove one of the two newlines:

Suggested change

bcressey · 2024-08-08T21:43:36Z

packages/settings-plugins/settings-plugins.spec

+
+%description aws-k8s-nvidia
+%{summary}.
+


avoid adding unnecessary whitespace:

Suggested change

cbgbt · 2024-08-09T18:51:12Z

I spoke with @monirul yesterday about an idea to programmatically verify that feature unification has not taken place. Since we need to do a settings-sdk release for the new models anyway, I think it would be a good idea to make the requisite changes there too.

The basic idea is:

- Add conditionally compiled const booleans for the enabled feature.
- Statically assert in the settings models that those flags are as expected.
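
A minimal sketch of that idea, assuming a Cargo feature named `nvidia-device-plugin` (a hypothetical name; the actual feature and constant names in the settings SDK may differ). The constant is defined once per `cfg` branch, and a const-evaluated `assert!` fails the build if the flag does not match what the model expects.

```rust
// Hypothetical feature name; exactly one of these two definitions is compiled.
#[cfg(feature = "nvidia-device-plugin")]
pub const NVIDIA_DEVICE_PLUGIN_ENABLED: bool = true;
#[cfg(not(feature = "nvidia-device-plugin"))]
pub const NVIDIA_DEVICE_PLUGIN_ENABLED: bool = false;

// What this particular settings model expects. A non-NVIDIA variant would
// expect `false`; an NVIDIA variant would expect `true`. (Assumed constant.)
pub const EXPECTED_NVIDIA_DEVICE_PLUGIN: bool = false;

// Evaluated at compile time: if feature unification switched the feature on
// for a model that expects it off, the build fails here instead of silently
// exposing the extra settings.
const _: () = assert!(
    NVIDIA_DEVICE_PLUGIN_ENABLED == EXPECTED_NVIDIA_DEVICE_PLUGIN,
    "unexpected state for the nvidia-device-plugin feature"
);
```

Because the check runs during compilation, it complements rather than replaces testing a non-NVIDIA variant at runtime, as suggested earlier in the review.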

arnaldo2792 · 2024-09-18T02:11:43Z

This was superseded by #4182.

monirul requested review from bcressey and cbgbt August 3, 2024 02:19

monirul mentioned this pull request Aug 5, 2024

Allow changes to the NVIDIA device plugin configurations #2347

Open

monirul marked this pull request as draft August 5, 2024 22:12

bcressey reviewed Aug 5, 2024

monirul force-pushed the nvidia-api-kit branch 5 times, most recently from 4b97b65 to 20f5ffc Compare August 8, 2024 20:23

bcressey reviewed Aug 8, 2024

monirul force-pushed the nvidia-api-kit branch from 20f5ffc to 6c7ab5f Compare August 14, 2024 18:59

Nvidia Settings API changes

8adc19c

monirul force-pushed the nvidia-api-kit branch from 6c7ab5f to 8adc19c Compare August 14, 2024 19:30

bcressey mentioned this pull request Aug 19, 2024

variants: add k8s-1.31 variants boilerplate #4142

Merged

ginglis13 mentioned this pull request Aug 30, 2024

v1.22.0 ⛰️ Tracking Issue #4170

Closed

5 tasks

arnaldo2792 closed this Sep 18, 2024
