
Fix broken gpu resource override when using pod templates #4925

Merged · 2 commits · Apr 26, 2024

Conversation

@fg91 (Member) commented Feb 21, 2024

Why are the changes needed?

from flytekit import Resources, task, workflow


@task(
    requests=Resources(cpu="11", mem="11Gi", gpu="11"),
)
def foo():
    ...


@workflow
def test_wf():
    foo().with_overrides(
        requests=Resources(cpu="13", mem="13Gi", gpu="13"),
        limits=Resources(cpu="13", mem="13Gi", gpu="13"),
    )

The resulting task pod - as expected - uses the following resources:

resources:
  limits:
    cpu: "13"
    memory: 13Gi
    nvidia.com/gpu: "13"
  requests:
    cpu: "13"
    ephemeral-storage: 2Gi
    memory: 13Gi
    nvidia.com/gpu: "13"

If I now add pod_template=PodTemplate(labels={"foo": "bar"}) to the task decorator, I would expect the same pod resources, but creating the pod actually fails:

E0221 13:50:14.805583  427772 workers.go:103] error syncing 'development/f454d2131ec484221b70': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [Invalid] failed to create resource, caused by: Pod "f454d2131ec484221b70-n0-0" is invalid: spec.containers[0].resources.requests: Invalid value: "11": must be equal to nvidia.com/gpu limit

This PR fixes this.

What changes were proposed in this pull request?

The bug happens as follows:

  • When not using a pod template, we create a container (from the example above) with cpu: 11, memory: 11Gi, and nvidia.com/gpu: 11, both as requests and as limits.
  • When using a pod template, however, we retrieve a pod spec from the task proto that sets cpu: 11, memory: 11Gi, and gpu: 11 (note: gpu, not nvidia.com/gpu!) as requests, and that has no limits at all. This pod spec was built on the flytekit side.

In the latter case, when merging the container resources with the resource overrides, we end up with the following requests:

cpu: 13
memory: 13Gi
nvidia.com/gpu: 13
gpu: 11  # This value should have been overridden but wasn't because of the wrong resource name
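The stale entry can be illustrated with a minimal Python sketch. This is a hypothetical simulation of the merge step, not propeller's actual Go code; merge_overrides and the variable names are illustrative:

```python
# Hypothetical simulation of the buggy merge: the overrides use the sanitized
# key "nvidia.com/gpu", while the pod-template requests still use the
# unsanitized key "gpu", so the stale entry survives the merge untouched.

def merge_overrides(requests: dict, overrides: dict) -> dict:
    """Merge resource overrides into the existing requests, key by key."""
    merged = dict(requests)
    merged.update(overrides)
    return merged

pod_template_requests = {"cpu": "11", "memory": "11Gi", "gpu": "11"}
resource_overrides = {"cpu": "13", "memory": "13Gi", "nvidia.com/gpu": "13"}

merged = merge_overrides(pod_template_requests, resource_overrides)
# merged keeps both "nvidia.com/gpu": "13" and the stale "gpu": "11"
```

Because the keys differ, the override never touches the "gpu" entry, which sets up the incorrect overwrite described next.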

Finally, during sanitization, we overwrite nvidia.com/gpu: 13 with the stale value 11 and delete the gpu key from the resource requirements.

(The actual error message above states that the gpu request and limit differ. The root cause is the stale gpu entry described above.)

Proposed fix:

I propose fixing this by sanitizing the wrong gpu resource name before merging the resource overrides.
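As a rough Python sketch of the idea, mirroring what the Go-side SanitizeGPUResourceRequirements does (the helper below is illustrative, not propeller's actual code):

```python
# Illustrative sketch: rename the unsanitized "gpu" key to "nvidia.com/gpu"
# BEFORE merging the overrides, so the override replaces the right entry.

NVIDIA_GPU = "nvidia.com/gpu"

def sanitize_gpu(resources: dict) -> dict:
    """Replace the bare "gpu" key with the sanitized k8s resource name."""
    sanitized = dict(resources)
    if "gpu" in sanitized:
        sanitized[NVIDIA_GPU] = sanitized.pop("gpu")
    return sanitized

# Sanitize first, then apply the overrides on top.
requests = sanitize_gpu({"cpu": "11", "memory": "11Gi", "gpu": "11"})
requests.update({"cpu": "13", "memory": "13Gi", NVIDIA_GPU: "13"})
# No stale "gpu" key remains; the override applies cleanly.
```

With the keys unified before the merge, the later sanitization step has nothing stale left to copy over the correct value.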

How was this patch tested?

I ran propeller with my proposed change. The resource overriding in the example above works as expected, also when using the pod template.

I still need to adapt the tests.

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.


codecov bot commented Feb 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.92%. Comparing base (22b0005) to head (bc68144).
Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4925      +/-   ##
==========================================
- Coverage   58.93%   58.92%   -0.02%     
==========================================
  Files         645      645              
  Lines       55414    55418       +4     
==========================================
- Hits        32661    32654       -7     
- Misses      20172    20181       +9     
- Partials     2581     2583       +2     
Flag        Coverage Δ
unittests   58.92% <100.00%> (-0.02%) ⬇️


@fg91 fg91 marked this pull request as ready for review February 23, 2024 17:39
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. bug Something isn't working labels Feb 23, 2024
@fg91 fg91 requested a review from hamersaw February 23, 2024 17:39
@hamersaw (Contributor) left a comment


tbh I'm having some difficulty following this through, which is probably an indicator that it is far too complex. Obviously this is a bug, but I'm concerned about breaking existing functionality. Do you think there are cases where moving the sanitization after adjusting would break? I'm going to have to dive much deeper and ask to get a few eyes here.

@@ -237,19 +237,37 @@ func TestApplyResourceOverrides_OverrideGpu(t *testing.T) {
 	gpuRequest := resource.MustParse("1")
 	overrides := ApplyResourceOverrides(v1.ResourceRequirements{
 		Requests: v1.ResourceList{
-			resourceGPU:       gpuRequest,
+			ResourceNvidiaGPU: gpuRequest,
Contributor


Is it safe to make this change? IIUC the arguments to ApplyResourceOverrides currently have "gpu" rather than "nvidia.com/gpu"?

@@ -95,6 +95,8 @@ func FlyteTaskToBatchInput(ctx context.Context, tCtx pluginCore.TaskExecutionCon
 	if platformResources == nil {
 		platformResources = &v1.ResourceRequirements{}
 	}
+
+	flytek8s.SanitizeGPUResourceRequirements(res)
Contributor


This function is called internally in the ApplyResourceOverrides below I think. Are we then sanitizing in both locations?

Member Author


In the current state of the PR, the sanitization no longer happens in ApplyResourceOverrides but has moved to the new SanitizeGPUResourceRequirements, which AddFlyteCustomizationsToContainer calls directly before ApplyResourceOverrides.
This is why I think SanitizeGPUResourceRequirements would have to be called here as well.

@fg91 (Member Author) commented Mar 1, 2024

tbh I'm having some difficulty following this through, which is probably an indicator that it is far too complex. Obviously this is a bug, but I'm concerned about breaking existing functionality. Do you think there are cases where moving the sanitization after adjusting would break? I'm going to have to dive much deeper and ask to get a few eyes here.

Hey @hamersaw,
thank you for taking a look. I understand your concerns and the difficulty of following this; it definitely took me a while to figure out what is happening here.

I first considered sanitizing the wrong "gpu" resource name in the pod template spec directly after unmarshalling it from the protobuf. But if I'm not overlooking something, the existing logic was supposed to do exactly that, only too late, which causes the GPU override issue. If we additionally patched the resource name directly after unmarshalling the pod template spec without touching the old logic, we'd leave the old logic behind as dead code. This is why I feel it would be better to confirm whether this change is safe.

We've been running a propeller version with this change in prod for a week now and haven't observed any issues, but I certainly don't claim 100% coverage of all scenarios. For instance, I cannot test AWS Batch.

Do you think there is somebody else who could judge this change as well?
I'd also be happy to jump on a quick call with you or somebody else to step through the master and modified code with a debugger while screensharing; that might make it easier to see what is happening.

@hamersaw (Contributor)

Really appreciate the patience here. I dove through this and think I understand everything now. What I'm not seeing is how the plain-container case works: neither case successfully overrides the gpu resource in my dev environment. Here's what I'm seeing:

When we call the ApplyResourceOverrides function, the container example has "nvidia.com/gpu" as the resource type because the ToK8sResourceRequirements function converts this internally. With the PodTemplate, by contrast, it will be "gpu".

Regardless, both are being updated to "nvidia.com/gpu", as you mentioned, because of this logic.

I'm seeing neither being updated by the overrides because, in the call to adjustResourceRequirement, the gpuResourceName is "nvidia.com/gpu" while the resources in platformResources use "gpu", so the override does not apply.

It would be good to understand where I'm diverging from what you saw here.

@hamersaw (Contributor)

Disregard everything above; I had my GPU platform limits set to 1 in the configuration, so it was lowering the value each time. Diving back in here.

@hamersaw (Contributor) left a comment


I think I've convinced myself that this is correct. The only surprising thing here is how long it took me to come to this conclusion. Great PR!

Basically, leaving dangling "gpu" resource names broke the overrides because we sometimes operated on sanitized names (i.e. "nvidia.com/gpu") and sometimes on unsanitized ones (i.e. "gpu"). So this PR sanitizes everything before any processing. It looks like we can guarantee platform resources are sanitized as well.

@fg91 (Member Author) commented Apr 26, 2024

The only surprising thing here is how long it took me to come to this conclusion. Great PR!

It definitely took me a while to come up with this fix as well; the logic is just very complicated :)

So this PR sanitizes everything before any processing. It looks like we can guarantee platform resources to be sanitized as well.

Yes, exactly 👌

@fg91 fg91 merged commit 9853abe into master Apr 26, 2024
48 of 49 checks passed
@fg91 fg91 deleted the fg91/fix/fix-broken-gpu-overrides-w-pod-template branch April 26, 2024 09:27
austin362667 pushed a commit to austin362667/flyte that referenced this pull request May 7, 2024

* Fix broken gpu resource override when using pod templates

Signed-off-by: Fabio Graetz <[email protected]>

* Adapt existing tests and add test that would have caught bug

Signed-off-by: Fabio Graetz <[email protected]>

---------

Signed-off-by: Fabio Graetz <[email protected]>