
Fix broken gpu resource override when using pod templates #4925

Merged · 2 commits · Apr 26, 2024

Conversation

@fg91 (Member) commented Feb 21, 2024

Why are the changes needed?

from flytekit import Resources, task, workflow


@task(
    requests=Resources(cpu="11", mem="11Gi", gpu="11"),
)
def foo():
    ...


@workflow
def test_wf():
    foo().with_overrides(
        requests=Resources(cpu="13", mem="13Gi", gpu="13"),
        limits=Resources(cpu="13", mem="13Gi", gpu="13"),
    )

The resulting task pod - as expected - uses the following resources:

resources:
  limits:
    cpu: "13"
    memory: 13Gi
    nvidia.com/gpu: "13"
  requests:
    cpu: "13"
    ephemeral-storage: 2Gi
    memory: 13Gi
    nvidia.com/gpu: "13"

If I now add pod_template=PodTemplate(labels={"foo": "bar"}) to the task decorator, I would expect the same pod resources, but creating the pod actually fails:

E0221 13:50:14.805583  427772 workers.go:103] error syncing 'development/f454d2131ec484221b70': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [Invalid] failed to create resource, caused by: Pod "f454d2131ec484221b70-n0-0" is invalid: spec.containers[0].resources.requests: Invalid value: "11": must be equal to nvidia.com/gpu limit

This PR fixes this.

What changes were proposed in this pull request?

The bug happens as follows:

  • When not using a pod template, we create a container (from the example above) with cpu: 11, memory: 11Gi, and nvidia.com/gpu: 11, both as requests and as limits.
  • When using a pod template, however, we retrieve a pod spec from the task proto that sets cpu: 11, memory: 11Gi, and gpu: 11 (note: gpu, not nvidia.com/gpu!) as requests, and that has no limits at all. This pod spec was built on the flytekit side.

In the latter case, when merging the container resources with the resource overrides, we end up with the following requests:

cpu: 13
memory: 13Gi
nvidia.com/gpu: 13
gpu: 11  # This value should have been overridden but wasn't because of the wrong resource name
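The stale entry can be illustrated with a minimal Python sketch. This is a hypothetical simulation of the merge step, not propeller's actual Go code; merge_overrides and the variable names are illustrative:

```python
# Hypothetical simulation of the buggy merge: the overrides use the sanitized
# key "nvidia.com/gpu", while the pod-template requests still use the
# unsanitized key "gpu", so the stale entry survives the merge untouched.

def merge_overrides(requests: dict, overrides: dict) -> dict:
    """Merge resource overrides into the existing requests, key by key."""
    merged = dict(requests)
    merged.update(overrides)
    return merged

pod_template_requests = {"cpu": "11", "memory": "11Gi", "gpu": "11"}
resource_overrides = {"cpu": "13", "memory": "13Gi", "nvidia.com/gpu": "13"}

merged = merge_overrides(pod_template_requests, resource_overrides)
# merged keeps both "nvidia.com/gpu": "13" and the stale "gpu": "11"
```

Because the keys differ, the override never touches the "gpu" entry, which sets up the incorrect overwrite described next.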

Finally, during sanitization, we overwrite nvidia.com/gpu: 13 with the stale value 11 and delete the gpu key from the resource requirements.

(The actual error message above states that the gpu request and limit differ. The root cause is the stale gpu entry described above.)

Proposed fix:

I propose fixing this by sanitizing the wrong gpu resource name before merging the resource overrides.
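As a rough Python sketch of the idea, mirroring what the Go-side SanitizeGPUResourceRequirements does (the helper below is illustrative, not propeller's actual code):

```python
# Illustrative sketch: rename the unsanitized "gpu" key to "nvidia.com/gpu"
# BEFORE merging the overrides, so the override replaces the right entry.

NVIDIA_GPU = "nvidia.com/gpu"

def sanitize_gpu(resources: dict) -> dict:
    """Replace the bare "gpu" key with the sanitized k8s resource name."""
    sanitized = dict(resources)
    if "gpu" in sanitized:
        sanitized[NVIDIA_GPU] = sanitized.pop("gpu")
    return sanitized

# Sanitize first, then apply the overrides on top.
requests = sanitize_gpu({"cpu": "11", "memory": "11Gi", "gpu": "11"})
requests.update({"cpu": "13", "memory": "13Gi", NVIDIA_GPU: "13"})
# No stale "gpu" key remains; the override applies cleanly.
```

With the keys unified before the merge, the later sanitization step has nothing stale left to copy over the correct value.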

How was this patch tested?

I ran propeller with my proposed change. The resource overriding in the example above works as expected, also when using the pod template.

I still need to adapt the tests.

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.


codecov bot commented Feb 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.92%. Comparing base (22b0005) to head (bc68144).
Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4925      +/-   ##
==========================================
- Coverage   58.93%   58.92%   -0.02%     
==========================================
  Files         645      645              
  Lines       55414    55418       +4     
==========================================
- Hits        32661    32654       -7     
- Misses      20172    20181       +9     
- Partials     2581     2583       +2     
Flag        Coverage Δ
unittests   58.92% <100.00%> (-0.02%) ⬇️


@fg91 fg91 marked this pull request as ready for review February 23, 2024 17:39
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. bug Something isn't working labels Feb 23, 2024
@fg91 fg91 requested a review from hamersaw February 23, 2024 17:39
@hamersaw (Contributor) left a comment


tbh I'm having some difficulty following this through, which is probably an indicator that it is far too complex. Obviously this is a bug, but I'm concerned about breaking existing functionality. Do you think there are cases where moving the sanitization after adjusting would break? I'm going to have to dive much deeper and ask to get a few eyes here.

@@ -237,19 +237,37 @@ func TestApplyResourceOverrides_OverrideGpu(t *testing.T) {
 	gpuRequest := resource.MustParse("1")
 	overrides := ApplyResourceOverrides(v1.ResourceRequirements{
 		Requests: v1.ResourceList{
-			resourceGPU:       gpuRequest,
+			ResourceNvidiaGPU: gpuRequest,
Contributor


Is it safe to make this change? IIUC the arguments to ApplyResourceOverrides currently have "gpu" rather than "nvidia.com/gpu"?

@@ -95,6 +95,8 @@ func FlyteTaskToBatchInput(ctx context.Context, tCtx pluginCore.TaskExecutionCon
 	if platformResources == nil {
 		platformResources = &v1.ResourceRequirements{}
 	}
+
+	flytek8s.SanitizeGPUResourceRequirements(res)
Contributor


This function is called internally in the ApplyResourceOverrides below I think. Are we then sanitizing in both locations?

Member Author


In the current state of the PR, the sanitization no longer happens in ApplyResourceOverrides but has moved to the new SanitizeGPUResourceRequirements, which AddFlyteCustomizationsToContainer calls directly before ApplyResourceOverrides.
This is why I think SanitizeGPUResourceRequirements would have to be called here as well.

@fg91 (Member Author) commented Mar 1, 2024

tbh I'm having some difficulty following this through, which is probably an indicator that it is far too complex. Obviously this is a bug, but I'm concerned about breaking existing functionality. Do you think there are cases where moving the sanitization after adjusting would break? I'm going to have to dive much deeper and ask to get a few eyes here.

Hey @hamersaw,
thank you for taking a look. I understand your concerns and the difficulty of following this; it definitely took me a while to figure out what is happening here.

I first considered sanitizing the wrong "gpu" resource name in the pod template spec directly after unmarshalling it from the protobuf. But if I'm not overlooking something, the existing logic was supposed to do exactly that, only too late, which causes the GPU override issue. If we additionally patched the resource name directly after unmarshalling the pod template spec without touching the old logic, we'd leave the old logic behind as dead code. This is why I feel it would be better to confirm whether this change is safe.

We've been running a propeller version with this change in prod for a week now and haven't observed any issues, but I certainly don't claim 100% coverage of all scenarios. For instance, I cannot test AWS Batch.

Do you think there is somebody else who could judge this change as well?
I'd also be happy to jump on a quick call with you or somebody else to step through the master and modified code with a debugger while screensharing; that might make it easier to see what is happening.

@hamersaw (Contributor)

Really appreciate the patience here. I dove through this and think I understand everything now. What I'm not seeing is how the plain-container case works: neither case successfully overrides the gpu resource in my dev environment. Here's what I'm seeing:

When we call the ApplyResourceOverrides function, the container example has "nvidia.com/gpu" as the resource type because the ToK8sResourceRequirements function converts this internally. With the PodTemplate, by contrast, it will be "gpu".

Regardless, both are being updated to "nvidia.com/gpu", as you mentioned, because of this logic.

I'm seeing neither being updated by the overrides because, in the call to adjustResourceRequirement, the gpuResourceName is "nvidia.com/gpu" while the resources in platformResources use "gpu", so the override does not apply.

It would be good to understand where I'm diverging from what you saw here.

@hamersaw (Contributor)

Disregard everything above; I had my GPU platform limits set to 1 in the configuration, so it was lowering the value each time. Diving back in here.

@hamersaw (Contributor) left a comment


I think I've convinced myself that this is correct. The only surprising thing here is how long it took me to come to this conclusion. Great PR!

Basically, leaving dangling "gpu" resource names broke the overrides because we sometimes operated on sanitized names (i.e. "nvidia.com/gpu") and sometimes on unsanitized ones (i.e. "gpu"). So this PR sanitizes everything before any processing. It looks like we can guarantee platform resources are sanitized as well.

@fg91 (Member Author) commented Apr 26, 2024

The only surprising thing here is how long it took me to come to this conclusion. Great PR!

It definitely took me a while to come up with this fix as well; the logic is just very complicated :)

So this PR sanitizes everything before any processing. It looks like we can guarantee platform resources to be sanitized as well.

Yes, exactly 👌

@fg91 fg91 merged commit 9853abe into master Apr 26, 2024
48 of 49 checks passed
@fg91 fg91 deleted the fg91/fix/fix-broken-gpu-overrides-w-pod-template branch April 26, 2024 09:27
austin362667 pushed a commit to austin362667/flyte that referenced this pull request May 7, 2024

* Fix broken gpu resource override when using pod templates

Signed-off-by: Fabio Graetz <[email protected]>

* Adapt existing tests and add test that would have caught bug

Signed-off-by: Fabio Graetz <[email protected]>

---------

Signed-off-by: Fabio Graetz <[email protected]>