Fix broken gpu resource override when using pod templates #4925
Conversation
Signed-off-by: Fabio Graetz <[email protected]>
Codecov Report: all modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master    #4925      +/-   ##
==========================================
- Coverage   58.93%   58.92%    -0.02%
==========================================
  Files         645      645
  Lines       55414    55418        +4
==========================================
- Hits        32661    32654        -7
- Misses      20172    20181        +9
- Partials     2581     2583        +2
```
Signed-off-by: Fabio Graetz <[email protected]>
tbh I'm having some difficulty following this through, which is probably an indicator that it is far too complex. Obviously this is a bug, but I'm concerned about breaking existing functionality. Do you think there are cases where moving the sanitization after adjusting would break? I'm going to have to dive much deeper and get a few more eyes on this.
```diff
@@ -237,19 +237,37 @@ func TestApplyResourceOverrides_OverrideGpu(t *testing.T) {
 	gpuRequest := resource.MustParse("1")
 	overrides := ApplyResourceOverrides(v1.ResourceRequirements{
 		Requests: v1.ResourceList{
-			resourceGPU: gpuRequest,
+			ResourceNvidiaGPU: gpuRequest,
```
Is it safe to make this change? IIUC, the arguments to `ApplyResourceOverrides` will currently have "gpu" rather than "nvidia/gpu"?
```diff
@@ -95,6 +95,8 @@ func FlyteTaskToBatchInput(ctx context.Context, tCtx pluginCore.TaskExecutionCon
 	if platformResources == nil {
 		platformResources = &v1.ResourceRequirements{}
 	}
+
+	flytek8s.SanitizeGPUResourceRequirements(res)
```
This function is called internally in the `ApplyResourceOverrides` below, I think. Are we then sanitizing in both locations?
In the current state of the PR, the sanitization doesn't happen in `ApplyResourceOverrides` anymore but is moved to the new `SanitizeGPUResourceRequirements`, which is called in `AddFlyteCustomizationsToContainer` directly before calling `ApplyResourceOverrides`. This is why I think `SanitizeGPUResourceRequirements` would have to be called here as well.
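To make the ordering concrete, here is a rough sketch of the control flow being discussed. The function names come from this PR, but the bodies and signatures are simplified stand-ins, not the actual flyteplugins code:

```go
// Stand-in sketch of the call ordering discussed above (not the repo code).
package flytek8s

import v1 "k8s.io/api/core/v1"

// SanitizeGPUResourceRequirements renames any "gpu" key to "nvidia.com/gpu"
// in place, for both requests and limits.
func SanitizeGPUResourceRequirements(resources *v1.ResourceRequirements) {
	for _, list := range []v1.ResourceList{resources.Requests, resources.Limits} {
		if quantity, found := list["gpu"]; found {
			list["nvidia.com/gpu"] = quantity
			delete(list, "gpu")
		}
	}
}

// ApplyResourceOverrides no longer sanitizes internally; it now assumes its
// input already uses "nvidia.com/gpu". (Merge/override logic elided here.)
func ApplyResourceOverrides(resources v1.ResourceRequirements) v1.ResourceRequirements {
	return resources
}

// AddFlyteCustomizationsToContainer sanitizes first, then applies overrides.
// Any other caller of ApplyResourceOverrides, such as the AWS Batch plugin's
// FlyteTaskToBatchInput, must therefore sanitize on its own first as well.
func AddFlyteCustomizationsToContainer(container *v1.Container) {
	SanitizeGPUResourceRequirements(&container.Resources)
	container.Resources = ApplyResourceOverrides(container.Resources)
}
```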
Hey @hamersaw, I first considered sanitizing the wrong `gpu` resource name […]. We've been running a propeller version with this change in prod for a week now and haven't observed any issues, but I certainly don't claim that we have 100% coverage of all scenarios. For instance, I cannot test AWS Batch. Do you think there is somebody else who can judge this change as well?
Really appreciate the patience here. Dove through this, and I think I understand everything now. What I'm not seeing is how just the container works - neither case successfully overrides the gpu resource in my dev environment.

Here's what I'm seeing: when we call the `ApplyResourceOverrides` function, the container example has "nvidia/gpu" for the resource type because the `ToK8sResourceRequirements` function converts this internally. Alternatively, with the PodTemplate it will be "gpu". Regardless, both are being updated to "nvidia/gpu", as you mentioned, because of this logic. I'm seeing neither being updated by the overrides because in the call to `adjustResourceRequirement` the […].

It would be good to understand where I'm diverging from what you saw here.
Disregard everything ^^, I had my GPU platform limits set to 1 in the configuration so it was lowering the value each time. Diving back in here. |
I think I've convinced myself that this is correct. The only surprising thing here is how long it took me to come to this conclusion. Great PR!
Basically, leaving dangling "gpu" resource names caused overrides to misfire because sometimes we worked on the sanitized names (i.e. "nvidia/gpu") and other times on the non-sanitized ones (i.e. "gpu"). So this PR sanitizes everything before any processing. It looks like we can guarantee that platform resources are sanitized as well.
Definitely took me a while to come up with this fix as well; the logic is just very complicated :)
Yes, exactly 👌
* Fix broken gpu resource override when using pod templates

Signed-off-by: Fabio Graetz <[email protected]>

* Adapt existing tests and add test that would have caught bug

Signed-off-by: Fabio Graetz <[email protected]>

---------

Signed-off-by: Fabio Graetz <[email protected]>
Why are the changes needed?
The resulting task pod - as expected - uses the requested resources. If I now add `pod_template=PodTemplate(labels={"foo": "bar"})` to the task decorator, I would expect the same pod resources, but the creation of the pod actually fails. This PR fixes this.
What changes were proposed in this pull request?
The bug happens as follows:

- Without a pod template, the task pod's container gets `cpu: 11`, `memory: 11Gi`, and `nvidia.com/gpu: 11` (both for requests and limits).
- With a pod template, the container instead has `cpu: 11`, `memory: 11Gi`, and `gpu: 11` (not `nvidia.com/gpu`!) as requests, and it doesn't have limits. This pod spec was built on the flytekit side here.
- In the latter case, when merging the container resources with the resource overrides here, we end up with requests containing both the stray `gpu: 11` key and `nvidia.com/gpu: 13`.
- Finally, here, we override `nvidia.com/gpu: 13` with the wrong value `11` and delete the `gpu` key from the resource requirements (see the sketch after this list).

(The actual error message above states that the gpu request and limit are not the same. The root cause for this is what I described here.)
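A self-contained sketch of that clobbering step, using the hypothetical values from the example above; it mirrors the behavior described, not the repo's exact code:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Reconstruction of the merged requests described above: the unsanitized
	// "gpu" key leaks in from the flytekit-built pod spec, while the resource
	// overrides contribute "nvidia.com/gpu: 13".
	requests := v1.ResourceList{
		v1.ResourceCPU:    resource.MustParse("11"),
		v1.ResourceMemory: resource.MustParse("11Gi"),
		"gpu":             resource.MustParse("11"),
		"nvidia.com/gpu":  resource.MustParse("13"),
	}

	// The old sanitization ran *after* this merge: it copied the stray "gpu"
	// value onto "nvidia.com/gpu" and deleted the "gpu" key, silently
	// clobbering the override.
	if gpuQuantity, found := requests["gpu"]; found {
		requests["nvidia.com/gpu"] = gpuQuantity
		delete(requests, "gpu")
	}

	gpu := requests["nvidia.com/gpu"]
	fmt.Println(gpu.String()) // prints "11" - the intended override (13) is lost
}
```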
Proposed fix:
I propose to fix this by cleaning up the wrong gpu resource name before merging the resource overrides.
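A minimal sketch of that idea with the same example numbers; the helper below is a stand-in for the PR's `SanitizeGPUResourceRequirements`, and the merge loop is a simplification of the real override logic:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// sanitizeGPU renames the stray "gpu" key to "nvidia.com/gpu" in place,
// mirroring what the PR's sanitization does for one resource list.
func sanitizeGPU(list v1.ResourceList) {
	if quantity, found := list["gpu"]; found {
		list["nvidia.com/gpu"] = quantity
		delete(list, "gpu")
	}
}

func main() {
	// Requests as built on the flytekit side when a pod template is used.
	requests := v1.ResourceList{
		v1.ResourceCPU:    resource.MustParse("11"),
		v1.ResourceMemory: resource.MustParse("11Gi"),
		"gpu":             resource.MustParse("11"),
	}
	overrides := v1.ResourceList{"nvidia.com/gpu": resource.MustParse("13")}

	// Fix: sanitize *before* merging, so the override wins on a single key.
	sanitizeGPU(requests)
	for name, quantity := range overrides {
		requests[name] = quantity
	}

	gpu := requests["nvidia.com/gpu"]
	fmt.Println(gpu.String()) // prints "13" - the override survives
}
```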
How was this patch tested?
I ran propeller with my proposed change. The resource overriding in the example above works as expected, including when using the pod template. I will still adapt the tests.