Account GPUs as list of GPU IDs for resource accounting #3852
Summary
In the current implementation of task resource accounting in ECS Agent, GPUs are accounted for by counting the number of consumed GPUs on the host at any given time. However, in the ACS payload the Agent receives, the Agent gets the exact GPU IDs on the instance on which to schedule the containers. Consider a task running on a multi-GPU machine and using a GPU (say gpu1). If the task enters the stopping state (desiredStatus=stopped) via an ACS StopTask, gpu1 is released by the ECS backend and a new task can be scheduled on gpu1. If the Agent accounts for GPUs only by count, it can start the new task on gpu1 while the first task still holds it, since the number of available GPUs can be >= 1 on a multi-GPU machine. This may cause GPU OOM issues in the application when both tasks end up running on the same GPU at the same time. To fix this, when accounting a task's resources in the host resource manager, GPUs must be accounted for as the list of individual GPU IDs the task consumes or will consume, as illustrated in the sketch below.
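To make the failure mode concrete, here is a minimal, self-contained Go sketch (not the Agent's actual code; hostGPUs, canStartByCount, and canStartByID are hypothetical names) contrasting a count-based availability check with a check on the specific GPU IDs assigned by the backend:

```go
package main

import "fmt"

// hostGPUs is a hypothetical illustration of a multi-GPU host with two ways
// of tracking GPU usage: a plain count and the set of in-use GPU IDs.
type hostGPUs struct {
	totalGPUs int             // count-based view of the host
	inUse     map[string]bool // ID-based view: GPU IDs currently held by tasks
}

// canStartByCount only asks "is the number of free GPUs large enough?"
func (h *hostGPUs) canStartByCount(numRequested int) bool {
	return h.totalGPUs-len(h.inUse) >= numRequested
}

// canStartByID asks "are the specific GPU IDs assigned by the backend free?"
func (h *hostGPUs) canStartByID(requestedIDs []string) bool {
	for _, id := range requestedIDs {
		if h.inUse[id] {
			// This exact GPU is still held, e.g. by a task that is stopping.
			return false
		}
	}
	return true
}

func main() {
	// A 4-GPU host where task T1 is stopping but still physically holds gpu1.
	h := &hostGPUs{totalGPUs: 4, inUse: map[string]bool{"gpu1": true}}

	// The backend has already released gpu1 and assigned it to a new task T2.
	fmt.Println(h.canStartByCount(1))             // true: the count says a GPU is free
	fmt.Println(h.canStartByID([]string{"gpu1"})) // false: gpu1 itself is still in use
}
```

With the count-based check, the stopping task's GPU still appears available as long as any GPU on the host is free; the ID-based check only admits the new task once gpu1 itself has been released.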
Implementation details
agent/engine/host_resource_manager.go
- Changed the resource type of GPU from INTEGER to STRINGSET
agent/api/task/task.go
- Changed the resource type of GPU from INTEGER to STRINGSET for each task's resources
agent/app/agent.go
- When initializing host_resource_manager, initialize it with the list of host GPU IDs instead of a count (see the sketch after this list)
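For illustration, here is a minimal, self-contained Go sketch of the direction of this change; the type and function names (gpuResourceManager, consume, release) are simplified stand-ins, not the Agent's actual host_resource_manager API. The host is initialized with its GPU IDs, and tasks consume and release specific IDs rather than adjusting a counter:

```go
package main

import (
	"fmt"
	"sync"
)

// gpuResourceManager is an illustrative STRINGSET-style tracker: the host is
// initialized with its GPU IDs, and each task consumes/releases specific IDs.
type gpuResourceManager struct {
	mu          sync.Mutex
	hostGPUIDs  map[string]bool // all GPU IDs present on the host
	consumedIDs map[string]bool // GPU IDs currently assigned to tasks
}

// newGPUResourceManager mirrors initializing host_resource_manager with the
// list of host GPU IDs instead of a count.
func newGPUResourceManager(hostGPUIDs []string) *gpuResourceManager {
	m := &gpuResourceManager{
		hostGPUIDs:  make(map[string]bool, len(hostGPUIDs)),
		consumedIDs: make(map[string]bool),
	}
	for _, id := range hostGPUIDs {
		m.hostGPUIDs[id] = true
	}
	return m
}

// consume succeeds only if every requested GPU ID exists on the host and is
// not already held by another (possibly still-stopping) task.
func (m *gpuResourceManager) consume(taskGPUIDs []string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, id := range taskGPUIDs {
		if !m.hostGPUIDs[id] || m.consumedIDs[id] {
			return false
		}
	}
	for _, id := range taskGPUIDs {
		m.consumedIDs[id] = true
	}
	return true
}

// release frees the task's GPU IDs once the task has actually stopped.
func (m *gpuResourceManager) release(taskGPUIDs []string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, id := range taskGPUIDs {
		delete(m.consumedIDs, id)
	}
}

func main() {
	m := newGPUResourceManager([]string{"GPU-aaaa", "GPU-bbbb"})
	fmt.Println(m.consume([]string{"GPU-aaaa"})) // true: T1 takes GPU-aaaa
	fmt.Println(m.consume([]string{"GPU-aaaa"})) // false: T2 must wait (stays PENDING)
	m.release([]string{"GPU-aaaa"})              // T1 stops and releases its GPU
	fmt.Println(m.consume([]string{"GPU-aaaa"})) // true: T2 can now start
}
```

In this shape, a second task requesting an already-held GPU ID cannot be admitted until the first task actually stops and releases it, which matches the PENDING behavior in the logs below.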
Testing
Redacted Agent debug logs for the 2nd case from the manual testing above:
T1(e2332bb033774dbbbc7fc5b9566dd229) on (GPU-72371e2b-be81-46d5-d4b4-e3406546ae6b):
T2(1fa908d7dd6449878e0f9a1efa31764b) on the same GPU (GPU-72371e2b-be81-46d5-d4b4-e3406546ae6b); the task stays in PENDING until the first task stops:
After the first task stops and the GPU is freed, T2 starts up:
New tests cover the changes:
Yes
Description for the changelog
Bug - Fixed GPU resource accounting by maintaining a list of used GPU IDs instead of a count of used GPUs, to prevent possible GPU OOM in some situations
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.