Account GPUs as list of GPU IDs for resource accounting #3852
Summary
In the current implementation of task resource accounting in ECS Agent, GPUs are accounted for by counting the number of consumed GPUs on the host at any given time. However, in the ACS payload the Agent receives, the Agent gets the exact GPU IDs on the instance on which to schedule the containers. Consider a task running on a multi-GPU machine and using a GPU (say gpu1). If the task enters the stopping state (desiredStatus=stopped) via an ACS StopTask, gpu1 is released by the ECS backend and a new task can be scheduled on gpu1. If the Agent accounts for GPUs only by count, it can start the new task on gpu1 while the first task still holds it, since the number of available GPUs can be >= 1 on a multi-GPU machine. This may cause GPU OOM issues in the application when both tasks end up running on the same GPU at the same time. To fix this, when accounting a task's resources in the host resource manager, GPUs must be accounted for as the list of individual GPU IDs the task consumes or will consume, as illustrated in the sketch below.
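To make the failure mode concrete, here is a minimal, self-contained Go sketch (not the Agent's actual code; hostGPUs, canStartByCount, and canStartByID are hypothetical names) contrasting a count-based availability check with a check on the specific GPU IDs assigned by the backend:

```go
package main

import "fmt"

// hostGPUs is a hypothetical illustration of a multi-GPU host with two ways
// of tracking GPU usage: a plain count and the set of in-use GPU IDs.
type hostGPUs struct {
	totalGPUs int             // count-based view of the host
	inUse     map[string]bool // ID-based view: GPU IDs currently held by tasks
}

// canStartByCount only asks "is the number of free GPUs large enough?"
func (h *hostGPUs) canStartByCount(numRequested int) bool {
	return h.totalGPUs-len(h.inUse) >= numRequested
}

// canStartByID asks "are the specific GPU IDs assigned by the backend free?"
func (h *hostGPUs) canStartByID(requestedIDs []string) bool {
	for _, id := range requestedIDs {
		if h.inUse[id] {
			// This exact GPU is still held, e.g. by a task that is stopping.
			return false
		}
	}
	return true
}

func main() {
	// A 4-GPU host where task T1 is stopping but still physically holds gpu1.
	h := &hostGPUs{totalGPUs: 4, inUse: map[string]bool{"gpu1": true}}

	// The backend has already released gpu1 and assigned it to a new task T2.
	fmt.Println(h.canStartByCount(1))             // true: the count says a GPU is free
	fmt.Println(h.canStartByID([]string{"gpu1"})) // false: gpu1 itself is still in use
}
```

With the count-based check, the stopping task's GPU still appears available as long as any GPU on the host is free; the ID-based check only admits the new task once gpu1 itself has been released.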
Implementation details
agent/engine/host_resource_manager.go
- Changed the resource type of GPU from INTEGER to STRINGSET
agent/api/task/task.go
- Changed the resource type of GPU from INTEGER to STRINGSET for each task's resources
agent/app/agent.go
- When initializing host_resource_manager, initialize it with the list of host GPU IDs instead of a count (see the sketch after this list)
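For illustration, here is a minimal, self-contained Go sketch of the direction of this change; the type and function names (gpuResourceManager, consume, release) are simplified stand-ins, not the Agent's actual host_resource_manager API. The host is initialized with its GPU IDs, and tasks consume and release specific IDs rather than adjusting a counter:

```go
package main

import (
	"fmt"
	"sync"
)

// gpuResourceManager is an illustrative STRINGSET-style tracker: the host is
// initialized with its GPU IDs, and each task consumes/releases specific IDs.
type gpuResourceManager struct {
	mu          sync.Mutex
	hostGPUIDs  map[string]bool // all GPU IDs present on the host
	consumedIDs map[string]bool // GPU IDs currently assigned to tasks
}

// newGPUResourceManager mirrors initializing host_resource_manager with the
// list of host GPU IDs instead of a count.
func newGPUResourceManager(hostGPUIDs []string) *gpuResourceManager {
	m := &gpuResourceManager{
		hostGPUIDs:  make(map[string]bool, len(hostGPUIDs)),
		consumedIDs: make(map[string]bool),
	}
	for _, id := range hostGPUIDs {
		m.hostGPUIDs[id] = true
	}
	return m
}

// consume succeeds only if every requested GPU ID exists on the host and is
// not already held by another (possibly still-stopping) task.
func (m *gpuResourceManager) consume(taskGPUIDs []string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, id := range taskGPUIDs {
		if !m.hostGPUIDs[id] || m.consumedIDs[id] {
			return false
		}
	}
	for _, id := range taskGPUIDs {
		m.consumedIDs[id] = true
	}
	return true
}

// release frees the task's GPU IDs once the task has actually stopped.
func (m *gpuResourceManager) release(taskGPUIDs []string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, id := range taskGPUIDs {
		delete(m.consumedIDs, id)
	}
}

func main() {
	m := newGPUResourceManager([]string{"GPU-aaaa", "GPU-bbbb"})
	fmt.Println(m.consume([]string{"GPU-aaaa"})) // true: T1 takes GPU-aaaa
	fmt.Println(m.consume([]string{"GPU-aaaa"})) // false: T2 must wait (stays PENDING)
	m.release([]string{"GPU-aaaa"})              // T1 stops and releases its GPU
	fmt.Println(m.consume([]string{"GPU-aaaa"})) // true: T2 can now start
}
```

In this shape, a second task requesting an already-held GPU ID cannot be admitted until the first task actually stops and releases it, which matches the PENDING behavior in the logs below.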
Testing
Redacted Agent debug logs for the 2nd case from the manual testing above:
T1(e2332bb033774dbbbc7fc5b9566dd229) on (GPU-72371e2b-be81-46d5-d4b4-e3406546ae6b):
T2(1fa908d7dd6449878e0f9a1efa31764b) on the same GPU (GPU-72371e2b-be81-46d5-d4b4-e3406546ae6b); the task stays in PENDING until the first task stops:
After the first task stops and the GPU is freed, T2 starts up:
New tests cover the changes:
Yes
Description for the changelog
Bug - Fixed GPU resource accounting by maintaining a list of used GPU IDs instead of a count of used GPUs, to prevent possible GPU OOM in some situations
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.