Account GPUs as list of GPU IDs for resource accounting #3852

Merged
merged 1 commit into aws:dev on Aug 18, 2023

Conversation

@prateekchaudhry (Contributor) commented Aug 17, 2023

Summary

In the current implementation of task resource accounting in the ECS Agent, GPUs are accounted for by counting the number of consumed GPUs on the host at any given time. However, in the ACS payload the Agent receives, it is given the exact GPU IDs on the instance on which to schedule the containers. Consider a task running on a multi-GPU machine and using a GPU (say gpu1). If an ACS StopTask puts it in a stopping state (desiredStatus=stopped), the ECS backend releases gpu1 and a new task can be scheduled on it. With count-based accounting, the Agent can start the new task on gpu1 right away, because the number of available GPUs on a multi-GPU machine can still be >= 1. This may cause GPU OOM issues in the application when

  • both tasks end up on the same GPU (on a single-GPU machine the count effectively acts as a boolean, so the issue is not reproducible there), and
  • the stopping task's GPU memory + the new task's GPU memory > available GPU memory

To fix this, the host resource manager must account for a task's GPU resources as a list of the individual GPU IDs the task consumes or will consume, rather than as a count.
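
For illustration, a minimal sketch of the difference between the two checks (standalone Go, not the actual agent code; names and IDs are made up):

package main

import "fmt"

func main() {
    // Host with 2 GPUs; a stopping task still holds gpu1, and ECS has
    // already assigned gpu1 to the new task.
    totalGPUs := 2
    consumedCount := 1
    requested := 1

    // Count-based accounting: passes, because 1 consumed + 1 requested <= 2,
    // even though the specific GPU assigned to the new task is still in use.
    fmt.Println("count check passes:", consumedCount+requested <= totalGPUs)

    // ID-based accounting: fails until the stopping task releases gpu1.
    consumedIDs := map[string]bool{"gpu1": true} // GPU IDs held by running/stopping tasks
    assignedID := "gpu1"                         // GPU ID assigned to the new task in the ACS payload
    fmt.Println("id check passes:", !consumedIDs[assignedID])
}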

Implementation details

  • agent/engine/host_resource_manager.go - Changed the GPU resource type from INTEGER to STRINGSET
  • agent/api/task/task.go - Changed the GPU resource type from INTEGER to STRINGSET for each task's resources
  • agent/app/agent.go - When initializing host_resource_manager, initialize it with the list of host GPU IDs instead of a count
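
Taken together, GPU accounting roughly becomes a set-membership check instead of a counter. A hypothetical sketch of that idea (type and method names are illustrative, not the actual host_resource_manager.go code):

package main

import "fmt"

type gpuAccounting struct {
    hostGPUs map[string]bool // all GPU IDs on the instance (from agent init)
    consumed map[string]bool // GPU IDs currently held by running/stopping tasks
}

func newGPUAccounting(hostGPUIDs []string) *gpuAccounting {
    g := &gpuAccounting{hostGPUs: map[string]bool{}, consumed: map[string]bool{}}
    for _, id := range hostGPUIDs {
        g.hostGPUs[id] = true
    }
    return g
}

// consume succeeds only if every GPU ID the task needs exists on the host and
// is not already held by another (possibly stopping) task.
func (g *gpuAccounting) consume(taskGPUIDs []string) bool {
    for _, id := range taskGPUIDs {
        if !g.hostGPUs[id] || g.consumed[id] {
            return false
        }
    }
    for _, id := range taskGPUIDs {
        g.consumed[id] = true
    }
    return true
}

// release frees the task's GPU IDs once the task has stopped.
func (g *gpuAccounting) release(taskGPUIDs []string) {
    for _, id := range taskGPUIDs {
        delete(g.consumed, id)
    }
}

func main() {
    g := newGPUAccounting([]string{"GPU-aaaa", "GPU-bbbb"})
    fmt.Println(g.consume([]string{"GPU-aaaa"})) // true: first task holds GPU-aaaa
    fmt.Println(g.consume([]string{"GPU-aaaa"})) // false: same GPU still held, task waits
    g.release([]string{"GPU-aaaa"})
    fmt.Println(g.consume([]string{"GPU-aaaa"})) // true: freed after the first task stops
}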

Testing

  • Updated unit tests to cover the changes
  • End-to-end manual testing - on a multi-GPU machine, start a GPU task with a long stopTimeout that occupies 90% of GPU memory, stop it, and immediately start a new GPU task
    • If the new task lands on a different GPU, it starts immediately
    • If the new task lands on the same GPU, it waits until the first task stops (the desired change here)

Redacted Agent debug logs for the second case from the manual testing above.
T1 (e2332bb033774dbbbc7fc5b9566dd229) on GPU-72371e2b-be81-46d5-d4b4-e3406546ae6b:

level=debug time=2023-08-16T23:51:44Z msg="Enqueued task in Waiting Task Queue" taskARN="arn:aws:ecs:us-west-2:<acc>:task/external-memcheck/e2332bb033774dbbbc7fc5b9566dd229"
level=debug time=2023-08-16T23:51:44Z msg="Waiting for task event" task="e2332bb033774dbbbc7fc5b9566dd229"
level=debug time=2023-08-16T23:51:44Z msg="Task host resources to account for" PORTS_UDP=[] GPU=[GPU-72371e2b-be81-46d5-d4b4-e3406546ae6b] taskArn="arn:aws:ecs:us-west-2:<acc>:task/external-memcheck/e2332bb033774dbbbc7fc5b9566dd229" CPU=1024 MEMORY=8000 PORTS_TCP=[]
level=info time=2023-08-16T23:51:44Z msg="Resources successfully consumed, continue to task creation" taskArn="arn:aws:ecs:us-west-2:<acc>:task/external-memcheck/e2332bb033774dbbbc7fc5b9566dd229"
level=debug time=2023-08-16T23:51:44Z msg="Consumed resources after task consume call" taskArn="arn:aws:ecs:us-west-2:<acc>:task/external-memcheck/e2332bb033774dbbbc7fc5b9566dd229" CPU=2048 MEMORY=16000 PORTS_TCP=[22 2375 2376 51678 51679] PORTS_UDP=[] GPU=[GPU-7e35a581-c6b0-dafd-7d55-22dbe796a93c GPU-72371e2b-be81-46d5-d4b4-e3406546ae6b]
level=debug time=2023-08-16T23:51:44Z msg="Dequeued task from Waiting Task Queue" taskARN="arn:aws:ecs:us-west-2:<acc>:task/external-memcheck/e2332bb033774dbbbc7fc5b9566dd229"
level=info time=2023-08-16T23:51:44Z msg="Host resources consumed, progressing task" taskARN="arn:aws:ecs:us-west-2:<acc>:task/external-memcheck/e2332bb033774dbbbc7fc5b9566dd229"

T2 (1fa908d7dd6449878e0f9a1efa31764b) on the same GPU (GPU-72371e2b-be81-46d5-d4b4-e3406546ae6b); the task stays PENDING until the first one stops:

level=debug time=2023-08-16T23:52:52Z msg="Enqueued task in Waiting Task Queue" taskARN="arn:aws:ecs:us-west-2:<acc>:task/external-memcheck/1fa908d7dd6449878e0f9a1efa31764b"
level=debug time=2023-08-16T23:52:52Z msg="Waiting for task event" task="1fa908d7dd6449878e0f9a1efa31764b"
level=debug time=2023-08-16T23:52:52Z msg="Task host resources to account for" PORTS_TCP=[] PORTS_UDP=[] GPU=[GPU-72371e2b-be81-46d5-d4b4-e3406546ae6b] taskArn="arn:aws:ecs:us-west-2:<acc>:task/external-memcheck/1fa908d7dd6449878e0f9a1efa31764b" CPU=1024 MEMORY=8000
level=info time=2023-08-16T23:52:52Z msg="Resources not consumed, enough resources not available" taskArn="arn:aws:ecs:us-west-2:<acc>:task/external-memcheck/1fa908d7dd6449878e0f9a1efa31764b"
level=debug time=2023-08-16T23:52:52Z msg="Consumed resources after task consume call" PORTS_TCP=[22 2375 2376 51678 51679] PORTS_UDP=[] GPU=[GPU-72371e2b-be81-46d5-d4b4-e3406546ae6b] taskArn="arn:aws:ecs:us-west-2:<acc>:task/external-memcheck/1fa908d7dd6449878e0f9a1efa31764b" CPU=1024 MEMORY=8000

After the first task stops and the GPU is freed up, T2 starts:

level=debug time=2023-08-16T23:54:11Z msg="Task host resources to account for" PORTS_TCP=[] PORTS_UDP=[] GPU=[GPU-72371e2b-be81-46d5-d4b4-e3406546ae6b] taskArn="arn:aws:ecs:us-west-2:<acc>:task/external-memcheck/1fa908d7dd6449878e0f9a1efa31764b" CPU=1024 MEMORY=8000
level=info time=2023-08-16T23:54:11Z msg="Resources successfully consumed, continue to task creation" taskArn="arn:aws:ecs:us-west-2:<acc>:task/external-memcheck/1fa908d7dd6449878e0f9a1efa31764b"
level=debug time=2023-08-16T23:54:11Z msg="Consumed resources after task consume call" MEMORY=8000 PORTS_TCP=[22 2375 2376 51678 51679] PORTS_UDP=[] GPU=[GPU-72371e2b-be81-46d5-d4b4-e3406546ae6b] taskArn="arn:aws:ecs:us-west-2:<acc>:task/external-memcheck/1fa908d7dd6449878e0f9a1efa31764b" CPU=1024
level=debug time=2023-08-16T23:54:11Z msg="Dequeued task from Waiting Task Queue" taskARN="arn:aws:ecs:us-west-2:<acc>:task/external-memcheck/1fa908d7dd6449878e0f9a1efa31764b"
level=info time=2023-08-16T23:54:11Z msg="Host resources consumed, progressing task" taskARN="arn:aws:ecs:us-west-2:<acc>:task/external-memcheck/1fa908d7dd6449878e0f9a1efa31764b"
level=debug time=2023-08-16T23:54:11Z msg="Skipping event emission for task" task="1fa908d7dd6449878e0f9a1efa31764b" error="status not recognized by ECS"
level=debug time=2023-08-16T23:54:11Z msg="Task not steady state or terminal; progressing it" task="1fa908d7dd6449878e0f9a1efa31764b"

New tests cover the changes:
Yes

Description for the changelog

Bug - Fixed GPU resource accounting by maintaining a list of used GPU IDs instead of a count of used GPUs, to prevent possible GPU OOM in some situations

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@Realmonia (Contributor) left a comment

Looks good. Is there a case where a user intentionally makes multiple tasks share the same GPU?

@prateekchaudhry (Contributor, Author) commented Aug 17, 2023

Is there a case where a user intentionally makes multiple tasks share the same GPU?

This should not be possible (it was not possible before task resource accounting either). If we try to launch more GPU tasks than there are GPUs (e.g., 3 single-container GPU tasks on a 2-GPU instance), ECS does not allow the task placement, with Reasons: ["RESOURCE:GPU"]. ECS verifies that each task gets assigned different GPUs.

Within a task, the agent checks that each container is associated with a distinct GPU (see the sketch below).

So each GPU is supposed to get its own container.

Besides, the current behavior (with the count-based GPU implementation) is effectively random, and the GPU memory available while the first task is stopping is also unpredictable, depending on how much memory it has consumed. Even though the second task does get started a bit sooner, and all is well if there is no risk of OOM, it is still a risk and should be fixed.
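
For illustration only, a minimal sketch of such a distinct-GPU check within a task (hypothetical names, not the agent's actual validation code):

package main

import "fmt"

// validateDistinctGPUs rejects a task in which two containers are assigned
// the same GPU ID. containerGPUs maps container name -> assigned GPU IDs.
func validateDistinctGPUs(containerGPUs map[string][]string) error {
    seen := map[string]string{} // GPU ID -> container already using it
    for name, ids := range containerGPUs {
        for _, id := range ids {
            if other, ok := seen[id]; ok {
                return fmt.Errorf("gpu %s assigned to both %s and %s", id, other, name)
            }
            seen[id] = name
        }
    }
    return nil
}

func main() {
    err := validateDistinctGPUs(map[string][]string{
        "c1": {"GPU-aaaa"},
        "c2": {"GPU-aaaa"}, // duplicate assignment should be rejected
    })
    fmt.Println(err)
}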

@prateekchaudhry merged commit 2d348dd into aws:dev on Aug 18, 2023
6 checks passed