Dont consume host resources for tasks getting STOPPED while waiting in waitingTasksQueue #3750
@@ -1227,7 +1227,7 @@ func TestHostResourceManagerResourceUtilization(t *testing.T) {
 testTask := createTestTask(taskArn)

 // create container
-A := createTestContainerWithImageAndName(baseImageForOS, "A")
+A := createTestContainerWithImageAndName(baseImageForOS, fmt.Sprintf("A-%d", i))
This is an unrelated change in another test case related to task resource accounting. It can make debugging easier in the future if each task and container is named uniquely.
Also, a general question about agent restart (sorry if I am missing some context): how are we storing tasks in the queue to the statefile (boltdb) when the agent restarts or recovers from an unexpected exit?
We do not store tasks in queue to statefile. On restarts, So a task which has not progressed beyond that state will queue up again, as on restart, each task's startTask is called again. Those tasks which are not pre-allocated during
Oh I understand now. So those tasks are kept in state like
…n waitingTasksQueue (#3750)
* dont consume resources for acs stopped tasks
* add integ test for the stopTask in waitingTaskQueue case
* remove discardConsumedHostResourceEvents
* Revert "Revert "host resource manager initialization"" This reverts commit dafb967.
* Revert "Revert "Add method to get host resources reserved for a task (#3706)"" This reverts commit 8d824db.
* Revert "Revert "Add host resource manager methods (#3700)"" This reverts commit bec1303.
* Revert "Revert "Remove task serialization and use host resource manager for task resources (#3723)"" This reverts commit cb54139.
* Revert "Revert "add integ tests for task accounting (#3741)"" This reverts commit 61ad010.
* Revert "Revert "Change reconcile/container update order on init and waitForHostResources/emitCurrentStatus order (#3747)"" This reverts commit 60a3f42.
* Revert "Revert "Dont consume host resources for tasks getting STOPPED while waiting in waitingTasksQueue (#3750)"" This reverts commit 8943792.
* Revert reverted changes for task resource accounting (#3796)
* Revert "Revert "host resource manager initialization"" This reverts commit dafb967.
* Revert "Revert "Add method to get host resources reserved for a task (#3706)"" This reverts commit 8d824db.
* Revert "Revert "Add host resource manager methods (#3700)"" This reverts commit bec1303.
* Revert "Revert "Remove task serialization and use host resource manager for task resources (#3723)"" This reverts commit cb54139.
* Revert "Revert "add integ tests for task accounting (#3741)"" This reverts commit 61ad010.
* Revert "Revert "Change reconcile/container update order on init and waitForHostResources/emitCurrentStatus order (#3747)"" This reverts commit 60a3f42.
* Revert "Revert "Dont consume host resources for tasks getting STOPPED while waiting in waitingTasksQueue (#3750)"" This reverts commit 8943792.
* fix memory resource accounting for multiple containers in single task (#3782)
* fix memory resource accounting for multiple containers
* change unit tests for multiple containers, add unit test for awsvpc
Summary
Context: The Task Resource Accounting feature implements queuing (`monitorQueuedTasks`) for tasks to go through, where each task waits to 'consume' resources in `host_resource_manager`. `host_resource_manager` keeps an account of each task that has acquired resources to progress to task creation/running.

If an ACS `StopTask` request arrives at the Agent while the task is still in this queue, overseeTask, which is waiting in `waitForHostResources`, falls through without waiting for the event from `monitorQueuedTasks` and ends up calling `host_resource_manager.release()` during emitTaskEvent.

But if these `STOPPED` tasks are dequeued later (in docker_task_engine), they just write to the channel with no listeners after the `host_resource_manager.consume()` call in `docker_task_engine`, and the resources persist in host_resource_manager.

This PR fixes this by not calling `host_resource_manager.consume()` for tasks in `monitorQueuedTasks` whose desired status has changed to `STOPPED`. Some cautionary implementation steps have been taken to isolate the working of `monitorQueuedTasks`, as described under Implementation.

Implementation
* A lock, `monitorQueuedTasksLock`, for the main body of the loop in `monitorQueuedTasks`. This synchronizes the processing of the topTask in `monitorQueuedTasks` with any parallel update to its desired status, say from ACS updates.
* The loop body of `monitorQueuedTasks` has been moved to a method, `tryDequeueWaitingTasks`, so that it can be put into this critical section of `monitorQueuedTasksLock`.
* `monitorQueuedTasksLock` is also used in `handleDesiredStatusChange`, which updates the desired status of a managed task.
* Integ test `TestHostResourceManagerStopTaskNotBlockWaitingTasks` to test this behavior.
* Unique container names in `TestHostResourceManagerResourceUtilization` for easier debugging.

As a result of this implementation, consider the scenarios:
`monitorQueuedTasks`
`STOPPED` and queue returns on `consumedHostResourceEvent` without consuming resources
`monitorQueuedTasks`
`monitorQueuedTasks` is done processing. Resources will be consumed, and released in `emitTaskEvent`
`monitorQueuedTasks` is done processing
`monitorQueuedTasks` is done processing. Resources will be consumed and released in `emitTaskEvent`
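The behavior described in this summary can be sketched in Go roughly as follows. This is a minimal, self-contained illustration under stated assumptions, not the agent's actual code: the types `task`, `hostResources`, and `engine`, and the methods `enqueue`, `stopTask`, and `tryDequeue`, are hypothetical stand-ins for the managed task, `host_resource_manager`, `handleDesiredStatusChange`, and `tryDequeueWaitingTasks`. The single CPU counter and the buffered event channel are also assumptions made to keep the sketch small.

```go
package main

import (
	"fmt"
	"sync"
)

// NOTE: every type and method here is a hypothetical stand-in used for
// illustration; none of it is the agent's real API.

// task models a queued task with a single CPU requirement.
type task struct {
	arn            string
	cpu            int
	desiredStopped bool          // guarded by engine.queueLock
	consumedEvent  chan struct{} // buffered (assumption) so a send never blocks without a listener
}

func newTask(arn string, cpu int) *task {
	return &task{arn: arn, cpu: cpu, consumedEvent: make(chan struct{}, 1)}
}

// hostResources is a toy version of the host_resource_manager accounting.
type hostResources struct {
	mu       sync.Mutex
	free     int
	reserved map[string]int
}

func newHostResources(cpu int) *hostResources {
	return &hostResources{free: cpu, reserved: make(map[string]int)}
}

// consume reserves cpu for arn and reports whether enough was available.
func (h *hostResources) consume(arn string, cpu int) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	if cpu > h.free {
		return false
	}
	h.free -= cpu
	h.reserved[arn] = cpu
	return true
}

// release returns whatever arn had reserved; it is a no-op for unknown tasks.
func (h *hostResources) release(arn string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.free += h.reserved[arn]
	delete(h.reserved, arn)
}

func (h *hostResources) reservedFor(arn string) int {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.reserved[arn]
}

// engine owns the waiting queue; queueLock plays the role of monitorQueuedTasksLock.
type engine struct {
	queueLock sync.Mutex
	waiting   []*task
	host      *hostResources
}

func (e *engine) enqueue(t *task) {
	e.queueLock.Lock()
	defer e.queueLock.Unlock()
	e.waiting = append(e.waiting, t)
}

// stopTask updates the desired status under the same lock as the dequeue step.
func (e *engine) stopTask(t *task) {
	e.queueLock.Lock()
	defer e.queueLock.Unlock()
	t.desiredStopped = true
}

// tryDequeue handles the task at the head of the queue and reports whether
// it was removed. A task whose desired status is stopped is removed without
// consuming resources; any other task stays queued until consume succeeds.
func (e *engine) tryDequeue() bool {
	e.queueLock.Lock()
	defer e.queueLock.Unlock()
	if len(e.waiting) == 0 {
		return false
	}
	head := e.waiting[0]
	if !head.desiredStopped && !e.host.consume(head.arn, head.cpu) {
		return false // not enough resources yet; keep waiting
	}
	e.waiting = e.waiting[1:]
	head.consumedEvent <- struct{}{}
	return true
}

func main() {
	host := newHostResources(2)
	e := &engine{host: host}
	running, stopped := newTask("running", 2), newTask("stopped", 2)
	e.enqueue(running)
	e.tryDequeue() // "running" reserves both CPUs
	e.enqueue(stopped)
	e.stopTask(stopped) // stop arrives while the task is still queued
	e.tryDequeue()      // dequeued without calling consume
	fmt.Println("reserved for stopped task:", host.reservedFor("stopped"))
	fmt.Println("tasks still waiting:", len(e.waiting))
}
```

Because `stopTask` and `tryDequeue` take the same lock, a stop request is observed either entirely before the head task is processed (nothing is consumed) or entirely after it (resources are consumed and must be released later), which mirrors the synchronization this PR describes.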
Related Containers Roadmap Issue
aws/containers-roadmap#325
Testing
Verified by
* `stopTimeout` of 300s. Started another identical task and stopped it. This task goes to the waiting queue and also returns in the overseeTask goroutine. Later, this task does not call consume, as confirmed in the debug logs.
* Ran `TestHostResourceManagerStopTaskNotBlockWaitingTasks` and verified it succeeds. This test simulates an ACS stopTask for a task stuck in `waitingTasksQueue` and verifies that the resources are released in host_resource_manager.

New tests cover the changes: Yes
Description for the changelog
Dont consume host resources for tasks getting STOPPED while waiting
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.