Streamline how tasks stopped per ECS Control Plane #4301
Merged
Summary
Streamline how tasks are stopped when ECS Control Plane indicates that a task should be stopped.
Currently ECS Control Plane indicates to ECS Data Plane via Agent Communication Service (ACS) that a task should be stopped in 2 ways:
However, the workflow invoked (that is, the set of actions taken) to actually stop a task on the ECS Data Plane side currently differs between these 2 cases. We should streamline this for consistency and better maintainability of the code (e.g., not having to make changes in multiple places in the future).
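As a rough illustration of the goal, the sketch below routes two different stop signals through one shared code path. Every name in it (`taskEngine`, `applyDesiredStatus`, `stopFromSignalA`, `stopFromSignalB`) is hypothetical and chosen only for illustration; it is not the ECS Agent's actual code.

```go
package main

import "fmt"

// task is a hypothetical, simplified stand-in for an agent task.
type task struct {
	arn           string
	desiredStatus string
}

// taskEngine is a hypothetical engine that tracks tasks by ARN.
type taskEngine struct {
	tasks map[string]*task
}

// applyDesiredStatus is the single place where a desired-status update is
// applied. It returns false when the task is unknown to the engine.
func (e *taskEngine) applyDesiredStatus(arn, desiredStatus string) bool {
	t, ok := e.tasks[arn]
	if !ok {
		return false
	}
	t.desiredStatus = desiredStatus
	return true
}

// Both handlers delegate to applyDesiredStatus instead of mutating the task
// themselves, so a future change only needs to be made in one place.
func (e *taskEngine) stopFromSignalA(arn string) bool {
	return e.applyDesiredStatus(arn, "STOPPED")
}

func (e *taskEngine) stopFromSignalB(arn string) bool {
	return e.applyDesiredStatus(arn, "STOPPED")
}

func main() {
	e := &taskEngine{tasks: map[string]*task{
		"task-1": {arn: "task-1", desiredStatus: "RUNNING"},
		"task-2": {arn: "task-2", desiredStatus: "RUNNING"},
	}}
	e.stopFromSignalA("task-1")
	e.stopFromSignalB("task-2")
	fmt.Println(e.tasks["task-1"].desiredStatus, e.tasks["task-2"].desiredStatus)
}
```

With this shape, the two signal handlers cannot drift apart in behavior, because neither contains stop logic of its own.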
Implementation details
- The `AddTask` method is currently overloaded. Factor out the logic for updating a task in the task engine into a separate method, `UpsertTask`.
- `UpsertTask` method
- Rename the `updateTaskUnsafe` method to `updateTaskDesiredStatusUnsafe` to be more specific/clear.
- Replace calling `SetDesiredStatus` and `UpdateDesiredStatus` on the task directly with calling the task engine `UpsertTask` method instead.
- The `taskSteadyStatePollInterval` and `taskSteadyStatePollIntervalJitter` values are no longer necessary. Using `UpsertTask` to stop the task ensures that the ACS transition to stop the task is emitted on the task's `acsMessages` channel, so these values no longer affect how long the `TestTaskStopVerificationACKResponder*` integration tests take to run (see the discussion on the previous pull request's comment thread for additional context).
Testing
Automated pull request tests.
Manual testing using a custom ECS Agent built with the changes in this pull request, run against the internal bug repro environment mentioned in #4240, to ensure that the bug addressed by that pull request is still addressed with the changes in this pull request. With this custom ECS Agent, the aforementioned bug is still not observed, and we observe that the transition to stop the task is put on the task's ACS messages channel and that stopping the container happens immediately after the transition is applied.
Partial manual testing ECS Agent logs
{
"level": "debug",
"time": "2024-08-20T17:06:44Z",
"msg": "Received message of type: TaskStopVerificationAck"
}
{
"level": "debug",
"time": "2024-08-20T17:06:44Z",
"msg": "ACS activity occurred"
}
{
"level": "debug",
"time": "2024-08-20T17:06:44Z",
"msg": "Handling TaskStopVerificationACKMessage"
}
{
"level": "info",
"time": "2024-08-20T17:06:44Z",
"msg": "Sending message to task stopper to stop task",
"taskARN": "arn:aws:ecs:us-west-2:REDACTED:task/my-cluster/b176d749a3844107ab3823af0b34c18b"
}
{
"level": "info",
"time": "2024-08-20T17:06:44Z",
"msg": "Stopping task from task stop verification ACK: %s",
"taskARN": "arn:aws:ecs:us-west-2:REDACTED:task/my-cluster/b176d749a3844107ab3823af0b34c18b"
}
{
"level": "debug",
"time": "2024-08-20T17:06:44Z",
"msg": "Putting update on the acs channel",
"desiredStatus": "STOPPED",
"task": "b176d749a3844107ab3823af0b34c18b"
}
{
"level": "debug",
"time": "2024-08-20T17:06:44Z",
"msg": "Update taken off the acs channel",
"desiredStatus": "STOPPED",
"task": "b176d749a3844107ab3823af0b34c18b"
}
{
"level": "info",
"time": "2024-08-20T17:06:44Z",
"msg": "Managed task got acs event",
"desiredStatus": "STOPPED",
"seqnum": 0,
"task": "b176d749a3844107ab3823af0b34c18b"
}
{
"level": "info",
"time": "2024-08-20T17:06:44Z",
"msg": "New acs transition",
"desiredStatus": "STOPPED",
"seqnum": 0,
"task": "b176d749a3844107ab3823af0b34c18b"
}
{
"level": "info",
"time": "2024-08-20T17:06:44Z",
"msg": "Sleeping 45 seconds before applying acs transition (THIS IS ONLY DONE FOR INTERNAL BUG REPRO ENIVRONMENT)"
}
...
{
"level": "debug",
"time": "2024-08-20T17:07:29Z",
"msg": "Updating task's desired status",
"nContainers": 1,
"nENIs": 0,
"taskArn": "arn:aws:ecs:us-west-2:REDACTED:task/my-cluster/b176d749a3844107ab3823af0b34c18b",
"taskDesiredStatus": "STOPPED",
"taskFamily": "test-sleep",
"taskKnownStatus": "RUNNING",
"taskVersion": "3"
}
...
{
"level": "debug",
"time": "2024-08-20T17:07:29Z",
"msg": "Waiting for task event",
"task": "b176d749a3844107ab3823af0b34c18b"
}
{
"level": "info",
"time": "2024-08-20T17:07:29Z",
"msg": "Managed task got acs event",
"desiredStatus": "STOPPED",
"seqnum": 0,
"task": "b176d749a3844107ab3823af0b34c18b"
}
{
"level": "info",
"time": "2024-08-20T17:07:29Z",
"msg": "New acs transition",
"desiredStatus": "STOPPED",
"seqnum": 0,
"task": "b176d749a3844107ab3823af0b34c18b"
}
...
{
"level": "debug",
"time": "2024-08-20T17:07:29Z",
"msg": "Update taken off the acs channel",
"desiredStatus": "STOPPED",
"task": "b176d749a3844107ab3823af0b34c18b"
}
{
"level": "info",
"time": "2024-08-20T17:07:29Z",
"msg": "Stopping container",
"container": "sleepy300",
"task": "b176d749a3844107ab3823af0b34c18b"
}
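The log sequence above (update put on the acs channel, update taken off the channel, transition handled by the managed task, container stopped) follows a producer/consumer pattern over a Go channel. The sketch below illustrates only that pattern in simplified form. The names `managedTask`, `acsTransition`, `emitACSTransition`, and `run` are hypothetical and are not the agent's real types or methods.

```go
package main

import (
	"fmt"
	"sync"
)

// acsTransition is a hypothetical message describing a desired-status change.
type acsTransition struct {
	desiredStatus string
	seqnum        int64
}

// managedTask is a hypothetical, simplified task with its own message channel.
type managedTask struct {
	id          string
	knownStatus string
	acsMessages chan acsTransition
	log         []string // records what the consumer loop did, in order
}

func newManagedTask(id string) *managedTask {
	return &managedTask{
		id:          id,
		knownStatus: "RUNNING",
		acsMessages: make(chan acsTransition),
	}
}

// emitACSTransition puts an update on the task's channel. With an unbuffered
// channel, this blocks until the consumer loop takes the update off.
func (mt *managedTask) emitACSTransition(tr acsTransition) {
	mt.acsMessages <- tr
}

// run consumes transitions until the task is stopped or the channel is closed.
func (mt *managedTask) run() {
	for tr := range mt.acsMessages {
		mt.log = append(mt.log, "got acs event: "+tr.desiredStatus)
		if tr.desiredStatus == "STOPPED" && mt.knownStatus != "STOPPED" {
			mt.log = append(mt.log, "stopping containers")
			mt.knownStatus = "STOPPED"
			return
		}
	}
}

func main() {
	mt := newManagedTask("example-task")

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		mt.run()
	}()

	// The stop request travels over the channel rather than mutating the
	// task directly, so the consumer reacts as soon as it receives it.
	mt.emitACSTransition(acsTransition{desiredStatus: "STOPPED"})
	wg.Wait()

	fmt.Println(mt.knownStatus)
	for _, line := range mt.log {
		fmt.Println(line)
	}
}
```

Because the consumer reacts to the message itself, no polling interval is involved in how quickly the stop is applied in this sketch.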
New tests cover the changes: existing unit and integration tests updated
Description for the changelog
Streamline how tasks stopped per ECS Control Plane
Additional Information
Does this PR include breaking model changes? If so, have you added transformation functions?
No
Does this PR include the addition of new environment variables in the README?
No
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.