Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

engine,dockerclient: per container start/stop timeouts #1849

Merged
merged 2 commits into from
Feb 19, 2019

Conversation

adnxn
Copy link
Contributor

@adnxn adnxn commented Feb 13, 2019

Summary

  • start timeout is governed by the agent and is the context timeout for
    the StartContainer api call

  • stop timeout is the parameter passed to StopContainer and is time the
    docker daemon waits after a StopContainer call to issue a SIGKILL to the
    container

Implementation details

Testing

New tests cover the changes: yes
unit test:

  • TestPerContainerTimeouts
  • TestPostUnmarshalPerContainerTimeouts

integ test:

  • TestPerContainerStopTimeout

Description for the changelog

  • Feature - Support for granular start and stop timeouts per container

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@adnxn adnxn changed the title engine,dockerclient: per container start/stop timeouts [WIP] engine,dockerclient: per container start/stop timeouts Feb 13, 2019
@adnxn adnxn added this to the 1.26.0 milestone Feb 13, 2019
@adnxn adnxn requested a review from petderek February 13, 2019 21:35
@adnxn adnxn added bot/test and removed bot/test labels Feb 14, 2019
@adnxn
Copy link
Contributor Author

adnxn commented Feb 15, 2019

will squash to single commit before unWIP

@adnxn adnxn force-pushed the dev-granular-timeouts branch 2 times, most recently from 38b4496 to 057f252 Compare February 15, 2019 21:30
@adnxn adnxn changed the title [WIP] engine,dockerclient: per container start/stop timeouts engine,dockerclient: per container start/stop timeouts Feb 15, 2019
@adnxn adnxn requested review from a team and yhlee-aws February 15, 2019 21:46
@@ -550,8 +552,8 @@ func (dg *dockerGoClient) createContainer(ctx context.Context,
return dg.containerMetadata(ctx, dockerContainer.ID)
}

func (dg *dockerGoClient) StartContainer(ctx context.Context, id string, timeout time.Duration) DockerContainerMetadata {
ctx, cancel := context.WithTimeout(ctx, timeout)
func (dg *dockerGoClient) StartContainer(ctx context.Context, id string, ctxTimeout time.Duration) DockerContainerMetadata {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add similar comment for ctxTimeout here as well? is there valid range of values for this like it does for stop timeout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, this is just what we chose for the start container timeout. we have some sane defaults here #1321, but since we're allowing it to be configurable from the task def we don't have a range check at this layer. the service is enforcing a lower bound of 2s and no upper bound.

the docker daemon's start container api doesnt enforce timeouts.

defer cancel()
defer metrics.MetricsEngineGlobal.RecordDockerMetric("STOP_CONTAINER")()
// Buffered channel so in the case of timeout it takes one write, never gets
// read, and can still be GC'd
response := make(chan DockerContainerMetadata, 1)
go func() { response <- dg.stopContainer(ctx, dockerID) }()
go func() { response <- dg.stopContainer(ctx, dockerID, timeout) }()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be passing ctxTimeout? What is the difference between timeout and ctxTimeout?

For my understanding, why isn't startContainer handled the same with timeout?

Copy link
Contributor Author

@adnxn adnxn Feb 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between timeout and ctxTimeout?

so i have some information in the comment above this code and also the PR description.

but also - ctxTimeout is for the timeout for the context the agent uses for the actual call to docker, where as timeout is a parameter for the api call to stop container call. the timeout parameter is how long the docker daemon will wait after a docker stop call to sigkill the container. we want our context timeout to be greater than the daemon's sigkill timeout.

Copy link
Contributor Author

@adnxn adnxn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❗️ reminder: need to bump state file version for container fields

@adnxn adnxn changed the base branch from dev to container-ordering-feature February 18, 2019 16:26
* start timeout is governed by the agent and is the context timeout for
the StartContainer api call

* stop timeout is the parameter passed to StopContainer and is time the
docker daemon waits after a StopContainer call to issue a SIGKILL to the
container
@adnxn adnxn force-pushed the dev-granular-timeouts branch from 057f252 to a7b34b2 Compare February 18, 2019 19:16
@adnxn
Copy link
Contributor Author

adnxn commented Feb 18, 2019

cherry-picked changes on to container-ordering-feature branch and resolved merge conflicts.

@@ -236,6 +236,10 @@ type Container struct {

Secrets []*Secret `locationName:"secrets" type:"list"`

StartTimeout *int64 `locationName:"startTimeout" type:"integer"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make sure ECS service treats it as int64 as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're treating this field as we do with cpu/mem

@@ -78,7 +78,10 @@ const (
// a) Add 'Associations' field to 'api.task.task'
// b) Add 'GPUIDs' field to 'apicontainer.Container'
// c) Add 'NvidiaRuntime' field to 'api.task.task'
// 20) Add 'DependsOn' field to 'apicontainer.Container'
// 20)
// a) Add 'DependsOn' field to 'apicontainer.Container'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indentation

// 20) Add 'DependsOn' field to 'apicontainer.Container'
// 20)
// a) Add 'DependsOn' field to 'apicontainer.Container'
// b) Add 'StartTime' field to 'api.container.Container'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please add a test for the state file version bump in the state manager package?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, will add state manager unit tests for this.

@@ -1447,3 +1447,42 @@ func TestGPUAssociationTask(t *testing.T) {
go taskEngine.AddTask(&taskUpdate)
verifyTaskIsStopped(stateChangeEvents, testTask)
}

func TestPerContainerStopTimeout(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any reason why this is under unix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding TODO for windows test.

timeout := engine.cfg.DockerStopTimeout + dockerclient.StopContainerTimeout
return engine.client.StopContainer(engine.ctx, dockerContainer.DockerID, timeout)

apiTimeoutStopContainer := container.GetStopTimeout()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about calling this stopContainerAPITimeout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont have a strong feeling either way tbh

@adnxn
Copy link
Contributor Author

adnxn commented Feb 18, 2019

ecs/windows/unit_test failing intermittently here due to #1864 but passing in travis.

Copy link
Contributor

@sharanyad sharanyad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code lgtm.
just that ECS_CONTAINER_STOP_TIMEOUT and stopTimeout have different meanings and might confuse customers :(

@adnxn
Copy link
Contributor Author

adnxn commented Feb 19, 2019

just that ECS_CONTAINER_STOP_TIMEOUT and stopTimeout have different meanings and might confuse customers :(

since this is a communication problem, ill update the README entry for ECS_CONTAINER_STOP_TIMEOUT and ECS_CONTAINER_START_TIMEOUT to specify that they have instance wide scope. we'll need to make sure the docs for the new container fields differentiate the two as well (task/container def field vs ecs.config variable)

@adnxn adnxn deleted the dev-granular-timeouts branch October 21, 2019 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants