-
Notifications
You must be signed in to change notification settings - Fork 617
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pull retries and configurable routes for awsvpc
network mode
#975
Pull retries and configurable routes for awsvpc
network mode
#975
Conversation
This change enables a static IPv4 address to be assigned to tasks with the `awsvpc` network mode and also configures a return route for traffic between the task and the host.
21c19d9
to
e45a20b
Compare
@@ -17,6 +17,7 @@ | |||
package engine | |||
|
|||
import ( | |||
context0 "context" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was this added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using the context
package in my code.
I have no idea why golang.org/x/net/context
is still imported here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the golang.org/x/net/context
is used by the task_engine.Init()
, I think that's why the reason it's imported. I asked because I saw two context imported, we can't get rid of this until go docker client updated this.
agent/ecscni/plugin.go
Outdated
"github.com/pkg/errors" | ||
) | ||
|
||
// CNIClient defines the method of setting/cleaning up container namespace | ||
type CNIClient interface { | ||
Version(string) (string, error) | ||
Capabilities(string) ([]string, error) | ||
SetupNS(*Config) error | ||
CleanupNS(*Config) error | ||
SetupNS(*Config, []cnitypes.IPNet) error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of passing an extra parameter, would it be better to add a new field in the Config
struct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could do that if you feel strongly about it.
|
||
response := make(chan DockerContainerMetadata, 1) | ||
go func() { response <- dg.pullImage(image, authData) }() | ||
go func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we distinguish the errors generated by the agent from the errors generated by docker, and we only retry for errors from docker? For the errors generated by the agent like CannotPullECRContainerError
and CreateEmptyVolumeError
probably won't recover with retry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CannotPullECRContainerError
is already excluded from retries. I can extend CreateEmptyVolumeError
to exclude it from retries as well. The other errors wrapped by CannotPullContainerError
are not excluded right now; I only covered ECR because it was easy.
@@ -2,6 +2,7 @@ | |||
|
|||
## UNRELEASED | |||
* Feature - Support for provisioning Tasks with ENIs | |||
* Enhancement - Retry failed container image pull operations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also added another feature that add customized route in the container namespace for task using awsvpc
. If I read the code correctly, this only can be configured by the config file, would it be more convenient to use environment variable. Also we may need to check the route before adding it, as the task traffic inside the container namespace can be changed by adding some route?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right; I had intended to make it configurable by environment variable as well but apparently never wrote the code. I'll do that.
I don't think we need to check the route. This is a sufficiently advanced, optional feature that is used in fairly narrow use-cases.
agent/engine/docker_task_engine.go
Outdated
return nil | ||
} | ||
if imagePullError, ok := metadata.Error.(CannotPullContainerError); ok { | ||
seelog.Warnf("When pulling %s: %s", container.Image, pullError) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pullError
and imagePullError
aren't supposed to be distinct names. Can we fix this commit before the PR is merged? Either name is fine to use.
(I realize that this code doesn't exist as written in the final PR, but I do want bisects, etc, to work and we shouldn't have unbuildable code in the repo.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'll fix this.
@@ -70,8 +70,8 @@ func New(p client.ConfigProvider, cfgs ...*aws.Config) *ECS { | |||
|
|||
// newClient creates, initializes and returns a new service client instance. | |||
func newClient(cfg aws.Config, handlers request.Handlers, endpoint, signingRegion, signingName string) *ECS { | |||
if signingName == "" { | |||
signingName = ServiceName | |||
if len(signingName) == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I didn't make this change...this is what happened when I ran make gogenerate
.
But the logic here is actually a little better; even though the zero-value of a string is ""
, this also would handle a *string
where the zero-value is nil
with no changes.
} | ||
|
||
func (engine *DockerTaskEngine) buildCNIConfigFromTaskContainer(task *api.Task, container *api.Container) (*ecscni.Config, error) { | ||
cfg, err := task.BuildCNIConfig() | ||
if err != nil { | ||
return nil, errors.Wrapf(err, "engine: build cni configuration from taskfailed") | ||
} | ||
|
||
if engine.cfg.OverrideAWSVPCLocalIPv4Address != nil && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit]
consider moving this expression to a smaller function. ie, hasValidAWSVPCLocalOverrides(cfg *Config)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm going to leave this as-is for now unless you feel strongly about it.
select { | ||
case <-ctx.Done(): | ||
return nil | ||
default: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat!!
Why keep the default block empty? Would it make sense to put err = fn()
.. etc there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seemed clearer to me to have it out. This way it reads that we early return if the context is done, but otherwise keep going.
e45a20b
to
557658d
Compare
@richardpen @nmeyerhans I've addressed your comments, can you take a look? |
557658d
to
cf49171
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also have you manually tested the route add part?
agent/api/errors.go
Outdated
@@ -46,7 +46,7 @@ type NamedError interface { | |||
ErrorName() string | |||
} | |||
|
|||
// NamedError is a wrapper type for 'error' which adds an optional name and provides a symetric marshal/unmarshal | |||
// NamedError is a wrapper type for 'error' which adds an optional name and provides a symmetric marshal/unmarshal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this comment should be moved above to the NamedError
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll change it to talking about DefaultNamedError
, since that's the implementation that actually provides the marshal/unmarshal behavior.
@@ -218,7 +219,8 @@ func fileConfig() (Config, error) { | |||
|
|||
err = json.Unmarshal(data, &cfg) | |||
if err != nil { | |||
seelog.Errorf("Error reading cfg json data, err %v", err) | |||
seelog.Criticalf("Error reading cfg json data, err %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth to logging the string(data)
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, not a bad idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After further thought, I'm going to leave this out for now. I'm concerned about the case where a customer adds credentials (EngineAuthData
) to the file and then we have a parse error; that would cause credentials to be written out to the log.
agent/config/config.go
Outdated
if additionalLocalRoutesEnv != "" { | ||
err := json.Unmarshal([]byte(additionalLocalRoutesEnv), &additionalLocalRoutes) | ||
if err != nil { | ||
seelog.Warnf("Invalid format for ECS_AWSVPC_ADDITIONAL_LOCAL_ROUTES, expected a json array of CIDRs: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, should we log the additionalLocalRoutesEnv
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, can do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also going to leave this out. We don't log the read environment variable for any of the other ones that we parse. I'd rather leave this consistent for now and then revisit it later when we can change all of them.
os.Setenv("ECS_ENABLE_TASK_ENI", "true") | ||
defer os.Unsetenv("ECS_ENABLE_TASK_ENI") | ||
additionalLocalRoutesJSON := `["1.2.3.4/22","5.6.7.8/32"]` | ||
os.Setenv("ECS_AWSVPC_ADDITIONAL_LOCAL_ROUTES", additionalLocalRoutesJSON) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you missed defer
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops!
agent/config/config_unix_test.go
Outdated
assert.Equal(t, map[string]string{"attribute1": "value1"}, cfg.InstanceAttributes) | ||
} | ||
|
||
// TestDockerAuthMergeFromFile tests docker auth read from file correctly after merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the comment is incorrect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops!
agent/config/config_windows_test.go
Outdated
@@ -25,6 +25,8 @@ import ( | |||
) | |||
|
|||
func TestConfigDefault(t *testing.T) { | |||
os.Setenv("AWS_DEFAULT_REGION", "foo-bar-1") | |||
defer os.Unsetenv("AWS_DEFAULT") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: AWS_DEFAULT_REGION
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops!
agent/config/config_windows_test.go
Outdated
@@ -64,10 +66,12 @@ func TestConfigDefault(t *testing.T) { | |||
} | |||
|
|||
func TestConfigIAMTaskRolesReserves80(t *testing.T) { | |||
os.Setenv("AWS_DEFAULT_REGION", "foo-bar-1") | |||
defer os.Unsetenv("AWS_DEFAULT") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops!
Yes, but in an earlier revision. I will re-run my tests with the latest changes. |
README.md
Outdated
| `ECS_ENABLE_TASK_ENI` | `false` | Whether to enable task networking for task to be launched with its own network interface | `false` | Not applicable | | ||
| `ECS_CNI_PLUGINS_PATH` | `/ecs/cni` | The path where the cni binary file is located | `/amazon-ecs-cni-plugins` | Not applicable | | ||
| `ECS_AWSVPC_BLOCK_IMDS` | `true` | Whether to block access to [Instance Metdata](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html) for Tasks started with `awsvpc` network mode | `false` | Not applicable | | ||
| `ECS_AWSVPC_ADDITIONAL_LOCAL_ROUTES` | `["10.0.15.0/24"]` | Additional routes that cause traffic to travel by the default ENI | `[]` | Not applicable | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although the name strongly implies it, the description should be explicit that this is used in awsvpc
mode.
It's also not entirely accurate to say that traffic to these routes goes out via the default ENI; it may never leave the host at all. I might rewrite the description as
In
awsvpc
network mode, traffic to these prefixes will be routed via the host bridge instead of the task ENI
or something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll use your description.
agent/config/config.go
Outdated
if additionalLocalRoutesEnv != "" { | ||
err := json.Unmarshal([]byte(additionalLocalRoutesEnv), &additionalLocalRoutes) | ||
if err != nil { | ||
seelog.Warnf("Invalid format for ECS_AWSVPC_ADDITIONAL_LOCAL_ROUTES, expected a json array of CIDRs: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is warn
the right severity here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're using warn
for all of them right now. I'd be happy to change all of them to error
if you'd like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we're using warn
for both non-fatal conditions (actually should be warnings) and for parsing errors (should be errors). I'll change the ones that actually cause the agent to exit to error
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that's pretty much what I was getting at. If NewConfig()
returns an error, agent panics. We should probably log the condition that cause the panic at an appropriately high severity.
cf49171
to
2bc5d66
Compare
@richardpen @nmeyerhans I believe I have addressed your comments, including additional manual testing. Can you take a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. I've got minor questions comments.
agent/config/config.go
Outdated
err = json.Unmarshal([]byte(instanceAttributesEnv), &instanceAttributes) | ||
if instanceAttributesEnv != "" { | ||
if err != nil { | ||
err := fmt.Errorf("Invalid format for ECS_INSTANCE_ATTRIBUTES. Expected a json hash") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we log/wrap the original error as well?
agent/ecscni/plugin_test.go
Outdated
@@ -26,7 +27,11 @@ func TestSetupNS(t *testing.T) { | |||
mockResult.EXPECT().String().Return(""), | |||
) | |||
|
|||
err := ecscniClient.SetupNS(&Config{}) | |||
additionalRoutesJson := `["169.254.172.1/32", "10.11.12.13/32"]` | |||
additionalRoutes := &[]cnitypes.IPNet{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this reads weird to me. Can it be just var AdditionalLocalRoutes []cnitypes.IPNet{}
here instead and you can unmarshal it as &additionalRoutes
?
agent/utils/utils.go
Outdated
|
||
retriableErr, isRetriableErr := err.(Retriable) | ||
|
||
if err == nil || isRetriableErr && !retriableErr.Retry() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use parenthesis here to illustrate/enforce the operator precedence? Otherwise, readers like me will always have to look up the operator precedence documentation for how this is evaluated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just short-circuiting evaluation; any time a left-condition evaluates false, nothing proceeds.
Are you asking me to change this to err == nil || (isRetriableErr && !retriableErr.Retry())
? That's logically equivalent, but (to me) harder to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you asking me to change this to
err == nil || (isRetriableErr && !retriableErr.Retry())
? That's logically equivalent, but (to me) harder to read.
Yeah, I prefer that. Because, then I wouldn't have to wonder if isRetriableErr
is associated with the ||
or &&
operator.
|
||
// OverrideAWSVPCLocalIPv4Address overrides the local IPv4 address chosen | ||
// for a task using the `awsvpc` networking mode. Using this configuration | ||
// will limit you to running one `awsvpc` task at a time. IPv4 addresses |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using this configuration will limit you to running one
awsvpc
task
Since we don't enforce this anywhere in the code (iiuc), may be we should add a TODO
so that we don't lose sight of this? Even an issue would be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want a TODO even if we don't have any short-term plans to change it? If you statically-assign an IP address, you'll always be limited to one or else you'll have IP address conflicts and things won't work right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, may be the better thing to do here is to update the description for the ECS_AWSVPC_ADDITIONAL_LOCAL_ROUTES
field in the README file with some of what you have here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ECS_AWSVPC_ADDITIONAL_LOCAL_ROUTES
won't limit you to running one awsvpc
task; it just adds routes. It's specifically the static IP assignment that is limiting here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, OverrideAWSVPCLocalIPv4Address
gets set just from the JSON config and nowhere else? If that's the case, it's probably fine to leave it be. I'm just looking for ways to highlight this behavior so that it doesn't lead to post start-task
failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly. The intent here is that this is meant to serve a very limited set of use-cases and is not really a generally useful feature, so it's excluded from the more-common way to configure the agent (environment variables).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
We're going to need an IPv6 equivalent of ECS_AWSVPC_ADDITIONAL_LOCAL_ROUTES at some point...
2bc5d66
to
4ebfefe
Compare
@aaithal I've addressed your comments, except for the one about |
Summary
This series of commits adds retries for failed pulls (compensating for transient network issues) and allows additional routes to be configured for tasks launched with the
awsvpc
network mode.Implementation details
Retries are implemented with the new
utils.RetryNWithBackoffCtx
(based on existingutils.RetryNWithBackoff
) inside thedockerGoClient
and excludes one expected non-transient error (missing IAM permissions to callecr:GetAuthorizationToken
). Other non-transient errors could be excluded from retries by extendingCannotPullContainerError
to implementutils.Retrier
, but has not been done yet (I have not assembled a list of expected non-transient error codes from Docker).Configurable routes were added by depending on the
types.IPNet
type fromgithub.com/containernetworking/cni/pkg/types
, which wrapsnet.IPNet
so it can be serialized and deserialized in CIDR notation.Testing
make release
)go build -out amazon-ecs-agent.exe ./agent
)make test
) passgo test -timeout=25s ./agent/...
) passmake run-integ-tests
) pass.\scripts\run-integ-tests.ps1
) passmake run-functional-tests
) pass.\scripts\run-functional-tests.ps1
) passRan in an environment with manually configured additional routes and with transient network failures.
New tests cover the changes: yes
Description for the changelog
Licensing
This contribution is under the terms of the Apache 2.0 License: yes