Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add restart tracker to ecs-agent module #4162

Merged
merged 1 commit into from
May 3, 2024
Merged

Conversation

timj-hh
Copy link
Contributor

@timj-hh timj-hh commented Apr 29, 2024

Summary

This change adds a new restart module to /api/container/. The restart tracker contains a configurable policy and some helper methods. There is an accompanying test file unit testing the tracker.

Implementation details

Restart Tracker

  • Implement methods RecordRestart and ShouldRestart to make a restart decision based on passed in container metadata
  • Write RecordRestart to iterate restart count.
  • Implement unit testing modifying each variable checked in ShouldRestart.
  • Integrates ContainerRestartPolicy as implemented in the ecsacs container module into the restart tracker

Testing

This component has not been plugged in yet, but all written tests pass.

New tests cover the changes: yes

Description for the changelog

Add restart module

Does this PR include breaking model changes? If so, Have you added transformation functions?
no

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@timj-hh timj-hh requested a review from a team as a code owner April 29, 2024 16:58
@timj-hh timj-hh changed the title Add restart tracker and restart policy to ecs-agent module Add restart tracker to ecs-agent module Apr 30, 2024

// RestartPolicy represents a policy that contains key information considered when
// deciding whether or not a container should be restarted after it has exited.
type RestartPolicy struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can we move this into a separate file ecs-agent/api/container/restart/policy.go for clearer separation, and consider renaming the files from this PR to ecs-agent/api/container/restart/tracker.go and ecs-agent/api/container/restart/tracker_test.go for more concise naming?

rt.lock.RLock()
defer rt.lock.RUnlock()
rt.RestartCount++
rt.LastStartedAt = startedAt
Copy link
Contributor

@danehlim danehlim May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the churn and not calling this out earlier:

Based off what was decided during internal design, this field should be named LastRestartAt (not LastStartedAt) and RecordRestart method should not take in the started at time for the container from the container runtime, right? Otherwise this seems to assume that the container runtime of the calling Agent always updates the started at time for the container when a container is restarted. Per our previous internal discussions, such assumptions about the container runtime of the calling Agent should not be made as the code should be agnostic of architectural/implementation details of any calling Agent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can rename the field, but the reason this is an argument to RecordRestart is to ensure we call this method with the correct time value. LastStartedAt will likely be passed into ShouldRestart after the first restart of a container, so we want to make sure that is set when the container would normally get to RUNNING, not necessarily the same time we would record a restart.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes that started at time of a container will be updated by the underlying container runtime of the calling Agent. Based off previous internal discussions and the fact this code should be agnostic of the calling Agent (and its underlying container runtime), is that a safe assumption to make?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed offline, removed time argument to RecordRestart in favor of time.Now and adjusted unit tests to round time checks to check within a millisecond of accuracy


func (rt *RestartTracker) GetLastStartedAt() time.Time {
rt.lock.RLock()
defer rt.lock.RUnlock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why do we have a lock here? Will rt be accessed by more than 1 thread at the same time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible, for agents that implement multiple asynchronous containers that independently restart we may get simultaneous access. I believe it is also common coding practice in ecs-agent

RestartAttemptPeriod: 60 * time.Second,
})
assert.Equal(t, 0, rt.RestartCount)
for i := 1; i < 1000; i++ {
Copy link
Contributor

@danehlim danehlim May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I don't think doing this 1000 times is necessary - if I'm not mistaken, this means that this loop would repeatedly execute across all test runs without a need to. Can we factor out this for loop?


// After restarting, we should inform restart decisions with LastRestartedAt instead of the passed in startedAt time.
rt.RecordRestart()
shouldRestart, reason = rt.ShouldRestart(&exitCode, time.Now().Add(-61*time.Second), apicontainerstatus.ContainerRunning)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Best practice would be to save time.Now().Add(-61*time.Second) in a startedAt variable, similar to what you have done for TestShouldRestart.

@timj-hh timj-hh merged commit 3b4c985 into aws:dev May 3, 2024
40 checks passed
@singholt singholt mentioned this pull request May 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants