periodically disconnect from acs #3586

Merged
merged 2 commits into from
Feb 24, 2023

Conversation

Contributor

@singholt singholt commented Feb 22, 2023

Summary

This PR introduces a connection time property to the ACS session object. With this change, the agent will periodically disconnect from ACS (every 15-45 minutes).

Currently we rely on ACS to periodically disconnect from the agent, unless the agent stops receiving heartbeats from ACS: if the agent receives no heartbeats and sees no other activity for roughly 1-2 minutes, it closes its connection to ACS itself.

If an ACS host is unhealthy (i.e. it loses track of its active connections and no longer disconnects periodically) and we can no longer rely on heartbeats, the agent will be stuck with a stale connection. This may lead to task credentials not being refreshed or task payloads not being served to this container instance. With this change, we no longer have a single point of failure (i.e. an unhealthy ACS host), and we let the agent control its own fate.
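
As a rough sketch (not the agent's actual code), the randomized connection lifetime could be picked along these lines; the package, constants, and function name below are hypothetical:

```go
package acssession

import (
	"math/rand"
	"time"
)

const (
	// Assumed bounds for how long a single ACS connection is kept open.
	minConnectionTime = 15 * time.Minute
	maxConnectionTime = 45 * time.Minute
)

// connectionTime returns a random duration between minConnectionTime and
// maxConnectionTime so that agents across a fleet do not all reconnect at
// the same moment.
func connectionTime() time.Duration {
	jitter := time.Duration(rand.Int63n(int64(maxConnectionTime - minConnectionTime)))
	return minConnectionTime + jitter
}
```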

Implementation details

  • A new connectionTimer is started when the ACS websocket connection is established.
  • When this timer fires, the agent sends all pending acks to ACS and writes a close message on the connection using websocket control messages. The agent sends a message of type websocket.CloseMessage with status code NormalClosure.
  • The close message is sent to ACS, which echoes it back to the agent. The echoed close message is added to the message queue, which ensures that the agent reads and handles any unread messages first, before closing the connection. Reference
  • When the agent detects the close message, it returns an io.EOF error to the parent goroutine that started the ACS session, in line with how it returns io.EOF when ACS initiates the connection close. Reference
  • Upon io.EOF, the websocket connection is re-established immediately, without backoff (a sketch of this flow appears after this list).
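
For illustration, a minimal sketch of the timer-driven close described above, assuming the gorilla/websocket client. The session type, field names, write deadline, and the sendPendingAcks callback are hypothetical stand-ins for the agent's real code.

```go
package acssession

import (
	"time"

	"github.com/gorilla/websocket"
)

type session struct {
	conn *websocket.Conn
}

// startConnectionTimer arms a timer for the randomized connection lifetime.
// When it fires, the agent flushes any pending acks and writes a close
// control message; ACS echoes it back, and the read loop then tears the
// connection down.
func (s *session) startConnectionTimer(d time.Duration, sendPendingAcks func()) *time.Timer {
	return time.AfterFunc(d, func() {
		// Flush acks that have not been sent yet.
		sendPendingAcks()

		// Write a close control message with a normal-closure (1000) status code.
		msg := websocket.FormatCloseMessage(websocket.CloseNormalClosure,
			"ConnectionExpired: Reconnect to continue")
		s.conn.WriteControl(websocket.CloseMessage, msg, time.Now().Add(10*time.Second))
	})
}
```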

Testing

  • Updated unit tests.
  • Built a test AMI with these changes and launched a container instance that disconnects from ACS every minute (for testing purposes). The connection is re-established in under ~1 second.
  • Let this container instance run for about 2 days to monitor the agent under frequent disconnection.
  • Created an ECS service and ran a small test that increments the service's desired task count by 1 every 25 seconds, up to 50 tasks (a sketch of this test appears after this list).
  • Checked ECS agent logs to verify that the disconnection workflow works as expected and that all tasks get served as requested.
level=info msg="Closing ACS websocket connection after 1 minutes" module=acs_handler.go
level=debug msg="Connection closed for a valid reason: websocket: close 1000 (normal): ConnectionExpired: Reconnect to continue" module=client.go
level=info msg="ACS Websocket connection closed for a valid reason" module=acs_handler.go
level=info msg="ACS Websocket connection closed for a valid reason: EOF" module=acs_handler.go
level=debug msg="Received connect to ACS message" module=acs_handler.go
level=info msg="Establishing a Websocket connection ..."
level=debug msg="Established a Websocket connection to ..."
level=info msg="Connected to ACS endpoint"
  • Checked ACS logs and verified the sequence of operations that is executed when the agent closes the connection.
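
A rough sketch of the scaling test mentioned above, assuming the AWS SDK for Go (v1); the cluster and service names are placeholders, and the real test may have been scripted differently:

```go
package main

import (
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecs"
)

func main() {
	svc := ecs.New(session.Must(session.NewSession()))

	// Increment the service's desired count by 1 every 25 seconds, up to 50 tasks.
	for desired := int64(1); desired <= 50; desired++ {
		_, err := svc.UpdateService(&ecs.UpdateServiceInput{
			Cluster:      aws.String("test-cluster"),
			Service:      aws.String("test-service"),
			DesiredCount: aws.Int64(desired),
		})
		if err != nil {
			log.Fatalf("failed to update desired count to %d: %v", desired, err)
		}
		time.Sleep(25 * time.Second)
	}
}
```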

New tests cover the changes: yes

Description for the changelog

Enhancement: periodically disconnect from ACS

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@singholt singholt changed the title [wip] periodically disconnect from acs periodically disconnect from acs Feb 23, 2023
@singholt singholt marked this pull request as ready for review February 23, 2023 17:26
@singholt singholt requested a review from a team as a code owner February 23, 2023 17:26
ubhattacharjya previously approved these changes Feb 23, 2023
Contributor

yinyic commented Feb 23, 2023

The close message is sent to ACS, which echoes it back to the agent. The echoed close message is added to the message queue, which ensures that the agent reads and handles any unread messages first, before closing the connection. Reference

That's rather desirable! Curious if we were able to verify this behavior? (Agent sent close message, handled another payload message, and closed connection upon receiving close echo from ACS)

Contributor Author

singholt commented Feb 23, 2023

The close message is sent to ACS, which echoes it back to the agent. The echoed close message is added to the message queue, which ensures that the agent reads and handles any unread messages first, before closing the connection. Reference

That's rather desirable! Curious if we were able to verify this behavior? (Agent sent close message, handled another payload message, and closed connection upon receiving close echo from ACS)

By queuing up the close message, we don't delay processing it; rather, we ensure that the agent first reads any other payload messages that are already pending.

As soon as the close message is read, the agent processes it and terminates/re-establishes the connection. Makes sense?
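
For context, a minimal sketch (again assuming the gorilla/websocket client) of how a read loop can surface the echoed close message as io.EOF once the messages queued ahead of it have been read; the interface and function names are illustrative, not the agent's actual identifiers.

```go
package acssession

import (
	"io"

	"github.com/gorilla/websocket"
)

// acsConn is the subset of the websocket connection used by the read loop.
type acsConn interface {
	ReadMessage() (messageType int, payload []byte, err error)
}

// readMessage reads the next queued message. A normal-closure close frame
// (including the one ACS echoes back after the agent's own close) is mapped
// to io.EOF so the parent goroutine tears down and reconnects immediately.
func readMessage(conn acsConn) ([]byte, error) {
	_, payload, err := conn.ReadMessage()
	if websocket.IsCloseError(err, websocket.CloseNormalClosure) {
		return nil, io.EOF
	}
	return payload, err
}
```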

Contributor

@YashdalfTheGray YashdalfTheGray left a comment


Thank you for the context and the write up on the testing. Looks good.

Contributor

@YashdalfTheGray YashdalfTheGray left a comment


Rereviewing after rebase and added tests.
