Websocket errors starting 1.17 #1235

Closed
roman-vynar opened this issue Feb 15, 2018 · 12 comments

@roman-vynar

Summary

Since upgrading to 1.17, we constantly see websocket errors in the logs.

Environment Details

Docker: 17.09.0-ce
Operating System: Ubuntu 16.04.3 LTS
Amazon ECS Agent: v1.17.0 (761937f)

Supporting Log Snippets

2018-02-15T11:01:25Z [ERROR] Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1] 
2018-02-15T11:01:25Z [INFO] Error from tcs; backing off: websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel
2018-02-15T11:03:28Z [ERROR] Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1] 
2018-02-15T11:03:28Z [INFO] Error from tcs; backing off: websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel
2018-02-15T11:05:24Z [ERROR] Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1] 
2018-02-15T11:05:24Z [INFO] Error from tcs; backing off: websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel

Every 2 minutes.
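
The two-minute cadence can be checked against the timestamps in the snippet above. This is a minimal Python sketch, assuming the log format shown here (a UTC timestamp as the first field); `close_intervals` is a made-up helper name, not part of any ECS tooling, and the sample messages are shortened.

```python
from datetime import datetime

# Substring that identifies the idle-close error in the agent log lines above.
MARKER = "Error getting message from ws backend"

def close_intervals(log_text):
    """Return the seconds between consecutive idle-close [ERROR] lines."""
    stamps = []
    for line in log_text.splitlines():
        if MARKER not in line:
            continue
        # The first whitespace-separated field is the timestamp,
        # e.g. 2018-02-15T11:01:25Z
        raw = line.split()[0]
        stamps.append(datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ"))
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]

# Timestamps copied from the snippet above; message text shortened.
sample = """\
2018-02-15T11:01:25Z [ERROR] Error getting message from ws backend: error: [...]
2018-02-15T11:01:25Z [INFO] Error from tcs; backing off: [...]
2018-02-15T11:03:28Z [ERROR] Error getting message from ws backend: error: [...]
2018-02-15T11:05:24Z [ERROR] Error getting message from ws backend: error: [...]
"""
print(close_intervals(sample))  # [123, 116]
```

Both gaps are close to 120 seconds, which matches the "every 2 minutes" observation.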

@adnxn
Contributor

adnxn commented Feb 15, 2018

@roman-vynar, the websocket 1002 error code indicates that an endpoint is terminating the connection due to a protocol error. I'm not able to reproduce this error on my end and some more logging data would be very helpful for debugging this behavior. Would you be able to send me logs from the problematic instance? If this is easily reproducible for you, I'd also suggest running the agent with ECS_LOGLEVEL=debug.

You can capture logs using our log collection tool and email them to me directly at adnkha at amazon dot com. If you end up sending the logs to my email, please update this issue so I don't miss them. Thanks!

Also fwiw, since this appears to be the tcs connection it shouldn't be affecting the task life cycles and scheduling. The only expected side effect should be erroneous metrics.

Also - Are you using a proxy with this instance?

@roman-vynar
Author

Hi @adnxn Thanks for looking into this! I have emailed you more details.

We don't have a proxy in front of ecs-agent.

@roman-vynar
Author

The reason was ECS_DISABLE_METRICS=true.
I have removed this option to make those errors disappear.

@roman-vynar
Author

Sorry, I decided to follow up on this.

Since we don't use CloudWatch metrics, we keep ECS_DISABLE_METRICS=true.
Another reason for ECS_DISABLE_METRICS=true is the potential load spike described in #588.

We are still getting those errors on v1.17.3:

May 8 12:48:50 ecs-10-0-7-184.*** ecs-agent ERROR Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]
May 8 12:49:06 ecs-10-0-7-69.*** ecs-agent ERROR Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]
May 8 12:49:17 ecs-10-0-7-140.*** ecs-agent ERROR Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]
May 8 12:49:28 ecs-10-0-7-220.*** ecs-agent ERROR Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]
May 8 12:49:51 ecs-10-0-7-230.*** ecs-agent ERROR Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]

Is there any relation between those errors and the fact that CloudWatch metrics are disabled?

Thanks.

@richardpen

@roman-vynar

> Is there any relation between those errors and the fact that CloudWatch metrics are disabled?

Yes, those errors occur because there is no activity on the connection. This connection is used to publish resource usage metrics and container health metrics. So, if ECS_DISABLE_METRICS=true is set and no containers are using the container health check feature, nothing is sent and the connection is closed periodically.
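
The condition described above can be written down as a small predicate. This is only a sketch of the reasoning in this comment, not agent code; `tcs_connection_idle` and its parameters are hypothetical names.

```python
def tcs_connection_idle(disable_metrics, containers_with_health_check):
    """Return True when nothing is left to publish over the tcs connection.

    Per the explanation above, the connection carries two kinds of traffic:
    resource usage metrics and container health metrics.
    """
    publishes_usage_metrics = not disable_metrics
    publishes_health_metrics = containers_with_health_check > 0
    return not (publishes_usage_metrics or publishes_health_metrics)

# ECS_DISABLE_METRICS=true and no container health checks: the connection is
# idle, so the backend closes the channel and the agent logs the 1002 error.
print(tcs_connection_idle(True, 0))   # True
print(tcs_connection_idle(True, 2))   # False
print(tcs_connection_idle(False, 0))  # False
```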

@roman-vynar
Author

Thanks @richardpen !
Is it possible to suppress such errors since it is obvious there is no activity when metrics are disabled?

@richardpen

Unfortunately, there is currently no way to suppress this error. I'll mark this as a bug, and we will make a change so that both the resource usage metrics and the container health metrics can be disabled. We will update this issue when we have progress.

thanks,
Peng

@rhuddleston

Any update on this? I see it's marked "more info needed", do you need more information?

@rhuddleston

Can we remove "more info needed" from this issue?

@roman-vynar
Author

What more info is needed? :)

@shubham2892
Contributor

We are looking into this issue.

@shubham2892 shubham2892 pinned this issue Jan 15, 2019
@shubham2892 shubham2892 unpinned this issue Jan 15, 2019
@shubham2892
Contributor

We have introduced the flag ECS_DISABLE_DOCKER_HEALTH_CHECK in Agent version 1.25.0. ECS_DISABLE_DOCKER_HEALTH_CHECK disables the Docker container health check.
After updating to 1.25.0, setting that flag to true along with setting ECS_DISABLE_METRICS to true will suppress the websocket errors in the logs.

Closing the issue, please re-open if you see the issue with the flag set or have any more questions.
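
As a rough way to check an instance's settings against the comment above, the sketch below parses env-file style text (`KEY=value` per line) and reports whether both flags are set to true. Where the agent's environment is defined depends on how the agent is launched, so the text is passed in directly. `parse_env` and `idle_errors_suppressed` are hypothetical helpers, the agent version (1.25.0 or later) is not checked, and treating only the literal value "true" as enabled is an assumption.

```python
REQUIRED_FLAGS = ("ECS_DISABLE_METRICS", "ECS_DISABLE_DOCKER_HEALTH_CHECK")

def parse_env(text):
    """Parse KEY=value lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env

def idle_errors_suppressed(text):
    """Return True if both flags named in the comment above are set to true.

    Assumption: only the literal value "true" (any case) counts as enabled.
    """
    env = parse_env(text)
    return all(env.get(flag, "").lower() == "true" for flag in REQUIRED_FLAGS)

both = "ECS_DISABLE_METRICS=true\nECS_DISABLE_DOCKER_HEALTH_CHECK=true\n"
only_metrics = "ECS_DISABLE_METRICS=true\n"
print(idle_errors_suppressed(both))          # True
print(idle_errors_suppressed(only_metrics))  # False
```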
