Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregator doesnt properly handle errors when getting results #728

Closed
johnSchnake opened this issue May 24, 2019 · 2 comments · Fixed by #730
Closed

Aggregator doesnt properly handle errors when getting results #728

johnSchnake opened this issue May 24, 2019 · 2 comments · Fixed by #730
Assignees
Labels
kind/bug Behavior isn't as expected or intended legacy/ZD2245 p0-highest
Milestone

Comments

@johnSchnake
Copy link
Contributor

What steps did you take and what happened:
Tests were run in a situation with poor network behavior and the logs have lots of "connection reset by peer" lines.

The worker logic uses a client that uses retries by default

The aggregator will accept the first request and then blacklist any more attempts because it has "seen" those results. So what is happening is part of the data is being sent, the connection gets reset, the client tries again but gets turned away.

What did you expect to happen:
The server should accept new results in cases where errors were encountered.

Anything else you would like to add:
The logic for marking the results as "seen" even if we get errors makes sense (as the code comments indicate) because we don't want the server to disregard error'd results as it may make the server hang if the client doesn't retry again.

However, it seems like the logic needs to be tweaked so that we either:

  • always allow the resending of data and take the newest values
  • track which results we got with/without error and allow retries on the former

Environment:

  • Sonobuoy version: v0.14.2
@johnSchnake
Copy link
Contributor Author

Thanks to @rbankston for reporting this bug and providing the necessary logs to diagnose it. (kubernetes/kubernetes#74839 may have caused the original connection reset issue but it is not apparently a final, perfect fix and may run into other issues)

@johnSchnake johnSchnake added the kind/bug Behavior isn't as expected or intended label May 24, 2019
@johnSchnake johnSchnake added this to the v0.14.3 milestone May 24, 2019
@johnSchnake
Copy link
Contributor Author

Added this to the current milestone as p0. We have the bandwidth to support and I consider this a pretty significant bug since (1) it is confusing to users (2) it is not easy to debug if you are not very familiar with the system (3) it may be blocking to users: if this happens regularly for some reason it can keep users from ever being able to get results from their plugins.

Working on this currently.

@johnSchnake johnSchnake self-assigned this May 24, 2019
This was referenced May 24, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Behavior isn't as expected or intended legacy/ZD2245 p0-highest
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants