Aggregator doesn't properly handle errors when getting results #728

johnSchnake · 2019-05-24T14:52:57Z

What steps did you take and what happened:
Tests were run in a situation with poor network behavior and the logs have lots of "connection reset by peer" lines.

The worker logic uses a client that retries by default.

The aggregator accepts the first request and then rejects any further attempts because it has already "seen" those results. So what happens is that part of the data is sent, the connection gets reset, and the client tries again but gets turned away.

What did you expect to happen:
The server should accept new results in cases where errors were encountered.

Anything else you would like to add:
The logic for marking results as "seen" even if we get errors makes sense (as the code comments indicate): we don't want the server to disregard errored results, because the server may hang if the client never retries.

However, it seems like the logic needs to be tweaked so that we either:

- always allow the resending of data and take the newest values, or
- track which results we got with/without error and allow retries on the former
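
To make the second option concrete, here is a minimal sketch in Go. It is not the Sonobuoy implementation: `Tracker`, `Handle`, `Seen`, and `ErrDuplicate` are hypothetical names invented for illustration, and the `store` callback stands in for whatever actually persists the uploaded data. The idea is that a failed attempt is still recorded as "seen" (so the server does not wait forever on it), but only a successful attempt blocks later retries.

```go
package main

import (
	"errors"
	"sync"
)

// ErrDuplicate (hypothetical) is returned when a result for the same key
// has already been stored successfully.
var ErrDuplicate = errors.New("result already received")

// resultState records the outcome of the most recent attempt for a key.
// The zero value means "never seen".
type resultState int

const (
	stateFailed resultState = iota + 1 // seen, but the upload errored
	stateOK                            // seen and fully stored
)

// Tracker is a hypothetical stand-in for the aggregator's "seen" bookkeeping.
type Tracker struct {
	mu   sync.Mutex
	seen map[string]resultState
}

// NewTracker returns an empty Tracker.
func NewTracker() *Tracker {
	return &Tracker{seen: make(map[string]resultState)}
}

// Handle processes one upload attempt for key. The store callback persists
// the payload and returns an error if the transfer fails partway.
// For simplicity the lock is held while store runs, which serializes uploads.
func (t *Tracker) Handle(key string, store func() error) error {
	t.mu.Lock()
	defer t.mu.Unlock()

	// Only a previous success blocks a new attempt.
	if t.seen[key] == stateOK {
		return ErrDuplicate
	}
	if err := store(); err != nil {
		// Still mark the key as seen, but leave it open for a retry.
		t.seen[key] = stateFailed
		return err
	}
	t.seen[key] = stateOK
	return nil
}

// Seen reports whether any attempt, successful or not, was made for key.
func (t *Tracker) Seen(key string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	_, ok := t.seen[key]
	return ok
}
```

With this shape, a client whose first upload hits "connection reset by peer" can retry and be accepted, while a second upload after a clean success is still rejected with `ErrDuplicate`. The first option (always accept and keep the newest values) would amount to dropping the `stateOK` check entirely.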

Environment:

Sonobuoy version: v0.14.2

johnSchnake · 2019-05-24T14:54:35Z

Thanks to @rbankston for reporting this bug and providing the logs needed to diagnose it. (kubernetes/kubernetes#74839 may have caused the original connection reset issue, but it is apparently not a final, perfect fix and may run into other issues.)

johnSchnake · 2019-05-24T18:42:51Z

Added this to the current milestone as p0. We have the bandwidth to support it, and I consider this a pretty significant bug since (1) it is confusing to users, (2) it is not easy to debug if you are not very familiar with the system, and (3) it may be blocking to users: if this happens regularly for some reason, it can keep users from ever being able to get results from their plugins.

Working on this currently.

rbankston added the legacy/ZD2245 label May 24, 2019

johnSchnake added the kind/bug Behavior isn't as expected or intended label May 24, 2019

johnSchnake added this to the v0.14.3 milestone May 24, 2019

johnSchnake added the p0-highest label May 24, 2019

johnSchnake self-assigned this May 24, 2019

This was referenced May 24, 2019

Unify http/non-http result handling #729

Merged

Retry results #730

Merged

stevesloka closed this as completed in #730 Jun 5, 2019
