
Properly report error when a plugin fails to run #720

Merged: 1 commit merged into vmware-tanzu:master on May 21, 2019
Conversation

johnSchnake
Contributor

What this PR does / why we need it:
Currently, the aggregator's logic short-circuits when a
plugin fails to Run. It exits abruptly and neither launches
the remaining plugins nor reports the failure on the plugin's
status.

This commit changes the behavior as follows:

  • all the auth certs needed are created before launching
    any plugins, so we can't fail to generate a cert after
    other plugins are already running.
  • when a plugin fails to Run, we send a failing result so
    that the aggregator properly reports the status of the
    plugin as an error and the results tarball shows the error
    as expected.

Which issue(s) this PR fixes:
Fixes #559

Special notes for your reviewer:
There is no good test for the entire logic here, but I can attach screenshots from when I modified the code to force an error.

Release note:

The aggregator server will now report a plugin as failed if it can't properly be started and will continue starting other plugins.

@johnSchnake johnSchnake requested a review from stevesloka May 15, 2019 18:00
@@ -150,14 +152,22 @@ func Run(client kubernetes.Interface, plugins []plugin.Interface, cfg plugin.Agg
}()

// 4. Launch each plugin, to dispatch workers which submit the results back
certs := map[string]*tls.Certificate{}
johnSchnake (Contributor, Author):

This is a weird diff, but the point was that the loop over plugins did 2 things:

  • create cert
  • run plugin

If we can't create certs, I think that's a completely different type of error condition, and I'm OK with maintaining the existing behavior where the server does a hard return there. I pulled cert creation into its own loop to more clearly separate the two operations, so that we can't end up with some plugins already started, fail to generate the next cert, and then return from the function without completing the work to ingest results, update status, etc.

@@ -64,7 +64,7 @@ func NewPlugin(dfn plugin.Definition, namespace, sonobuoyImage, imagePullPolicy,
// a Job only launches one pod, only one result type is expected.
func (p *Plugin) ExpectedResults(nodes []v1.Node) []plugin.ExpectedResult {
return []plugin.ExpectedResult{
plugin.ExpectedResult{ResultType: p.GetResultType()},
{ResultType: p.GetResultType()},
johnSchnake (Contributor, Author):

TIL that this is a gofmt simplification. I thought you had to put the type there.

err = errors.Wrapf(err, "error running plugin %v", p.GetName())
logrus.Error(err)
monitorCh <- utils.MakeErrorResult(p.GetResultType(), map[string]interface{}{"error": err.Error()}, "")
continue
johnSchnake (Contributor, Author):

These 4 lines are the real change:

  • wrap the error with context
  • log it
  • send the error result (taken from the Monitor code)
  • continue with the next plugin and the rest of the aggregator operations.

@codecov-io

Codecov Report

Merging #720 into master will decrease coverage by 0.19%.
The diff coverage is 0%.


@@            Coverage Diff            @@
##           master     #720     +/-   ##
=========================================
- Coverage   39.36%   39.16%   -0.2%     
=========================================
  Files          68       68             
  Lines        3821     3827      +6     
=========================================
- Hits         1504     1499      -5     
- Misses       2220     2229      +9     
- Partials       97       99      +2
Impacted Files Coverage Δ
pkg/plugin/aggregation/run.go 0% <0%> (ø) ⬆️
pkg/plugin/driver/job/job.go 21.51% <0%> (ø) ⬆️
pkg/plugin/aggregation/aggregator.go 67.01% <0%> (-5.16%) ⬇️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9a19511...c2553ab. Read the comment docs.

@johnSchnake
Contributor Author

(Screenshots attached: Screen Shot 2019-05-15 at 1:39:43 PM, 1:40:26 PM, and 1:41:44 PM, showing the forced-error behavior.)

Signed-off-by: John Schnake <[email protected]>
@johnSchnake johnSchnake merged commit e9aba66 into vmware-tanzu:master May 21, 2019
@johnSchnake johnSchnake deleted the failedPluginRun branch May 21, 2019 20:14
Successfully merging this pull request may close these issues.

Failing to launch a plugin shouldn't exit the aggregator server