Do not fail jobs if we can't query the API server while running
The aggregator server runs a monitoring routine for each plugin. Part
of that routine polls the API server to check the pod status. Currently,
a job plugin aborts and is marked as failed if that API query fails.

This differs from the daemonset plugins, which simply tolerate such
errors.

This change makes the job plugins tolerant in the same way. We know
that the API server was working at some point, since the aggregator
pod is already running. Many transient errors can occur; the API
server could even go down and be brought back up. None of that should
cause the plugin to fail hard.

Fixes #1043

Signed-off-by: John Schnake <[email protected]>
johnSchnake committed Dec 14, 2019
1 parent e0d3921 commit 8f2750f
Showing 3 changed files with 8 additions and 9 deletions.
1 change: 0 additions & 1 deletion pkg/plugin/driver/daemonset/daemonset.go
@@ -227,7 +227,6 @@ func (p *Plugin) listOptions() metav1.ListOptions {

 // findDaemonSet gets the daemonset that we created, using a kubernetes label search.
 func (p *Plugin) findDaemonSet(kubeclient kubernetes.Interface) (*appsv1.DaemonSet, error) {
-	// TODO(EKF): Move to v1 in 1.11
 	dsets, err := kubeclient.AppsV1().DaemonSets(p.Namespace).List(p.listOptions())
 	if err != nil {
 		return nil, errors.WithStack(err)
5 changes: 3 additions & 2 deletions pkg/plugin/driver/job/job.go
@@ -208,10 +208,11 @@ func (p *Plugin) monitorOnce(kubeclient kubernetes.Interface, _ []v1.Node) (done
 		return true, nil
 	}

-	// Make sure there's a pod
+	// Make sure there's a pod; dont fail the pod if there are issues querying the API server.
 	pod, err := p.findPod(kubeclient)
 	if err != nil {
-		return true, utils.MakeErrorResult(p.GetName(), map[string]interface{}{"error": err.Error()}, plugin.GlobalResult)
+		errlog.LogError(errors.Wrapf(err, "could not find pod created by plugin %v, will retry", p.GetName()))
+		return false, nil
 	}

 	// Make sure the pod isn't failing
11 changes: 5 additions & 6 deletions pkg/plugin/driver/job/job_test.go
@@ -377,12 +377,11 @@ func TestMonitorOnce(t *testing.T) {
 			expectDone: true,
 			job: &Plugin{driver.Base{CleanedUp: true}},
 		}, {
-			desc: "Missing pod results in error",
-			job: &Plugin{},
-			podOnServer: nil,
-			errFromServer: errors.New("forcedError"),
-			expectErrResultMsg: "forcedError",
-			expectDone: true,
+			desc: "Server error results in error being ignored",
+			job: &Plugin{},
+			podOnServer: nil,
+			errFromServer: errors.New("forcedError"),
+			expectDone: false,
 		}, {
 			desc: "Failing pod results in error",
 			job: &Plugin{},
