
Error count "failed" taskrun for metric of Prometheus #3045

Closed
vincent-pli opened this issue Aug 3, 2020 · 7 comments · Fixed by #3068
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@vincent-pli
Member

Expected Behavior

The metric counts the correct number of failed TaskRuns.

Actual Behavior

The count is incorrect.

Steps to Reproduce the Problem

Create a TaskRun whose taskRef points to a Task that does not exist.
Check the tekton_taskrun_count{status="failed"} metric.

Additional Info

When such an exception happens,

taskMeta, taskSpec, err := resources.GetTaskData(ctx, tr, resolver.GetTask)
if err != nil {
    logger.Errorf("Failed to determine Task spec to use for taskrun %s: %v", tr.Name, err)
    tr.Status.MarkResourceFailed(podconvert.ReasonFailedResolution, err)
    return nil, nil, controller.NewPermanentError(err)
}

a PermanentError is raised and knative/pkg forgets the taskrun in the workqueue, which I think is what we expect.
But we also update the .status, which enqueues another taskrun object (the same one, but with a different .status), and it enters this branch directly:

if tr.IsDone() {
    logger.Infof("taskrun done : %s \n", tr.Name)

There it gets counted as tekton_taskrun_count{status="failed"}. If any error occurs in this section (if tr.IsDone()), a normal error is raised, which means the taskrun will be re-enqueued again and again based on this rate limiter implementation:
https://github.com/kubernetes/client-go/blob/00dbcca6ee44c678754d3f5fda1bd0e704b26fe2/util/workqueue/default_rate_limiters.go#L89
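
To make the retry behaviour concrete, here is a small standalone sketch (not Tekton code) of how such a rate-limited workqueue treats a key; the key value is made up:

package main

import (
    "fmt"

    "k8s.io/client-go/util/workqueue"
)

func main() {
    // The same kind of rate-limited queue that knative/pkg uses for reconcile keys.
    q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
    key := "default/taskrun-failure" // hypothetical namespace/name key

    // A "normal" reconcile error: the key is re-added with exponential backoff,
    // so the reconciler keeps seeing the same (already done) taskrun.
    q.AddRateLimited(key)
    fmt.Println("requeues so far:", q.NumRequeues(key)) // 1

    // A PermanentError: the caller calls Forget, which drops the backoff state
    // and stops the retries.
    q.Forget(key)
    fmt.Println("requeues after Forget:", q.NumRequeues(key)) // 0
}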

A candidate solution is to wrap every error in this section as a PermanentError so the workqueue forgets the taskrun.
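
A minimal standalone sketch of what that wrapping buys us, using the same knative.dev/pkg/controller helpers the reconciler already calls (the error message here is only an example):

package main

import (
    "errors"
    "fmt"

    "knative.dev/pkg/controller"
)

func main() {
    err := errors.New("resource name may not be empty")

    // Wrapping marks the error as permanent, so knative/pkg forgets the key
    // instead of re-enqueueing it.
    perm := controller.NewPermanentError(err)

    fmt.Println(controller.IsPermanentError(err))  // false: would be retried
    fmt.Println(controller.IsPermanentError(perm)) // true: workqueue forgets it
}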

@vincent-pli added the kind/bug label on Aug 3, 2020
@vincent-pli
Member Author

#3029
@pritidesai

@pritidesai
Member

pritidesai commented Aug 3, 2020

thanks @vincent-pli

Yup, once it enters tr.IsDone(), the controller should not reschedule it again.

I am debugging the metrics. When I ran a sample invalid TaskRun, the controller log shows taskrun done : <taskrun-name> 12 times, matching the default retry attempts, with each TaskRun instance having a different address, but the TaskRun count is set to 1.

taskrun done appearing 12 times means one attempt finishes and the taskrun gets scheduled again, whereas done should mean done after all possible attempts.

(stats.Measurement) {
 v: (float64) 1,
 m: (*stats.Float64Measure)(0xc000138a90)({
  desc: (*stats.measureDescriptor)(0xc000306dc0)({
   subs: (int32) 1,
   name: (string) (len=13) "taskrun_count",
   description: (string) (len=18) "number of taskruns",
   unit: (string) (len=1) "1"
  })
 }),
 desc: (*stats.measureDescriptor)(0xc000306dc0)({
  subs: (int32) 1,
  name: (string) (len=13) "taskrun_count",
  description: (string) (len=18) "number of taskruns",
  unit: (string) (len=1) "1"
 })
}

And each TaskRun instance has the exact same status:

Status: (v1beta1.TaskRunStatus) {
  Status: (v1beta1.Status) {
   ObservedGeneration: (int64) 0,
   Conditions: (v1beta1.Conditions) (len=1 cap=1) {
    (apis.Condition) {
     Type: (apis.ConditionType) (len=9) "Succeeded",
     Status: (v1.ConditionStatus) (len=5) "False",
     Severity: (apis.ConditionSeverity) "",
     LastTransitionTime: (apis.VolatileTime) {
      Inner: (v1.Time) 2020-08-03 23:19:43 +0000 UTC
     },
     Reason: (string) (len=23) "TaskRunResolutionFailed",
     Message: (string) (len=97) "error when listing tasks for taskRun taskrun-failure: tasks.tekton.dev \"does-not-exist\" not found"
    }
   },

I am not able to reproduce the metrics count yet, will give it one more try (I don't have Prometheus set up on my local cluster).

@pritidesai
Member

I am debugging this with:

apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: taskrun-failure
spec:
  taskRef:
    name: does-not-exist

@pritidesai
Member

pritidesai commented Aug 4, 2020

But like you mentioned, in the case of a PermanentError the controller should not try to exhaust the number of retries; it should just declare failure.

@vincent-pli
Member Author

@pritidesai , so you reproduced it?

I mean that in the if tr.IsDone() section, if any exception occurs, a normal error is raised rather than a permanent one, and that causes repeated reconciles.
Actually, with your case, this line will raise a "resource name may not be empty" exception, since tr.Status.PodName is empty:

pod, err := c.KubeClientSet.CoreV1().Pods(tr.Namespace).Get(tr.Status.PodName, metav1.GetOptions{})

That error is a normal error, which causes the issue.

@pritidesai
Member

/assign

@pritidesai
Member

pritidesai commented Aug 5, 2020

Yes, it raises an error with resource name may not be empty, which causes the reconciler to fail with Reconcile error. The failed reconcile then tries to exhaust the number of retries on such a failure instead of forgetting that taskRun.

resource name may not be empty is coming from the Go client. We need a check on tr.Status.PodName before calling Get with it.

After raising a permanent error from prepare or reconcile, the reconciler should not return any error from within tr.IsDone(), so that the controller can declare Reconcile succeeded and exit.
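
A rough sketch of that combination (a fragment only, not the actual reconciler code; it just shows guarding the pod lookup and logging instead of returning the error):

if tr.IsDone() {
    logger.Infof("taskrun done : %s \n", tr.Name)

    // A taskrun that failed resolution never got a pod, so PodName is empty;
    // calling Get with an empty name is what produces "resource name may not be empty".
    if tr.Status.PodName != "" {
        if pod, err := c.KubeClientSet.CoreV1().Pods(tr.Namespace).Get(tr.Status.PodName, metav1.GetOptions{}); err != nil {
            // Log instead of returning: a plain error here re-enqueues (and
            // re-counts) a taskrun that is already done.
            logger.Errorf("error getting pod %q for taskrun %s: %v", tr.Status.PodName, tr.Name, err)
        } else {
            _ = pod // cleanup/metrics would use the pod here
        }
    }
    return nil
}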

The following unit test also does not catch this: it verifies the permanent error but does not validate that the taskrun is not requeued again. Will add that test:

// When a TaskRun is invalid and can't run, we return a permanent error because
// a regular error will tell the Reconciler to keep trying to reconcile; instead we want to stop
// and forget about the Run.
if reconcileErr == nil {
    t.Fatalf("Expected to see error when reconciling invalid TaskRun but none")
}
if !controller.IsPermanentError(reconcileErr) {
    t.Fatalf("Expected to see a permanent error when reconciling invalid TaskRun, got %s instead", reconcileErr)
}
I have changes in place, will create a PR.
