
Error count "failed" taskrun for metric of Prometheus #3045

Closed
vincent-pli opened this issue Aug 3, 2020 · 7 comments · Fixed by #3068
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@vincent-pli
Member

Expected Behavior

The metric counts the correct number of failed TaskRuns.

Actual Behavior

The count is incorrect.

Steps to Reproduce the Problem

Create a TaskRun whose taskRef points to a Task that does not exist.
Check the tekton_taskrun_count{status="failed"} metric.

Additional Info

When such an exception happens,

taskMeta, taskSpec, err := resources.GetTaskData(ctx, tr, resolver.GetTask)
if err != nil {
    logger.Errorf("Failed to determine Task spec to use for taskrun %s: %v", tr.Name, err)
    tr.Status.MarkResourceFailed(podconvert.ReasonFailedResolution, err)
    return nil, nil, controller.NewPermanentError(err)
}

a PermanentError is raised and knative/pkg forgets the taskrun in the workqueue, which I think is what we expect.
But we also update the .status, which enqueues another taskrun object (the same one, but with a different .status), and it enters this branch directly:

if tr.IsDone() {
    logger.Infof("taskrun done : %s \n", tr.Name)

There it gets counted as tekton_taskrun_count{status="failed"}. If any error occurs in this section (if tr.IsDone()), a normal error is raised, which means the taskrun will be re-enqueued again and again based on this rate limiter implementation:
https://github.com/kubernetes/client-go/blob/00dbcca6ee44c678754d3f5fda1bd0e704b26fe2/util/workqueue/default_rate_limiters.go#L89
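
To make the retry behaviour concrete, here is a small standalone sketch (not Tekton code) of how such a rate-limited workqueue treats a key; the key value is made up:

package main

import (
    "fmt"

    "k8s.io/client-go/util/workqueue"
)

func main() {
    // The same kind of rate-limited queue that knative/pkg uses for reconcile keys.
    q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
    key := "default/taskrun-failure" // hypothetical namespace/name key

    // A "normal" reconcile error: the key is re-added with exponential backoff,
    // so the reconciler keeps seeing the same (already done) taskrun.
    q.AddRateLimited(key)
    fmt.Println("requeues so far:", q.NumRequeues(key)) // 1

    // A PermanentError: the caller calls Forget, which drops the backoff state
    // and stops the retries.
    q.Forget(key)
    fmt.Println("requeues after Forget:", q.NumRequeues(key)) // 0
}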

A candidate solution is to wrap every error in this section as a PermanentError so the workqueue forgets the taskrun.
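
A minimal standalone sketch of what that wrapping buys us, using the same knative.dev/pkg/controller helpers the reconciler already calls (the error message here is only an example):

package main

import (
    "errors"
    "fmt"

    "knative.dev/pkg/controller"
)

func main() {
    err := errors.New("resource name may not be empty")

    // Wrapping marks the error as permanent, so knative/pkg forgets the key
    // instead of re-enqueueing it.
    perm := controller.NewPermanentError(err)

    fmt.Println(controller.IsPermanentError(err))  // false: would be retried
    fmt.Println(controller.IsPermanentError(perm)) // true: workqueue forgets it
}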

@vincent-pli added the kind/bug label on Aug 3, 2020
@vincent-pli
Member Author

#3029
@pritidesai

@pritidesai
Member

pritidesai commented Aug 3, 2020

thanks @vincent-pli

Yup, once it enters tr.IsDone(), the controller should not reschedule it again.

I am debugging the metrics. When I ran a sample invalid TaskRun, the controller log shows taskrun done : <taskrun-name> 12 times, matching the default retry attempts, with each TaskRun instance having a different address, but the TaskRun count is set to 1.

taskrun done appearing 12 times means one attempt finishes and the taskrun gets scheduled again, whereas done should mean done after all possible attempts.

(stats.Measurement) {
 v: (float64) 1,
 m: (*stats.Float64Measure)(0xc000138a90)({
  desc: (*stats.measureDescriptor)(0xc000306dc0)({
   subs: (int32) 1,
   name: (string) (len=13) "taskrun_count",
   description: (string) (len=18) "number of taskruns",
   unit: (string) (len=1) "1"
  })
 }),
 desc: (*stats.measureDescriptor)(0xc000306dc0)({
  subs: (int32) 1,
  name: (string) (len=13) "taskrun_count",
  description: (string) (len=18) "number of taskruns",
  unit: (string) (len=1) "1"
 })
}

And each TaskRun instance has the exact same status:

Status: (v1beta1.TaskRunStatus) {
  Status: (v1beta1.Status) {
   ObservedGeneration: (int64) 0,
   Conditions: (v1beta1.Conditions) (len=1 cap=1) {
    (apis.Condition) {
     Type: (apis.ConditionType) (len=9) "Succeeded",
     Status: (v1.ConditionStatus) (len=5) "False",
     Severity: (apis.ConditionSeverity) "",
     LastTransitionTime: (apis.VolatileTime) {
      Inner: (v1.Time) 2020-08-03 23:19:43 +0000 UTC
     },
     Reason: (string) (len=23) "TaskRunResolutionFailed",
     Message: (string) (len=97) "error when listing tasks for taskRun taskrun-failure: tasks.tekton.dev \"does-not-exist\" not found"
    }
   },

I am not able to reproduce the metrics count yet, will give it one more try (I don't have Prometheus set up on my local cluster).

@pritidesai
Member

I am debugging this with:

apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: taskrun-failure
spec:
  taskRef:
    name: does-not-exist

@pritidesai
Member

pritidesai commented Aug 4, 2020

But like you mentioned, in the case of a PermanentError the controller should not try to exhaust the number of retries; it should just declare failure.

@vincent-pli
Member Author

@pritidesai , so you reproduced it?

I mean that in the if tr.IsDone() section, if any exception occurs, a normal error is raised rather than a permanent one, and that causes repeated reconciles.
Actually, with your case, this line will raise a "resource name may not be empty" exception, since tr.Status.PodName is empty:

pod, err := c.KubeClientSet.CoreV1().Pods(tr.Namespace).Get(tr.Status.PodName, metav1.GetOptions{})

That error is a normal error, which causes the issue.

@pritidesai
Member

/assign

@pritidesai
Member

pritidesai commented Aug 5, 2020

Yes, it raises an error with resource name may not be empty, which causes the reconciler to fail with Reconcile error. The failed reconcile then tries to exhaust the number of retries on such a failure instead of forgetting that taskRun.

resource name may not be empty is coming from the Go client. We need a check on tr.Status.PodName before calling Get with it.

After raising a permanent error from prepare or reconcile, the reconciler should not return any error from within tr.IsDone(), so that the controller can declare Reconcile succeeded and exit.
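
A rough sketch of that combination (a fragment only, not the actual reconciler code; it just shows guarding the pod lookup and logging instead of returning the error):

if tr.IsDone() {
    logger.Infof("taskrun done : %s \n", tr.Name)

    // A taskrun that failed resolution never got a pod, so PodName is empty;
    // calling Get with an empty name is what produces "resource name may not be empty".
    if tr.Status.PodName != "" {
        if pod, err := c.KubeClientSet.CoreV1().Pods(tr.Namespace).Get(tr.Status.PodName, metav1.GetOptions{}); err != nil {
            // Log instead of returning: a plain error here re-enqueues (and
            // re-counts) a taskrun that is already done.
            logger.Errorf("error getting pod %q for taskrun %s: %v", tr.Status.PodName, tr.Name, err)
        } else {
            _ = pod // cleanup/metrics would use the pod here
        }
    }
    return nil
}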

The following unit test also does not catch this: it verifies the permanent error but does not validate that the taskrun is not requeued again. Will add that test:

// When a TaskRun is invalid and can't run, we return a permanent error because
// a regular error will tell the Reconciler to keep trying to reconcile; instead we want to stop
// and forget about the Run.
if reconcileErr == nil {
    t.Fatalf("Expected to see error when reconciling invalid TaskRun but none")
}
if !controller.IsPermanentError(reconcileErr) {
    t.Fatalf("Expected to see a permanent error when reconciling invalid TaskRun, got %s instead", reconcileErr)
}
I have changes in place, will create a PR.
