Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PipelineRun fails too eagerly on missing resources #3378

Closed
mattmoor opened this issue Oct 13, 2020 · 6 comments · Fixed by #3385
Closed

PipelineRun fails too eagerly on missing resources #3378

mattmoor opened this issue Oct 13, 2020 · 6 comments · Fixed by #3385
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@mattmoor
Copy link
Member

Expected Behavior

PipelineRun only fails when referenced resources do not exist.

Actual Behavior

PipelineRun fails when referenced resources do not YET exist in it's informer cache at the time of reconciliation.

You can see the bug here as the lister is passed to fetch the resource, and on error it manifests as a permanent failure of the pipeline resource:

providedResources, err := resources.GetResourcesFromBindings(pr, c.resourceLister.PipelineResources(pr.Namespace).Get)
if err != nil {
// This Run has failed, so we need to mark it as failed and stop reconciling it
pr.Status.MarkFailed(ReasonCouldntGetResource,
"PipelineRun %s/%s can't be Run; it tries to bind Resources that don't exist: %s",
pipelineMeta.Namespace, pr.Name, err)
return controller.NewPermanentError(err)
}

Steps to Reproduce the Problem

This is an intermittent flake in the e2e testing, which manifests with a message like:

PipelineRun arendelle-8mqgj/git-check-pipeline-run can''t be Run; it tries to bind Resources that don''t exist: error following resource reference for git-repo: pipelineresource.tekton.dev "git-source-resource" not found

Additional Info

This is roughly at HEAD (I observed this downstream, but the issue is clearly upstream)

@mattmoor mattmoor added the kind/bug Categorizes issue or PR as related to a bug. label Oct 13, 2020
@mattmoor
Copy link
Member Author

As @jlpettersson pointed out in slack, they had to put the PipelineRun at the bottom here for the test to pass:
https://github.com/tektoncd/pipeline/blob/master/examples/v1beta1/pipelineruns/pipelinerun-using-different-subpaths-of-workspace.yaml

IMO a good signal that this is solid would be that this passes with the PipelineRun first. It's a decent simulation of informer lag, and aims for a better experience for users that might not want to deal with orchestrating multiple applies.

@mattmoor
Copy link
Member Author

@GregDritschler
Copy link
Contributor

This is very closely related to #2740. The pipelinerun and taskrun reconcilers use the client to get the Pipeline or Task respectively to avoid lister cache issues. There was a lot of good discussion in that issue. I'm not sure why it was closed.

@mattmoor
Copy link
Member Author

There is a lot of nuance, and honestly there are facets of the Tekton resource model that make this challenging (as I highlighted in that thread).

The conflict here is between:

  1. fast-feedback when a user fat-fingers a bad reference
  2. working in the presence of slow informer syncing

Using the client could solve both, but then puts things into conflict with scaling the system.

I think an appropriate solution here would be to give the informer caches a grace period before declaring missing references fatal. This could key off of metadata.creationTimestamp.

mattmoor added a commit to mattmoor/pipeline that referenced this issue Oct 14, 2020
As the PipelineRun reconciler executes, it resolves resources using the informer's lister cache.  Currently, when that cache is behind the pipeline run will immediately fail.  This change builds in a buffer of `resources.MinimumAge` and a helper `resources.IsYoung` that elide this check, returning the error to the controller framework to requeue the key for later processing (with backoff).

Fixes: tektoncd#3378
mattmoor added a commit to mattmoor/pipeline that referenced this issue Oct 14, 2020
As the PipelineRun reconciler executes, it resolves resources using the informer's lister cache.  Currently, when that cache is behind the pipeline run will immediately fail.  This change builds in a buffer of `resources.MinimumAge` and a helper `resources.IsYoung` that elide this check, returning the error to the controller framework to requeue the key for later processing (with backoff).

Fixes: tektoncd#3378
@mattmoor
Copy link
Member Author

I have a change, which will hopefully build in a grace period for this here: #3385

Let's see how the e2e tests respond to this 🤞

@LiYa-Xu
Copy link

LiYa-Xu commented Oct 19, 2020

Hi all ,I am not sure I met a same issue. The e2e test case fails on case TestTaskRunPipelineRunCancel, I got message like

CONT  TestTaskRunPipelineRunCancel/retries=1
  build_logs.go:34: Could not get logs for pod cancel-me-task-nzln8-pod-ddrn7: container "step-unnamed-0" in pod "cancel-me-task-nzln8-pod-ddrn7" is waiting to start: PodInitializing
    build_logs.go:36: build logs 

Then I checked the events from this e2e test namespces , seems the pod is still under pending status:

arendelle-hpcnl    19m   Normal    Created         pod/cancel-me-task-mwvc9-pod-qn26j                     Created container place-scripts
arendelle-hpcnl    19m   Normal    Started         pod/cancel-me-task-mwvc9-pod-qn26j                     Started container place-scripts
arendelle-hpcnl    19m   Normal    Started         taskrun/cancel-me-task-mwvc9                           
arendelle-hpcnl    19m   Normal    Pending         taskrun/cancel-me-task-mwvc9                           Pending
arendelle-hpcnl    19m   Normal    Pending         taskrun/cancel-me-task-mwvc9                           pod status "Initialized":"False"; message: "containers with incomplete status: [place-scripts place-tools]"
arendelle-hpcnl    19m   Warning   Failed          taskrun/cancel-me-task-mwvc9                           TaskRun "cancel-me-task-mwvc9" was cancelled
arendelle-hpcnl    19m   Normal    Started         pipelinerun/cancel-me                                  
arendelle-hpcnl    19m   Normal    Running         pipelinerun/cancel-me                                  Tasks Completed: 0 (Failed: 0, Cancelled 0), Incomplete: 1, Skipped: 0
arendelle-hpcnl    19m   Warning   Failed          pipelinerun/cancel-me                                  PipelineRun "cancel-me" was cancelled

Can you help me?

mattmoor added a commit to mattmoor/pipeline that referenced this issue Oct 20, 2020
As the PipelineRun reconciler executes, it resolves resources using the informer's lister cache.  Currently, when that cache is behind the pipeline run will immediately fail.  This change builds in a buffer of `resources.MinimumAge` and a helper `resources.IsYoung` that elide this check, returning the error to the controller framework to requeue the key for later processing (with backoff).

Fixes: tektoncd#3378
tekton-robot pushed a commit that referenced this issue Oct 20, 2020
As the PipelineRun reconciler executes, it resolves resources using the informer's lister cache.  Currently, when that cache is behind the pipeline run will immediately fail.  This change builds in a buffer of `resources.MinimumAge` and a helper `resources.IsYoung` that elide this check, returning the error to the controller framework to requeue the key for later processing (with backoff).

Fixes: #3378
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants