Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance BuildRun reconciles failure scenarios #641

Merged
merged 11 commits into from
Mar 22, 2021

Conversation

qu1queee
Copy link
Contributor

@qu1queee qu1queee commented Mar 4, 2021

Changes

Fixes #558

This PR addresses the following:

  • [1] Prior to a TaskRun resource generation, we need to have a well defined pattern for marking a BuildRun as FAILED, update its Status.Conditions and to stop any further reconciliation (also known as one-shot approach). For errors that might appear as failures when doing system calls, we will allow another reconciliation.
  • [2] Refactor of some of the existing functions when running preparations prior to the TaskRun object generation, e.g. retrieval of the strategy, service account or service account generation.
  • [3] Add more test coverage, mainly unit for [1] and [2]

For the new custom checks that might lead to a failed BuildRun, we have the following states:

Before:

Status Reason CompletionTime is set Description
Unknown Pending No The BuildRun is waiting on a Pod in status Pending.
Unknown Running No The BuildRun has been validate and started to perform its work.
True Succeeded Yes The BuildRun Pod is done.
False Failed Yes The BuildRun failed in one of the steps.
False BuildRunTimeout Yes The BuildRun timed out.

After:

Status Reason CompletionTime is set Description
Unknown Pending No The BuildRun is waiting on a Pod in status Pending.
Unknown Running No The BuildRun has been validate and started to perform its work.
True Succeeded Yes The BuildRun Pod is done.
False Failed Yes The BuildRun failed in one of the steps.
False BuildRunTimeout Yes The BuildRun timed out.
False UnknownStrategyKind Yes The Build specified strategy Kind is unknown. (options: ClusterBuildStrategy or BuildStrategy)
False ClusterBuildStrategyNotFound Yes The referenced cluster strategy was not found in the cluster.
False BuildStrategyNotFound Yes The referenced namespaced strategy was not found in the cluster.
False SetOwnerReferenceFailed Yes Setting ownerreferences from the BuildRun to the related TaskRun failed.
False TaskRunIsMissing Yes The BuildRun related TaskRun was not found.
False TaskRunGenerationFailed Yes The generation of a TaskRun spec failed.
False ServiceAccountNotFound Yes The referenced service account was not found in the cluster.
False BuildRegistrationFailed Yes The related Build in the BuildRun is on a Failed state.
False BuildNotFound Yes The related Build in the BuildRun was not found.

Note: You can follow this PR by checking one commit at a time, it will simplify the review.

Submitter Checklist

  • Includes tests if functionality changed/was added
  • Includes docs if changes are user-facing
  • Set a kind label on this PR
  • Release notes block has been filled in, or marked NONE

See the contributor guide
for details on coding conventions, github and prow interactions, and the code review process.

Release Notes

Introduce new Failed Reasons for BuildRun `Status.Conditions` and enhance the scenarios when BuildRuns are marked as Failed.

@qu1queee qu1queee added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/enhancement labels Mar 4, 2021
@openshift-ci-robot openshift-ci-robot added release-note Label for when a PR has specified a release note and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Mar 4, 2021
@qu1queee qu1queee changed the title One shot br Enhance BuildRun reconciles failure scenarios Mar 4, 2021
@qu1queee qu1queee added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 4, 2021
@qu1queee qu1queee changed the title Enhance BuildRun reconciles failure scenarios WIP: Enhance BuildRun reconciles failure scenarios Mar 4, 2021
@openshift-ci-robot openshift-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/enhancement labels Mar 4, 2021
@qu1queee qu1queee force-pushed the one_shot_br branch 5 times, most recently from f96dfcf to dc7bb3b Compare March 9, 2021 20:27
@qu1queee qu1queee force-pushed the one_shot_br branch 4 times, most recently from 9b5c03b to 6ade3c5 Compare March 12, 2021 13:47
@qu1queee qu1queee removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 12, 2021
@qu1queee qu1queee changed the title WIP: Enhance BuildRun reconciles failure scenarios Enhance BuildRun reconciles failure scenarios Mar 12, 2021
@qu1queee qu1queee requested review from HeavyWombat, SaschaSchwarze0 and zhangtbj and removed request for sbose78 March 12, 2021 13:55
Copy link
Member

@SaschaSchwarze0 SaschaSchwarze0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good progress. Did a first run through it.

pkg/reconciler/buildrun/buildrun.go Outdated Show resolved Hide resolved
pkg/reconciler/buildrun/buildrun.go Show resolved Hide resolved
updateErr := r.updateBuildRunErrorStatus(ctx, buildRun, err.Error())
return nil, resources.HandleError("Failed to choose a service account to use", err, updateErr)
func (r *ReconcileBuildRun) getReferencedStrategy(ctx context.Context, build *buildv1alpha1.Build, buildRun *buildv1alpha1.BuildRun) (strategy buildv1alpha1.BuilderStrategy, err error) {
if build.Spec.StrategyRef.Kind == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the old code assumed a namespaced build strategy when no kind was set. I like to make it mandatory. But instead of code-validation, I prefer schema validation and therefore propose to change the type of Kind in StrategyRef from *BuildStrategyKind to BuildStrategyKind (in buildstrategy.go).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code assumes we dont default to anything but rather fail, I think this comment goes into the direction of #657, which as you could see have no consensus yet. So we might also tackle this one via that issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using 657 sounds good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After 657 discussions and also keeping in mind this was using a default before, I added one more commit that inlines with the new logic and that supports a default, see 796f5ae

updateErr := r.updateBuildRunErrorStatus(ctx, buildRun, err.Error())
return nil, resources.HandleError("failed to set OwnerReference for BuildRun and TaskRun", err, updateErr)
default:
err = fmt.Errorf("unknown strategy %s", string(*build.Spec.StrategyRef.Kind))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to handle this with schema validation as well ? OpenAPI supports enums. We currently define BuildStrategyKind like this in buildstrategy.go:

type BuildStrategyKind string

In typescript I would simply write:

type BuildStrategyKind = 'BuildStrategy' | 'ClusterBuildStrategy'

Not sure if something like this is possible in go and supported by the kube typing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure. Im totally fine with the way it is, it actually allowed me to simplify this code into a switch clause.

}
ctxlog.Info(ctx, "created serviceAccount for BuildRun", namespace, buildRun.Namespace, name, serviceAccount.Name, "BuildRun", buildRun.Name)
// add the secrets references into the new sa
ApplyCredentials(ctx, build, serviceAccount)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ApplyCredentials does not do a save. You need to move this to before the client.Create call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch! thanks, fixed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this deserves new integration tests., Im surprised this didnt break anything

Copy link
Member

@SaschaSchwarze0 SaschaSchwarze0 Mar 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im surprised this didnt break anything

Me too. :-) I think integration and e2e test use an anonymous registry. Is not too difficult to change and we should do that probably soon. And private git registry tests in our todo list for next week. ;-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I extended the creates a new service-account and deletes it after the build is terminated to check if an output secret exists on the autogenerated sa

@qu1queee qu1queee force-pushed the one_shot_br branch 2 times, most recently from 866c253 to 2a2f435 Compare March 12, 2021 15:58
Copy link
Member

@SaschaSchwarze0 SaschaSchwarze0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good progress. Two more remaining things from my site.

@@ -58,7 +58,7 @@ func add(ctx context.Context, mgr manager.Manager, r reconcile.Reconciler, maxCo

// The CreateFunc is also called when the controller is started and iterates over all objects. For those BuildRuns that have a TaskRun referenced already,
// we do not need to do a further reconciliation. BuildRun updates then only happen from the TaskRun.
return o.Status.LatestTaskRunRef == nil
return o.Status.LatestTaskRunRef == nil && o.Status.CompletionTime == nil
},
UpdateFunc: func(e event.UpdateEvent) bool {
// Ignore updates to CR status in which case metadata.Generation does not change
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would feel more comfortable if the update condition is

			// Avoid reconciling when for updates on the BuildRun, the build.shipwright.io/name
			// label is set, and when a BuildRun already have a referenced TaskRun.
			if o.GetLabels()[buildv1alpha1.LabelBuild] == "" || o.Status.LatestTaskRunRef != nil || o.Status.CompletionTime != nil {
				return false
			}

I do not think we have any use case to reconcile a completed BuildRun.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

_, obj, _ := client.CreateArgsForCall(0)
serviceAccount, castSuccessful := obj.(*corev1.ServiceAccount)
Expect(castSuccessful).To(BeTrue())
It("fails on a TaskRun creation due to unknown buildStrategy kind", func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the kind is missing, the fallback to the namespaced strategy is back. I think the unit test "only" fails to create the TaskRun because the namespaced build strategy does not exist. That's complicated to understand, I think. Would maybe be better if the test case would mock the namespaced build strategy to exist and then explicitly test this fallback.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@HeavyWombat HeavyWombat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two tiny things I would suggest to change.

pkg/reconciler/buildrun/buildrun.go Outdated Show resolved Hide resolved
pkg/reconciler/buildrun/resources/runtime_image.go Outdated Show resolved Hide resolved
@adambkaplan adambkaplan added this to the release-v0.4.0 milestone Mar 22, 2021
@qu1queee qu1queee force-pushed the one_shot_br branch 2 times, most recently from 16f1dd9 to 7d3d2b6 Compare March 22, 2021 15:50
qu1queee and others added 11 commits March 22, 2021 16:57
Ensure we have a customize error for Status Update failures
Ensure we have an error struct that can support multiple errors
Ensure we have a condition func to mark BuildRuns as Failed
Ensure we have a func for BuildRun objects to understand if the BuildRun
is failed or not.

Signed-off-by: Matthias Diester <[email protected]>
Mainly done for two reasons. One for simplifying the call, second to
stop using the controllerutil.CreateOrUpdate() func, which was not
possible to test via the unit-test fakes.

Signed-off-by: Matthias Diester <[email protected]>
This increases the coverage for the service account management in
buildruns and provides support for most of all the scenarios we have
there.
Simplify the createTaskRun func()
Simplify the strategies retrieval and make it a standalone step prior
to the TaskRun spec generation.
Ensure the serviceAccount retrieval is a standalone step prior to the
TaskRun spec generation.
For both serviceAccount and strategies retrieval, fail the TaskRun if
any of the custom checks take place or if a referenced object is not
found, otherwise allow one more reconciliation on sys call failures.

Signed-off-by: Matthias Diester <[email protected]>
On Create predicate events, do not reconcile if the CompletionTime is
set for BuildRuns.

When retrieving a Build object, reconcile again if is a sys call or
mark the BuildRun as failed if the object was not found.

When the Build object registration failed, mark the BuildRun as failed.

When creating the TaskRun, allow to reconcile one more time on errors.

Remove obsolete functions for updating BuildRun conditions.

Add unit-tests for Build object retrieval.
This test the behaviour of the BuildRun on custom validations, asserting
for a False Status in the Conditions or for an error after
reconciliation, meaning we allow one more reconcile.
To reflect test coverage on BuildRun failures and to asser for the
correct Condition fields inside the object.
Add new custom Reason valued when we have a Condition with a False
Status.
When a Build do not specify an strategy Kind, ensure we default to a
namespaced scope one. This logic was already in placed, a previous
commit removed and this commit added it again in a different function.

Also, it should fix shipwright-io#657
Introduce well defined test cases for:
- BuildRun with no strategy kind defined should default to namespaced
  strategy and it should find the one that is in the system
- BuildRun with no strategy kind defined should default to namespaced
  strategy and it should **fail** if there is no such strategy

Signed-off-by: Matthias Diester <[email protected]>
Ignore a potential follow-up error by explicitly setting it to undef.

Remove unwanted new-line.

Co-authored-by: Matthias Diester <[email protected]>
@gabemontero
Copy link
Member

I'll

/approve

and allow someone from CodeEngine to apply the lgtm

@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gabemontero

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 22, 2021
Copy link
Member

@SaschaSchwarze0 SaschaSchwarze0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 22, 2021
@openshift-merge-robot openshift-merge-robot merged commit 023e050 into shipwright-io:master Mar 22, 2021
@qu1queee qu1queee deleted the one_shot_br branch March 23, 2021 10:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. release-note Label for when a PR has specified a release note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve BuildRun Failure State Transitions
9 participants