Enhance BuildRun reconciles failure scenarios #641

qu1queee · 2021-03-04T20:36:41Z

Changes

Fixes #558

This PR addresses the following:

[1] Before a TaskRun resource is generated, we need a well-defined pattern for marking a BuildRun as FAILED, updating its Status.Conditions, and stopping any further reconciliation (also known as the one-shot approach). For errors that come from failing system calls, we will allow another reconciliation.
[2] Refactor some of the existing functions that run preparations before the TaskRun object is generated, e.g. retrieval of the strategy, retrieval of the service account, or service account generation.
[3] Add more test coverage, mainly unit tests, for [1] and [2].
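To make the distinction in [1] concrete, here is a minimal sketch of the decision, not the code in this PR. The `buildRun` type, the two sentinel errors, and `handleLookupError` are hypothetical stand-ins; the real reconciler works with the Kubernetes API errors and the BuildRun status conditions.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for a Kubernetes "not found" error and a
// transient system-call failure (for example, an API server timeout).
var (
	errNotFound  = errors.New("not found")
	errTransient = errors.New("connection refused")
)

// buildRun is a hypothetical, trimmed-down stand-in for the BuildRun resource.
type buildRun struct {
	Failed bool
	Reason string
}

// handleLookupError classifies the result of a lookup that runs before the
// TaskRun is generated. It returns true when the error should be handed back
// to the controller so that the request is reconciled again.
func handleLookupError(br *buildRun, err error, reason string) (requeue bool) {
	switch {
	case err == nil:
		return false
	case errors.Is(err, errNotFound):
		// One-shot: record the failure and stop reconciling.
		br.Failed = true
		br.Reason = reason
		return false
	default:
		// Anything else is treated as transient: try again.
		return true
	}
}

func main() {
	missing := &buildRun{}
	fmt.Println(handleLookupError(missing, fmt.Errorf("get build: %w", errNotFound), "BuildNotFound"), missing.Failed, missing.Reason)

	flaky := &buildRun{}
	fmt.Println(handleLookupError(flaky, errTransient, "BuildNotFound"), flaky.Failed)
}
```

The point of the split is that a missing object will not appear by retrying, while a failed API call might succeed on the next attempt.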

For the new custom checks that might lead to a failed BuildRun, we have the following states:

Before:

Status	Reason	CompletionTime is set	Description
Unknown	Pending	No	The BuildRun is waiting on a Pod in status Pending.
Unknown	Running	No	The BuildRun has been validated and started to perform its work.
True	Succeeded	Yes	The BuildRun Pod is done.
False	Failed	Yes	The BuildRun failed in one of the steps.
False	BuildRunTimeout	Yes	The BuildRun timed out.

After:

Status	Reason	CompletionTime is set	Description
Unknown	Pending	No	The BuildRun is waiting on a Pod in status Pending.
Unknown	Running	No	The BuildRun has been validated and started to perform its work.
True	Succeeded	Yes	The BuildRun Pod is done.
False	Failed	Yes	The BuildRun failed in one of the steps.
False	BuildRunTimeout	Yes	The BuildRun timed out.
False	UnknownStrategyKind	Yes	The strategy Kind specified in the Build is unknown (options: ClusterBuildStrategy or BuildStrategy).
False	ClusterBuildStrategyNotFound	Yes	The referenced cluster strategy was not found in the cluster.
False	BuildStrategyNotFound	Yes	The referenced namespaced strategy was not found in the cluster.
False	SetOwnerReferenceFailed	Yes	Setting owner references from the BuildRun to the related TaskRun failed.
False	TaskRunIsMissing	Yes	The TaskRun related to the BuildRun was not found.
False	TaskRunGenerationFailed	Yes	The generation of a TaskRun spec failed.
False	ServiceAccountNotFound	Yes	The referenced service account was not found in the cluster.
False	BuildRegistrationFailed	Yes	The Build related to the BuildRun is in a Failed state.
False	BuildNotFound	Yes	The related Build in the BuildRun was not found.
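As a rough illustration of how a row in the "After" table maps onto the object, the sketch below sets Status to False, records the Reason, and sets a completion time in one helper. The `condition` and `buildRunStatus` types and the `markFailed` and `isFailed` helpers are simplified assumptions for illustration, not the API types or functions added by this PR; only the reason strings are taken from the table.

```go
package main

import (
	"fmt"
	"time"
)

// Reason values copied from the table above (a subset).
const (
	reasonBuildNotFound          = "BuildNotFound"
	reasonServiceAccountNotFound = "ServiceAccountNotFound"
	reasonUnknownStrategyKind    = "UnknownStrategyKind"
)

// condition and buildRunStatus are hypothetical simplifications of the
// real status types.
type condition struct {
	Status  string // "True", "False" or "Unknown"
	Reason  string
	Message string
}

type buildRunStatus struct {
	Condition      condition
	CompletionTime *time.Time
}

// markFailed records a terminal failure: Status False, a Reason from the
// table, and a CompletionTime, matching the "Yes" column above.
func markFailed(s *buildRunStatus, reason, message string) {
	now := time.Now()
	s.Condition = condition{Status: "False", Reason: reason, Message: message}
	s.CompletionTime = &now
}

// isFailed reports whether the status carries a False condition.
func isFailed(s buildRunStatus) bool {
	return s.Condition.Status == "False"
}

func main() {
	var status buildRunStatus
	markFailed(&status, reasonBuildNotFound, "build example-build not found")
	fmt.Println(isFailed(status), status.Condition.Reason, status.CompletionTime != nil)
}
```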

Note: Reviewing this PR one commit at a time will simplify the review.

Submitter Checklist

Includes tests if functionality changed/was added
Includes docs if changes are user-facing
Set a kind label on this PR
Release notes block has been filled in, or marked NONE

See the contributor guide for details on coding conventions, GitHub and Prow interactions, and the code review process.

Release Notes

Introduce new Failed Reasons for BuildRun `Status.Conditions` and enhance the scenarios when BuildRuns are marked as Failed.

SaschaSchwarze0

Good progress. Did a first run through it.

pkg/reconciler/buildrun/buildrun.go

pkg/reconciler/buildrun/resources/service_accounts.go

SaschaSchwarze0 · 2021-03-12T14:58:40Z

pkg/reconciler/buildrun/buildrun.go

-		updateErr := r.updateBuildRunErrorStatus(ctx, buildRun, err.Error())
-		return nil, resources.HandleError("Failed to choose a service account to use", err, updateErr)
+func (r *ReconcileBuildRun) getReferencedStrategy(ctx context.Context, build *buildv1alpha1.Build, buildRun *buildv1alpha1.BuildRun) (strategy buildv1alpha1.BuilderStrategy, err error) {
+	if build.Spec.StrategyRef.Kind == nil {


I think the old code assumed a namespaced build strategy when no kind was set. I would like to make it mandatory. But instead of validating it in code, I prefer schema validation and therefore propose changing the type of Kind in StrategyRef from *BuildStrategyKind to BuildStrategyKind (in buildstrategy.go).

This code assumes we don't default to anything but rather fail. I think this comment goes in the direction of #657, which, as you can see, has no consensus yet. So we might also tackle this one via that issue.

Using 657 sounds good.

After the discussions in 657, and keeping in mind that this was using a default before, I added one more commit that aligns with the new logic and supports a default, see 796f5ae

SaschaSchwarze0 · 2021-03-12T15:10:35Z

pkg/reconciler/buildrun/buildrun.go

-		updateErr := r.updateBuildRunErrorStatus(ctx, buildRun, err.Error())
-		return nil, resources.HandleError("failed to set OwnerReference for BuildRun and TaskRun", err, updateErr)
+	default:
+		err = fmt.Errorf("unknown strategy %s", string(*build.Spec.StrategyRef.Kind))


Is it possible to handle this with schema validation as well? OpenAPI supports enums. We currently define BuildStrategyKind like this in buildstrategy.go:

type BuildStrategyKind string

In TypeScript I would simply write:

type BuildStrategyKind = 'BuildStrategy' | 'ClusterBuildStrategy'

Not sure if something like this is possible in Go and supported by the kube typing.

Not sure. I'm totally fine with the way it is; it actually allowed me to simplify this code into a switch clause.
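For reference, Go has no union-of-literals type like the TypeScript one above; the usual approximation is a named string type with constants plus a runtime check. The sketch below shows that pattern under assumed names (`NamespacedBuildStrategyKind`, `ClusterBuildStrategyKind`, and `validateKind` are illustrative, not necessarily the identifiers in the repository). Projects that generate CRDs with controller-gen can also put an enum marker on the type to move the check into the schema; whether that fits this project's tooling is an assumption.

```go
package main

import "fmt"

// BuildStrategyKind is the named string type quoted in the review.
// With controller-gen, a marker comment such as
//   +kubebuilder:validation:Enum=BuildStrategy;ClusterBuildStrategy
// on this type would express the same restriction in the CRD schema
// (assumption: the project generates its CRDs that way).
type BuildStrategyKind string

// Hypothetical constant names for the two accepted values.
const (
	NamespacedBuildStrategyKind BuildStrategyKind = "BuildStrategy"
	ClusterBuildStrategyKind    BuildStrategyKind = "ClusterBuildStrategy"
)

// validateKind is the runtime equivalent of an enum check: the compiler
// accepts any string converted to BuildStrategyKind, so unknown values
// have to be rejected explicitly.
func validateKind(kind BuildStrategyKind) error {
	switch kind {
	case NamespacedBuildStrategyKind, ClusterBuildStrategyKind:
		return nil
	default:
		return fmt.Errorf("unknown strategy %s", kind)
	}
}

func main() {
	fmt.Println(validateKind(ClusterBuildStrategyKind))
	fmt.Println(validateKind("SomethingElse"))
}
```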

SaschaSchwarze0 · 2021-03-12T15:17:18Z

pkg/reconciler/buildrun/resources/service_accounts.go

+		}
+		ctxlog.Info(ctx, "created serviceAccount for BuildRun", namespace, buildRun.Namespace, name, serviceAccount.Name, "BuildRun", buildRun.Name)
+		// add the secrets references into the new sa
+		ApplyCredentials(ctx, build, serviceAccount)


ApplyCredentials does not do a save. You need to move this to before the client.Create call.

good catch! thanks, fixed

This deserves new integration tests. I'm surprised this didn't break anything.

I'm surprised this didn't break anything

Me too. :-) I think the integration and e2e tests use an anonymous registry. That is not too difficult to change and we should probably do that soon. And private git registry tests are on our todo list for next week. ;-)

I extended the "creates a new service-account and deletes it after the build is terminated" test to check whether an output secret exists on the autogenerated service account.
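The ordering issue in this thread can be shown with a small stand-alone sketch. It assumes, as the review comment states, that applying credentials only mutates the in-memory object. The `serviceAccount`, `fakeClient`, `applyCredentials`, and `createWithCredentials` names are hypothetical stand-ins, not the project's client or helper signatures.

```go
package main

import "fmt"

// serviceAccount is a hypothetical, trimmed-down stand-in for corev1.ServiceAccount.
type serviceAccount struct {
	Name    string
	Secrets []string
}

// fakeClient keeps a copy of what was passed to Create, the way an API
// server keeps its own copy of the submitted object.
type fakeClient struct {
	stored map[string]serviceAccount
}

func (c *fakeClient) Create(sa *serviceAccount) {
	c.stored[sa.Name] = serviceAccount{
		Name:    sa.Name,
		Secrets: append([]string(nil), sa.Secrets...),
	}
}

// applyCredentials only changes the in-memory object; it does not save.
func applyCredentials(sa *serviceAccount, secret string) {
	sa.Secrets = append(sa.Secrets, secret)
}

// createWithCredentials mutates first and persists second, so the stored
// copy contains the secret reference.
func createWithCredentials(c *fakeClient, name, secret string) {
	sa := &serviceAccount{Name: name}
	applyCredentials(sa, secret)
	c.Create(sa)
}

func main() {
	c := &fakeClient{stored: map[string]serviceAccount{}}
	createWithCredentials(c, "buildrun-sa", "registry-secret")
	fmt.Println(c.stored["buildrun-sa"].Secrets)
}
```

Swapping the two calls in `createWithCredentials` would leave the stored copy without any secret, which is the situation the review comment describes.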

SaschaSchwarze0

Good progress. Two more remaining things from my side.

SaschaSchwarze0 · 2021-03-22T10:50:52Z

pkg/reconciler/buildrun/controller.go

@@ -58,7 +58,7 @@ func add(ctx context.Context, mgr manager.Manager, r reconcile.Reconciler, maxCo

 			// The CreateFunc is also called when the controller is started and iterates over all objects. For those BuildRuns that have a TaskRun referenced already,
 			// we do not need to do a further reconciliation. BuildRun updates then only happen from the TaskRun.
-			return o.Status.LatestTaskRunRef == nil
+			return o.Status.LatestTaskRunRef == nil && o.Status.CompletionTime == nil
 		},
 		UpdateFunc: func(e event.UpdateEvent) bool {
 			// Ignore updates to CR status in which case metadata.Generation does not change


I would feel more comfortable if the update condition is

// Avoid reconciling when for updates on the BuildRun, the build.shipwright.io/name
// label is set, and when a BuildRun already have a referenced TaskRun.
if o.GetLabels()[buildv1alpha1.LabelBuild] == "" || o.Status.LatestTaskRunRef != nil || o.Status.CompletionTime != nil {
	return false
}

I do not think we have any use case to reconcile a completed BuildRun.
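The two filters discussed here can be written as plain functions so that the intended behaviour is easy to check in isolation. This is only a sketch: `buildRunMeta` is a hypothetical flattening of the fields the predicates look at, and the update check simply mirrors the condition quoted above.

```go
package main

import "fmt"

// labelBuild is the label key mentioned in the quoted comment.
const labelBuild = "build.shipwright.io/name"

// buildRunMeta is a hypothetical flattening of the fields the predicates read.
type buildRunMeta struct {
	Labels        map[string]string
	HasTaskRunRef bool // stands in for Status.LatestTaskRunRef != nil
	Completed     bool // stands in for Status.CompletionTime != nil
}

// reconcileOnCreate mirrors the CreateFunc in the diff: skip BuildRuns that
// already reference a TaskRun or that are already completed.
func reconcileOnCreate(br buildRunMeta) bool {
	return !br.HasTaskRunRef && !br.Completed
}

// reconcileOnUpdate mirrors the condition proposed in the review comment.
func reconcileOnUpdate(br buildRunMeta) bool {
	if br.Labels[labelBuild] == "" || br.HasTaskRunRef || br.Completed {
		return false
	}
	return true
}

func main() {
	completed := buildRunMeta{Labels: map[string]string{labelBuild: "my-build"}, Completed: true}
	fresh := buildRunMeta{Labels: map[string]string{labelBuild: "my-build"}}
	fmt.Println(reconcileOnCreate(completed), reconcileOnUpdate(completed))
	fmt.Println(reconcileOnCreate(fresh), reconcileOnUpdate(fresh))
}
```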

SaschaSchwarze0 · 2021-03-22T11:12:11Z

pkg/reconciler/buildrun/buildrun_test.go

-				_, obj, _ := client.CreateArgsForCall(0)
-				serviceAccount, castSuccessful := obj.(*corev1.ServiceAccount)
-				Expect(castSuccessful).To(BeTrue())
+			It("fails on a TaskRun creation due to unknown buildStrategy kind", func() {


When the kind is missing, the fallback to the namespaced strategy is back. I think the unit test "only" fails to create the TaskRun because the namespaced build strategy does not exist. That's complicated to understand, I think. It would maybe be better if the test case mocked the namespaced build strategy to exist and then explicitly tested this fallback.
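One way to make the fallback explicit is to test both outcomes against a fake store: the strategy exists and is found, or it does not exist and the lookup fails. The sketch below uses a hypothetical `strategyStore` with a `lookup` method instead of the real fake client, purely to show the two cases.

```go
package main

import (
	"errors"
	"fmt"
)

// errStrategyNotFound is a hypothetical sentinel for a missing strategy.
var errStrategyNotFound = errors.New("strategy not found")

// strategyStore is a hypothetical in-memory stand-in for the cluster.
type strategyStore struct {
	namespaced map[string]bool
	cluster    map[string]bool
}

// lookup resolves the kind (defaulting to the namespaced BuildStrategy when
// no kind is given) and reports whether the named strategy exists.
func (s strategyStore) lookup(kind *string, name string) (string, error) {
	resolved := "BuildStrategy" // fallback when the kind is missing
	if kind != nil {
		resolved = *kind
	}
	switch resolved {
	case "BuildStrategy":
		if s.namespaced[name] {
			return resolved, nil
		}
		return resolved, errStrategyNotFound
	case "ClusterBuildStrategy":
		if s.cluster[name] {
			return resolved, nil
		}
		return resolved, errStrategyNotFound
	default:
		return resolved, fmt.Errorf("unknown strategy %s", resolved)
	}
}

func main() {
	withStrategy := strategyStore{namespaced: map[string]bool{"buildah": true}}
	empty := strategyStore{}

	// Case 1: no kind set, namespaced strategy exists.
	fmt.Println(withStrategy.lookup(nil, "buildah"))
	// Case 2: no kind set, namespaced strategy is missing.
	fmt.Println(empty.lookup(nil, "buildah"))
}
```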

HeavyWombat

Just two tiny things I would suggest to change.

pkg/reconciler/buildrun/buildrun.go

pkg/reconciler/buildrun/resources/runtime_image.go

- Ensure we have a custom error for Status Update failures
- Ensure we have an error struct that can support multiple errors
- Ensure we have a condition func to mark BuildRuns as Failed
- Ensure we have a func for BuildRun objects to understand if the BuildRun is failed or not.
Signed-off-by: Matthias Diester <[email protected]>

Simplify the createTaskRun func()
- Simplify the strategies retrieval and make it a standalone step prior to the TaskRun spec generation.
- Ensure the serviceAccount retrieval is a standalone step prior to the TaskRun spec generation.
- For both serviceAccount and strategies retrieval, fail the TaskRun if any of the custom checks take place or if a referenced object is not found, otherwise allow one more reconciliation on sys call failures.
Signed-off-by: Matthias Diester <[email protected]>

- On Create predicate events, do not reconcile if the CompletionTime is set for BuildRuns.
- When retrieving a Build object, reconcile again if it is a sys call failure, or mark the BuildRun as failed if the object was not found.
- When the Build object registration failed, mark the BuildRun as failed.
- When creating the TaskRun, allow reconciling one more time on errors.
- Remove obsolete functions for updating BuildRun conditions.
- Add unit tests for Build object retrieval.

When a Build does not specify a strategy Kind, ensure we default to a namespaced-scope one. This logic was already in place; a previous commit removed it and this commit adds it again in a different function. Also, it should fix shipwright-io#657

Introduce well-defined test cases for:
- BuildRun with no strategy kind defined should default to namespaced strategy and it should find the one that is in the system
- BuildRun with no strategy kind defined should default to namespaced strategy and it should **fail** if there is no such strategy
Signed-off-by: Matthias Diester <[email protected]>

gabemontero · 2021-03-22T16:00:47Z

I'll

/approve

and allow someone from CodeEngine to apply the lgtm

openshift-ci-robot · 2021-03-22T16:00:54Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gabemontero

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [gabemontero]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

SaschaSchwarze0

/lgtm

qu1queee added the do-not-merge/work-in-progress and kind/enhancement labels Mar 4, 2021

openshift-ci-robot added the release-note label and removed the do-not-merge/work-in-progress label Mar 4, 2021

openshift-ci-robot requested review from adambkaplan and sbose78 March 4, 2021 20:36

qu1queee changed the title ~~One shot br~~ Enhance BuildRun reconciles failure scenarios Mar 4, 2021

qu1queee added the do-not-merge/work-in-progress label Mar 4, 2021

qu1queee changed the title ~~Enhance BuildRun reconciles failure scenarios~~ WIP: Enhance BuildRun reconciles failure scenarios Mar 4, 2021

openshift-ci-robot added the kind/feature label and removed the kind/enhancement label Mar 4, 2021

qu1queee force-pushed the one_shot_br branch 5 times, most recently from f96dfcf to dc7bb3b Compare March 9, 2021 20:27

qu1queee force-pushed the one_shot_br branch 4 times, most recently from 9b5c03b to 6ade3c5 Compare March 12, 2021 13:47

qu1queee removed the do-not-merge/work-in-progress label Mar 12, 2021

qu1queee changed the title ~~WIP: Enhance BuildRun reconciles failure scenarios~~ Enhance BuildRun reconciles failure scenarios Mar 12, 2021

qu1queee requested review from HeavyWombat, SaschaSchwarze0 and zhangtbj and removed request for sbose78 March 12, 2021 13:55

SaschaSchwarze0 requested changes Mar 12, 2021

qu1queee force-pushed the one_shot_br branch 2 times, most recently from 866c253 to 2a2f435 Compare March 12, 2021 15:58

qu1queee requested a review from SaschaSchwarze0 March 12, 2021 15:59

SaschaSchwarze0 requested changes Mar 22, 2021

HeavyWombat requested changes Mar 22, 2021

adambkaplan added this to the release-v0.4.0 milestone Mar 22, 2021

qu1queee force-pushed the one_shot_br branch 2 times, most recently from 16f1dd9 to 7d3d2b6 Compare March 22, 2021 15:50

qu1queee and others added 11 commits March 22, 2021 16:57

Refactor service account retrieval

2a68567

Mainly done for two reasons. One for simplifying the call, second to stop using the controllerutil.CreateOrUpdate() func, which was not possible to test via the unit-test fakes. Signed-off-by: Matthias Diester <[email protected]>

Expand service account unit tests

d41aabd

This increases the coverage for the service account management in BuildRuns and covers most of the scenarios we have there.

Enhance unit-tests

b516072

This tests the behaviour of the BuildRun on custom validations, asserting for a False Status in the Conditions or for an error after reconciliation, meaning we allow one more reconcile.

Add integration tests

41f6683

To reflect test coverage on BuildRun failures and to assert for the correct Condition fields inside the object.

Enhance BuildRun states table:

17fe345

Add new custom Reason values for when we have a Condition with a False Status.

Apply suggestions from code review

239f470

Ignore a potential follow-up error by explicitly setting it to undef. Remove unwanted new-line. Co-authored-by: Matthias Diester <[email protected]>

qu1queee force-pushed the one_shot_br branch from 59c4524 to 239f470 Compare March 22, 2021 15:58

qu1queee requested review from SaschaSchwarze0 and HeavyWombat March 22, 2021 15:58

openshift-ci-robot added the approved label Mar 22, 2021

SaschaSchwarze0 approved these changes Mar 22, 2021

openshift-ci-robot assigned SaschaSchwarze0 Mar 22, 2021

openshift-ci-robot added the lgtm label Mar 22, 2021

openshift-merge-robot merged commit 023e050 into shipwright-io:master Mar 22, 2021

qu1queee mentioned this pull request Mar 23, 2021

Change StrategyRef strategy kind default behaviour #657

Closed

qu1queee deleted the one_shot_br branch March 23, 2021 10:12

qu1queee mentioned this pull request Mar 23, 2021

Improve BuildRun Failure State Transitions #558

Closed
