
Introduce error codes on the Build CRD validations #463

Closed
qu1queee opened this issue Nov 3, 2020 · 6 comments · Fixed by #536
Labels: beta (BETA related issue)

@qu1queee (Contributor) commented Nov 3, 2020

From #457

Idea:

The Build controller currently executes several validations, for example checking that referenced secrets and build strategies exist.

When a Build validation fails, the validation error is propagated into the Build .Status.Reason field, e.g.:

NAMESPACE          NAME                              REGISTERED   REASON                                                 BUILDSTRATEGYKIND      BUILDSTRATEGYNAME      CREATIONTIME
16hpqn7r3d6        sh-build-cc2020-12                False        secret sh-registry-secret-01 does not exist            ClusterBuildStrategy   kaniko-medium          11d
16hpqn7r3d6        sh-build-cc20201023-01            False        secret sh-rs-not-existing does not exist               ClusterBuildStrategy   kaniko-medium          11d
1qvnynlptul        helloworld-build                  False        secret myregistry does not exist                       ClusterBuildStrategy   kaniko-medium          3d21h
a3659891-e2eb      clitest                           False        clusterBuildStrategy buildpacks-v3-v3 does not exist   ClusterBuildStrategy   buildpacks-v3-v3       67d
a8e740be-92dd      sh-builddef02                     False        secret testsecret does not exist                       ClusterBuildStrategy   kaniko-medium          61d

We have full control over the above validations and their error message definitions. However, if third parties (e.g. UI/CLI teams) rely on those error messages, we will break them whenever we change (e.g. enhance) our error messages.

Acceptance Criteria:

We would like these error messages to remain stable over time. Because we understand that the message text will change, the proposed solution is to prepend the message with a well-defined error code.

A prepended error code ensures that the message text can change over time while third parties rely only on the code, which never changes. A code could only ever be removed at some future point, never repurposed.
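For illustration only, a minimal Go sketch of what the proposed prefix could look like; the `ErrorCode` type, the code value `BLD101`, and the `ReasonWithCode` helper are hypothetical and not part of the current API:

```go
package validate

import "fmt"

// ErrorCode is a hypothetical stable identifier that would be prepended to the
// human-readable validation message; third parties match on the code, not the text.
type ErrorCode string

// CodeSecretDoesNotExist is an assumed example code for a missing registry secret.
const CodeSecretDoesNotExist ErrorCode = "BLD101"

// ReasonWithCode composes the value that would be written into .Status.Reason,
// e.g. "BLD101: secret sh-registry-secret-01 does not exist".
func ReasonWithCode(code ErrorCode, format string, args ...interface{}) string {
	return fmt.Sprintf("%s: %s", code, fmt.Sprintf(format, args...))
}
```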

This issue is about:

  • generating such error codes (WIP on the best approach)
  • implementing the usage of the error codes
  • adding a test case that detects changes to error codes and fails when they change
@adambkaplan (Member) commented:

I am loath to support numeric error codes when Kubernetes has a much better alternative: status conditions. We have already discussed this as a feature we want (see #307). With conditions we can probably do away with the top-level Reason field and move to a model where each condition describes the various states of our respective API objects (specifically Build and BuildRun).

Condition types and Reason codes in status conditions give us the best of both worlds - a human-readable string backed by an API guarantee [1]. Tools can infer state and failure mode from these conditions, and developers can likewise infer state by inspecting the conditions YAML.
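For illustration, a minimal sketch of what such a condition could look like using the upstream metav1.Condition type; the Type ("Registered") and Reason ("SecretNotFound") values are assumptions for this example, not an agreed API:

```go
package status

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// registeredCondition sketches how a failed secret validation could be surfaced
// as a condition: tools match on Type and Reason, humans read Message.
func registeredCondition(secretName string) metav1.Condition {
	return metav1.Condition{
		Type:               "Registered",     // assumed condition type, not an agreed API
		Status:             metav1.ConditionFalse,
		Reason:             "SecretNotFound", // stable, CamelCase reason code (assumed)
		Message:            "secret " + secretName + " does not exist",
		LastTransitionTime: metav1.Now(),
	}
}
```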

Numeric error codes buried in a message string require end users and tool builders to look up those codes in documentation that must be carefully maintained. This adds toil and is, IMO, hostile to the user experience.

[1] https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties

@qu1queee (Contributor, Author) commented Nov 18, 2020

@adambkaplan some key things:

  1. I think we briefly discussed the error codes in a community meeting. I do recall you mentioning this same concern, but there was no follow-up, so we should start creating action items when there is a lack of consensus on particular topics, to avoid dev work that we might not want.

  2. @SaschaSchwarze0 had the same idea; he shared a rough model for adding conditions to the Build.

  3. The conditions framework ("Move to Conditions instead of Status in the BuildRun" #307) is designed so that other CRDs can support conditions if needed, so from this point of view, adding conditions support to the Build would be simple.

  4. I have strong opinions on why the Build CRD should not support conditions. Just for the record, we are speaking about the Build, not the BuildRun.

  • Build is a configuration template; it does not trigger or generate any object whose state needs to be tracked, contrary to a Deployment that generates pods, a BuildRun that generates a TaskRun and then a pod, or a StatefulSet that generates pods.

  • The community docs around conditions mention the following:

    However, conditions are observations and not, themselves, state machines, nor do we define comprehensive state machines for objects, nor behaviors associated with state transitions. The system is level-based rather than edge-triggered, and should assume an Open World.

    There is a nice article on level-based vs. edge-triggered systems; if you take a look, you will see that the validations we do in the Build (e.g. check whether a secret exists, check whether a BuildStrategy exists) are edge-triggered, not level-based. If there is a disruption in the system while we validate the existence of a secret, we will mark our validation as "Failed" even though the secret was there (the edge-triggered approach), and we will not reconcile after that. We might reconcile later, but only on events on the secret (create, delete), and these events could happen a long time after the disruption took place. This is contrary to a TaskRun, where the conditions revolve around the state of a pod and whether it matches the desired state of the cluster, and where an Unknown status makes a lot of sense to indicate that the controller is reconciling that object and that the object is active.

    I do not see how the Unknown status could fit into Conditions for the Build because, as mentioned before, the Build is just a template: we reconcile on the creation event of that template, and we do not rectify the state of an object in case of disruptions.

@adambkaplan (Member) commented:

Let’s step back a bit on this issue and associated PR. The problem context is as follows:

  1. Build objects have an internal validation with the Registered status field, which can be False for multiple reasons.
  2. Some reasons are transient (ex - failed API call), whereas some require user action to address (missing Secret or BuildStrategy).
  3. External tooling (user interfaces, command lines) cannot readily identify reasons why Registered is False.

The real want is an API which lets an external tool understand why a Build object is not Registered. Since external tools are involved, we aspire to have compatibility guarantees with respect to this API.

The proposed solution is to prepend a numeric error code to the “Reason” field, which today is a free form string. The field retains its free form message, but the numeric prefix comes with an implied API guarantee - namely that number refers to one thing and one thing only. This is problematic because from an API perspective, we have no means of enforcing this guarantee. “Reason” is a free form string, and under this solution it remains a free form string with a convention that cannot be strictly enforced by Go code. It also adds overhead to end clients, which need to parse the numeric code to decipher the underlying reason the Build object did not register.

I have proposed as an alternative using Kubernetes Conditions [1]. This lets us provide API stability through the Type and Reason fields, and allow our controllers to be expressive through the Conditions array. If desired, we can enforce the API compatibility/guarantees in code via Condition-compatible types.

@qu1queee has rightly identified that Build objects themselves don’t generate pods that would trigger reconciliation. Most objects that use Conditions have underlying pods that trigger reconciliation. The closest objects we have to Build in upstream communities are:

  1. CronJob - this has a status subresource, but no conditions. [2]
  2. Tekton Tasks - this has no status whatsoever. [3]

Given this information, adding a full Conditions array to the Build object may not make sense. I therefore propose the following middle ground that uses the conventions of Conditions, but not the full blown array.

  1. Convert “Reason” from a free string to a Golang string type, with constants that represent CamelCase reasons the Build isn’t registered.
  2. Add the field “Message” to the Build status, containing the free message that we have today.

This will address our goal of providing an API to understand why builds do not register, while retaining our ability to report detailed error messages. Does this make sense?

[1] https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties
[2] https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.19/#cronjobstatus-v1beta1-batch
[3] https://github.com/tektoncd/pipeline/blob/master/pkg/apis/pipeline/v1beta1/task_types.go#L45-L53
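A rough Go sketch, for discussion only, of how points 1 and 2 above could look in the Build API types; the constant names and field shapes below are placeholders, not the final API:

```go
package v1alpha1

// BuildReason is a typed, CamelCase, API-stable reason why a Build is or is not
// registered; the concrete constant names below are illustrative placeholders.
type BuildReason string

const (
	SucceedStatus         BuildReason = "Succeeded"
	SecretNotFound        BuildReason = "SecretNotFound"
	BuildStrategyNotFound BuildReason = "BuildStrategyNotFound"
)

// BuildStatus keeps the stable Reason and moves the free-form text to Message.
type BuildStatus struct {
	// Registered keeps today's True/False information.
	Registered string `json:"registered,omitempty"`
	// Reason becomes a typed, stable identifier instead of a free-form string.
	Reason BuildReason `json:"reason,omitempty"`
	// Message carries the detailed, human-readable text we report today.
	Message string `json:"message,omitempty"`
}
```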

@qu1queee (Contributor, Author) commented Dec 2, 2020

@adambkaplan thanks, I need to sync with Zoe and Jordan on your reply; I will provide some feedback soon.

@qu1queee (Contributor, Author) commented Dec 3, 2020

@adambkaplan yes, you are laying out a very nice proposal. I think we will definitely not use Conditions for the Build, but as you said, we should adopt some of the API Conditions best practices. From this point of view, we are misusing the status.Reason field and not using status.Message, which we should. I will propose a PR for these changes later, and from there we can discuss whether this overall refactoring makes sense.

Now, the above is not related to what we were proposing in #483. In #483 we are after a way to standardize the "error message". Error codes were the initial idea, which we all agree we should no longer pursue (@xiujuan95, @zhangtbj and myself). However, the question remains open: how can we achieve better standardization of those error messages? Whether the error message lives under status.Reason today or under status.Message tomorrow, the question stands. In addition, we would like this message not to change over time unless needed, and if it does change, to consider that an API-breaking change. Also, as you mentioned, we are aware this error message might sometimes carry transient errors (failed API calls), but for the situations where that is not the case and our validations work, we should definitely have a better way to represent the error message. Therefore I am asking for some ideas.

@xiujuan95 thanks for working on #483, but as discussed we should proceed to close it.

qu1queee added a commit to qu1queee/build that referenced this issue Dec 11, 2020
issue shipwright-io#463 .

Keep in mind this would be the second attempt to improve on this
topic, we previously tried it with:

- shipwright-io#483

Signed-off-by: Zoe <[email protected]>
@qu1queee (Contributor, Author) commented:

@adambkaplan and others, for your information, @xiujuan95 and I have already started working on the "Build refactoring" based on the comments on this issue; see our initial TODO list for changes at qu1queee@4b69868.

We think this should be a neat approach, so please take a look while we work on it; otherwise, we will open a PR soonish.

qu1queee added a commit to qu1queee/build that referenced this issue Jan 18, 2021
issue shipwright-io#463 .

Keep in mind this would be the second attempt to improve on this
topic, we previously tried it with:

- shipwright-io#483

Signed-off-by: Zoe <[email protected]>