Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build controller should watch referenced secrets #466

Closed
qu1queee opened this issue Nov 3, 2020 · 21 comments · Fixed by #493
Closed

Build controller should watch referenced secrets #466

qu1queee opened this issue Nov 3, 2020 · 21 comments · Fixed by #493
Labels
beta BETA related issue

Comments

@qu1queee
Copy link
Contributor

qu1queee commented Nov 3, 2020

From #457

Idea:

Currently a Build validation (see comment for more information) takes place only during events on the Build (e.g. create, update).

Referenced secrets on a Build definition might not longer be present during runtime of a BuildRun(creation event), leading to unwanted experiences for users during the BuildRun execution.

Because a BuildRun already have a "basic" mechanism for preventing that the building starts, if the Build validations fail, we should enhance the Build reconciles and also validate when particular events on the reference secrets take place.

Acceptance_Criteria:

Some thoughts on the implementation:

  • We need a new watcher on the Build controller so that we can watch all referenced secrets on the namespace. Example of this can be found in the BuildRun controller, where we watch different resources
  • The new watcher should have some filters. We should use predicates for reconciling only on deletion/creation events.
  • This requires unit and integration test cases
  • Docs updates
@qu1queee qu1queee added the beta BETA related issue label Nov 3, 2020
@qu1queee qu1queee added this to the release-v1.0.0 milestone Nov 3, 2020
@qu1queee
Copy link
Contributor Author

qu1queee commented Nov 3, 2020

@SaschaSchwarze0 can u add more context in here around the extra fixes this card will provide us, bitte?

@SaschaSchwarze0
Copy link
Member

SaschaSchwarze0 commented Nov 4, 2020

@SaschaSchwarze0 can u add more context in here around the extra fixes this card will provide us, bitte?

Sure,

these scenarios should be:

  • Create a build that references a secret that does not exist. Later, create the secret. The build should get registered. This scenario works as well today using the error-reconciliation-loop, but this current implementation has the weakness that the time between reconciles is growing over time. So, if there is for example a day between build and secret creation, it can take quite a lot time for the build to be registered. This is expected to be improved by this feature.
  • Create a build that references an existing secret. The build gets registered. Later, delete the secret. The build should get unregistered. This scenario is not working today.

Going forward, once we also validate the content of the secret (correct secret type, registry push permission), we will need to extend this reconciliation:

  • In addition to creation and deletion, we will also need to reconcile on update, and re-validate the secret for those builds that reference it.
  • Once we do permission checks, we might even consider doing periodic checks every day maybe to validate that the secret is still valid (and has not expired for example).

But that's probably a second step in a separate issue.

But regarding this issue, a question I have is if we want to specifically watch referenced secrets are referenced objects in general which will also include the build strategy?

@qu1queee
Copy link
Contributor Author

qu1queee commented Nov 5, 2020

Just to answer your question.

But regarding this issue, a question I have is if we want to specifically watch referenced secrets are referenced objects in general which will also include the build strategy?

I will initially say no, because an strategy once applied should be immutable. Is an immutable object because in theory this is a resource controlled by the admin entities, not regular users. A Build that references an strategy can assume that such strategy will not change its contents, and that any change on the strategy should end-up as a new strategy with a new name.

@zhangtbj
Copy link
Contributor

zhangtbj commented Nov 9, 2020

We need to care about the namespace scope and secret event type to make sure the performance is good.

@sbose78
Copy link
Member

sbose78 commented Nov 9, 2020

The objective is to be able to declare the Build unusable upfront instead of failing after triggering the BuildRun ?

@qu1queee
Copy link
Contributor Author

qu1queee commented Nov 9, 2020

@sbose78 yes, by ensuring that the Build validation not only happens during a creation event, but also if a secret gets deleted.

Concrete use case is:

  1. A Build is validate, all secrets are in place, Build is marked as Succeeded.
  2. A BuildRun is applied, but prior to the BuildRun creation event, one of the Build referenced secrets got deleted

Based on the above, we should "re-validate" a Build when an event(deletion) takes place on the referenced secret.

@qu1queee
Copy link
Contributor Author

qu1queee commented Nov 16, 2020

@sbose78 any concerns on this? or things important to keep in mind

@sbose78
Copy link
Member

sbose78 commented Nov 16, 2020

My understanding is that the controller-runtime client uses a previously-populated cache for reading secrets, and therefore this should be good!

@gabemontero , thoughts on this ? :)

@gabemontero
Copy link
Member

This seems like a good idea to me @sbose78 @qu1queee

Similarly I would think, as the api evolves and references like this expand (configmaps or volumes perhaps) a similar approach is taken with those Build dependencies if you will

@xiujuan95
Copy link
Contributor

xiujuan95 commented Nov 17, 2020

@qu1queee @zhangtbj @sbose78 @SaschaSchwarze0

Because I plan to take this issue, I go through above discussions, just want to confirm, this issue will focus on below two cases, right?

  • A build has been in Succeeded status, then the referenced secret of this build is deleted, then the status of this build should be updated and show us secret not found;

  • A build is created, however, the referenced secret doesn't exist. Then when we create this secret, then the status of this build should be updated and succeed with a shorter time;

@zhangtbj
Copy link
Contributor

About the second case:

A build is created, however, the referenced secret doesn't exist. Then when we create this secret, then the status of this build should be updated and succeed with a shorter time;

We already support this case.

I don't understand the scenario which @SaschaSchwarze0 mentioned well:

Create a build that references a secret that does not exist. Later, create the secret. The build should get registered. This scenario works as well today using the error-reconciliation-loop, but this current implementation has the weakness that the time between reconciles is growing over time. So, if there is for example a day between build and secret creation, it can take quite a lot time for the build to be registered. This is expected to be improved by this feature.

So, if there is for example a day between build and secret creation, it can take quite a lot time for the build to be registered. This is expected to be improved by this feature.

I think it is a default behavior from kube, what can we do to improve in this issue?

@SaschaSchwarze0
Copy link
Member

SaschaSchwarze0 commented Nov 17, 2020

We already support this case.

I don't understand the scenario which @SaschaSchwarze0 mentioned well:

Create a build that references a secret that does not exist. Later, create the secret. The build should get registered. This scenario works as well today using the error-reconciliation-loop, but this current implementation has the weakness that the time between reconciles is growing over time. So, if there is for example a day between build and secret creation, it can take quite a lot time for the build to be registered. This is expected to be improved by this feature.

So, if there is for example a day between build and secret creation, it can take quite a lot time for the build to be registered. This is expected to be improved by this feature.

I think it is a default behavior from kube, what can we do to improve in this issue?

Our current code uses an error driven reconciliation loop. This is not ideal for two reasons:

  • In my opinion, reconciling again on error makes sense if the error is something that can be expected to go away when reconciling again. For example, an API update failure because of a version conflict or a network hickup. We reconcile again, but require a user to apply a fix (the creation of the secret) for the reconciliation to be successful. -> In such a case, it is more efficient to reconcile on event of the user action (which is the idea here as the event is the secret creation). More efficient because all the failing reconciliations consume CPU, require Kube API calls.
  • The error-driven reconciliation starts with a short frequency that gets larger. After a day, when you create the secret, it can take minutes for the build to be reconciled again. This can confuse users because they cannot know what is happening behind the scenes.

In short: by reconciling the build on secret creation, we improve efficiency and user experience.

@qu1queee
Copy link
Contributor Author

+1 on @SaschaSchwarze0 and @zhangtbj thanks for asking (very valid question) . In a nutshell, by watching the secret those long frequencies will not be a concern anymore, whenever the secret gets deleted/created. I pinged @xiujuan95 offline, I will be pairing with her on this card, to save some comments on this issue. If @SaschaSchwarze0 or @zhangtbj wants to join us, let us know.

@zhangtbj
Copy link
Contributor

Pls go ahead.

My only question is about:

In such a case, it is more efficient to reconcile on event of the user action (which is the idea here as the event is the secret creation). More efficient because all the failing reconciliations consume CPU, require Kube API calls.

First of all, I think, in our current logic, our build controller will always keep reconciling if the secret doesn't exit. So although we add secret creation event, it just make ONE reconciliation faster once the secret is created, but before that, such as ONE DAY without the secret, the build controller will still keep reconciling and also cost CPU and Memory.

After a day, when you create the secret, it can take minutes for the build to be reconciled again.

And I think error reconciling should not take minutes, should be just several seconds

So I think it is better that we just care about the secret deletion to simplify our enhancement. Secret creation can help but cannot help a lots, in my opinion.

@SaschaSchwarze0
Copy link
Member

My only question is about:

In such a case, it is more efficient to reconcile on event of the user action (which is the idea here as the event is the secret creation). More efficient because all the failing reconciliations consume CPU, require Kube API calls.

First of all, I think, in our current logic, our build controller will always keep reconciling if the secret doesn't exit. So although we add secret creation event, it just make ONE reconciliation faster once the secret is created, but before that, such as ONE DAY without the secret, the build controller will still keep reconciling and also cost CPU and Memory.

After a day, when you create the secret, it can take minutes for the build to be reconciled again.

And I think error reconciling should not take minutes, should be just several seconds

It might not have been made clear, but, once we reconcile on secret creation, we also will stop returning an error in the build reconciliation when it determines a missing secret. So, we will not keep reconciling on such a build. This saves resources in our controller.

The reconciliation indeed only needs seconds. But, what - with the current implementation - can take minutes, is the time between secret creation and build reconciliation. If our controller returns an error, it is out of our control on when the next reconciliation happens. This will be addressed in this issue - by reconciling the build when a secret is created.

@adambkaplan
Copy link
Member

The biggest challenge with this feature is that the build controller (or another controller therein) would need to do the following:

  1. Watch every secret in every non-system namespace
  2. Check if the secret is (or was) referenced by a Build object
  3. Update status on referenced Build objects

Doing this with what we have today would result in a Cartesian explosion - for a given namespace, every secret N would need to check M Build objects. This can't scale.

OwnerRefs is probably not the way to go, since a secret could be referenced by multiple Build objects. It also risks a secret being deleted accidentally if the parent Build object is deleted. The inverse is also true (don't want to delete a Build because it is owner-refed by all Secrets).

Perhaps we add annotations to referenced Secrets when the Build object is created/updated? And similarly remove/update annotations if the Build object is deleted?

@qu1queee
Copy link
Contributor Author

@adambkaplan agree. I started on this today together with @xiujuan95 and I have exactly the same concern, see todo comment . One important thing is that the controller-runtime should be getting resources from the cache( see discussion, but still is an expensive call

@xiujuan95
Copy link
Contributor

During the implementation, I met a problem, when delete secrets event happens, I can't recognize which build is referencing this secret. So according to current logic, it will alway reconcile a build called this secret's name, this is not expected.

	preSecret := predicate.Funcs{
		// Check if this secret has been deleted, if it's true, will return true
		DeleteFunc: func(e event.DeleteEvent) bool {
			return !e.DeleteStateUnknown
		},
		UpdateFunc:  nil,

	}
	// Watch all secrets
	err = c.Watch(&source.Kind{Type: &corev1.Secret{}}, &handler.EnqueueRequestsFromMapFunc{
		ToRequests: handler.ToRequestsFunc(func(o handler.MapObject) []reconcile.Request {
			secret := o.Object.(*corev1.Secret)
			
			return []reconcile.Request{
				{
					NamespacedName: types.NamespacedName{
						Name:      secret.Name,
						Namespace: secret.Namespace,
					},
				},
			}
		}),
		}, preSecret)
	if err != nil {
		return err
	}

Below is an example, I have a secret called foobar and I create a build to use it. Now I delete this secret, but controller will reconcile a build names foobar:

2020-11-23T16:09:59.152+0800	DEBUG	build.build-controller	start reconciling Build	{"namespace": "default", "name": "foobar"}
2020-11-23T16:09:59.152+0800	DEBUG	build.build-controller	finish reconciling build. build was not found	{"namespace": "default", "name": "foobar"}
2020-11-23T16:09:59.152+0800	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "build-controller", "request": "default/foobar"}

So maybe I also agree with @adambkaplan to add annotation for secrets, like below:

apiVersion: v1
kind: Secret
metadata:
  name: foobar
  annotations:
    build.build.dev/build: ["build-0", "build-1", ''build-2"]

Once the referenced secret is deleted, then all builds ("build-0", "build-1", ''build-2") are referencing this secret will be reconciled. And I think this way also can improve controller's performance, because we don't need to reconcile all builds when a secret is deleted.
Any Opinion? @qu1queee @zhangtbj @SaschaSchwarze0

@sbose78
Copy link
Member

sbose78 commented Nov 23, 2020

Approach looks fine to me.

Implementation note:

  1. Use Predicates for filtering out: 'Accept' only those Secret events which have the build.build.dev/build annotation.
  2. Use EnqueueRequestsFromMapFunc for determining the correct Build object before adding it to the reconciliation queue.
    Sample code: https://github.com/argoproj-labs/argocd-operator/pull/199/files#diff-31ecb8aa6e5fc370a3729ec025dca6767d511ee5062276d99401fcab2b3aee23R698

@sbose78
Copy link
Member

sbose78 commented Nov 24, 2020

Doing this with what we have today would result in a Cartesian explosion - for a given namespace, every secret N would need to check M Build objects. This can't scale.

Adding to that:

Controller-runtime client uses a cache for reading and it populates the cache by watching any resources that were requested. Are there general scale concerns around watching all secrets here ?

@qu1queee
Copy link
Contributor Author

qu1queee commented Nov 26, 2020

@sbose78 I do prefer to do #466 (comment) , two levels of filtering via predicates and the watcher.

Asking for your input around who will define the annotation. I think in a nutshell there are two options:

  1. user is forced to add the annotation in the secret. This is the best one for the controller, because that annotation will enable us to have a nice level of filtering when reacting to secrets events(create, update,delete). Best for performance, but not desired from a user perspective. Although, forcing users to add the annotation seems to be a common pattern in Tekton & Build v1 from what I see.

  2. user is not forced to add the annotation, so either we patched the annotation via the Build controller, or we filter secrets based on their types. I did both options and I realised there is not good way to avoid the controller reacting on CRUD events from secrets of any namespaces in the cluster. This is good for users, but performance wise seems not ideal.

we have been discussing this today @xiujuan95 @SaschaSchwarze0 and myself, what are ur thoughts?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
beta BETA related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants