
[Operator] various fixes for Kubeflow Operator #411

Merged
merged 14 commits on Oct 14, 2020

Conversation

adrian555
Member

This PR addresses a list of issues related to the Kubeflow Operator:

  • First reported in Operator reconciling and redeploying the kfdef without known reason #393, the operator redeploys Kubeflow when the cluster's master is restarted. This has been reproduced, and the root cause is related to garbage collection: the Kubeflow Operator set ownerReferences on the resources it created and relied on garbage collection to delete them when the kfdef instance was deleted. Since the garbage collection problem will not be fixed in the Kubernetes project for a while, this PR takes the approach other projects have adopted and uses annotations instead (see the annotation sketch after this list).

  • As part of the Kubeflow deployment, many new CRDs are created that are not known to the operator's current controller. One example is Application kind resources: they were not watched by the controller, so the operator was not notified when they were deleted. To fix this, this PR adds a new controller that watches these GVKs once the first kfdef instance is created successfully.

  • Reduce the number of reconcile requests when creating and deleting a kfdef instance. Two events happen when a kfdef CR is applied: first a CREATE event, then an UPDATE event when the Reconcile() function kicks in and adds the finalizer to the CR. This results in one extra Reconcile() call after Kubeflow is successfully deployed. Two events also happen when the kfdef CR is deleted: first an UPDATE event, because the kfdef resource has a finalizer, so the delete action adds a deletion timestamp to the CR and thereby updates it; then a DELETE event when the finalizer is finally removed. This results in one extra Reconcile() call after the kfdef CR is actually deleted. This PR instead adds the finalizer in the watch handler func when the kfdef instance is created, so only the UPDATE event queues a request. When the kfdef CR is deleted and the deletion timestamp is added, that UPDATE event is what the reconcile func handles; when the CR is finally deleted after the finalizer is removed, no request needs to be queued (see the watch-handler sketch after this list).

  • For resources created during the Kubeflow deployment, verify that their annotations contain the kfdef instance's name and namespace before queuing a request for reconciling.

  • Now that the operator no longer depends on garbage collection to delete the resources, this PR adds a KfDelete() function to handle the deletion. It runs as part of the kfdef instance's finalizer.

  • Change the GenerateYamlWithOwnerReferences function to GenerateYamlWithOperatorAnnotation so that Kubeflow resources are annotated to indicate they were deployed through the Kubeflow operator. A couple of special cases are excluded from the annotations. First, for Namespace kind resources, the function checks whether the namespace already exists; if it does, the annotations are not added. This avoids accidentally deleting a namespace that was not created by the operator. The strategy works when the operator and Kubeflow coexist in the same cluster; if remote deployment through the operator is supported in the future, this needs to be revisited. Second, for Profile kind resources, the annotations are not added. Part of the reason is that the Profile CRD currently owns the user profiles (i.e. namespaces): when the Profile CRD is deleted, the user namespaces and data are cascade-deleted. This PR therefore assumes that users will prefer to keep their namespaces and data when Kubeflow is uninstalled, and it does not remove the Profile CRD when the kfdef instance is deleted.

  • Pass a byOperator indicator to the utils.DeleteResource function so that, when it is called by the operator, it only deletes resources with the annotations appended during the Kubeflow deployment.

  • Avoid processing applications with the same name in kfdef multiple times; this also applies to the CLI install. With kustomize v3, the kfdef configuration is allowed to have the same application name (e.g. kubeflow-apps) pointing to different manifest paths. This is useful when a v3 full stack needs to be broken into sub-components: the tool appends all the manifests into a single kustomization.yaml file under the kustomize/kubeflow-apps directory, and we take advantage of this. At runtime, however, when the apply() function loops through the kfdef configuration, it iterates by application name and ends up applying the resources multiple times. This PR skips the apply if the application has already been applied (see the last sketch below).
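
For illustration, here is a minimal sketch of the annotation-based tracking described above, written against k8s.io/apimachinery only. The annotation keys, function names, and the namespaceExists/byOperator parameters are placeholders, not the actual kfctl symbols:

```go
package operator

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// Hypothetical annotation keys; the real keys are defined in the PR.
const (
	annoKfDefName      = "kfdef-instance-name"
	annoKfDefNamespace = "kfdef-instance-namespace"
)

// tagWithOperatorAnnotation sketches what GenerateYamlWithOperatorAnnotation does:
// mark a rendered resource as deployed by the operator, except for pre-existing
// Namespaces and for Profile resources, which must survive a kfdef deletion.
// namespaceExists stands in for a live lookup against the cluster.
func tagWithOperatorAnnotation(u *unstructured.Unstructured, kfdefName, kfdefNamespace string, namespaceExists bool) {
	if u.GetKind() == "Namespace" && namespaceExists {
		return // never claim a namespace the operator did not create
	}
	if u.GetKind() == "Profile" {
		return // keep user profiles (namespaces and data) on uninstall
	}
	anns := u.GetAnnotations()
	if anns == nil {
		anns = map[string]string{}
	}
	anns[annoKfDefName] = kfdefName
	anns[annoKfDefNamespace] = kfdefNamespace
	u.SetAnnotations(anns)
}

// shouldDelete sketches the byOperator check in utils.DeleteResource: when the
// operator drives the deletion (KfDelete, run from the kfdef finalizer), only
// resources carrying the operator annotations are removed.
func shouldDelete(u *unstructured.Unstructured, kfdefName, kfdefNamespace string, byOperator bool) bool {
	if !byOperator {
		return true // kfctl CLI path: delete whatever was rendered
	}
	anns := u.GetAnnotations()
	return anns[annoKfDefName] == kfdefName && anns[annoKfDefNamespace] == kfdefNamespace
}
```

Because deletion now filters on the operator's own annotations, nothing depends on the garbage collector honoring ownerReferences, which is the behavior behind the spurious redeploys reported in #393.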
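
A sketch of the watch-handler changes, with the same caveat: the finalizer string, the function names, and the plain (NamespacedName, bool) return are illustrative; the real controller wires this logic into its watch handler funcs:

```go
package operator

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// Illustrative finalizer name; the actual string is defined by the controller.
const kfdefFinalizer = "kfdef-finalizer.kfdef.apps.kubeflow.org"

// onKfDefCreate sketches the CREATE handler for the kfdef CR: the finalizer is added
// here rather than inside Reconcile(). Persisting it (an API update in the real
// controller) produces the UPDATE event, which is the only event that enqueues a
// request, so the extra Reconcile() call after deployment goes away. The same holds
// on deletion: setting the deletion timestamp is an UPDATE that gets reconciled,
// while the final DELETE event (after the finalizer is removed) enqueues nothing.
func onKfDefCreate(kfdef metav1.Object) (enqueue bool) {
	for _, f := range kfdef.GetFinalizers() {
		if f == kfdefFinalizer {
			return false
		}
	}
	kfdef.SetFinalizers(append(kfdef.GetFinalizers(), kfdefFinalizer))
	return false
}

// requestForWatchedResource sketches how events on resources created by the Kubeflow
// deployment are mapped back to their kfdef instance: only objects carrying the
// hypothetical operator annotations enqueue a reconcile request.
func requestForWatchedResource(obj metav1.Object) (types.NamespacedName, bool) {
	anns := obj.GetAnnotations()
	name, okName := anns["kfdef-instance-name"]
	namespace, okNamespace := anns["kfdef-instance-namespace"]
	if !okName || !okNamespace {
		return types.NamespacedName{}, false // not deployed by this operator; ignore it
	}
	return types.NamespacedName{Namespace: namespace, Name: name}, true
}
```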
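
Finally, a small sketch of the duplicate-application guard; kfDefApplication and its fields are placeholders for the actual kfdef application type:

```go
package operator

// kfDefApplication is a placeholder for an application entry in the kfdef
// configuration; only the name matters for the guard below.
type kfDefApplication struct {
	Name         string
	ManifestPath string
}

// applyOnce loops over the kfdef applications and applies each name only once, even
// when the configuration lists the same name (e.g. kubeflow-apps) several times
// pointing at different manifest paths, since kustomize has already merged those
// manifests into a single kustomization.yaml.
func applyOnce(apps []kfDefApplication, apply func(kfDefApplication) error) error {
	applied := map[string]bool{}
	for _, app := range apps {
		if applied[app.Name] {
			continue // this application name was already applied
		}
		if err := apply(app); err != nil {
			return err
		}
		applied[app.Name] = true
	}
	return nil
}
```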

@kubeflow-bot

This change is Reviewable

Member

@Tomcli Tomcli left a comment


Thanks @adrian555 for the great work.

pkg/controller/kfdef/const.go (review comment resolved)
pkg/controller/kfdef/kfdef_controller.go (review comment resolved)
@Tomcli
Member

Tomcli commented Sep 23, 2020

@adrian555 I'm testing the new operator. Creating and watching the new resources works great. However, when I try to delete the kfdef, it panics due to a nil channel. To recreate this, run:

  1. Deploy kfdef on Kubernetes 1.16
  2. Delete a custom resource, e.g. kubectl delete VirtualService ml-pipeline-ui -n kubeflow
  3. After it is redeployed, delete the kfdef: kubectl delete kfdef --all -n kubeflow

You will see the errors below:
time="2020-09-23T00:44:51Z" level=info msg="Go Version: go1.14.3"
time="2020-09-23T00:44:51Z" level=info msg="Go OS/Arch: linux/amd64"
time="2020-09-23T00:44:51Z" level=info msg="Version of operator-sdk: v0.13.0"
time="2020-09-23T00:44:51Z" level=info msg="Kubeflow version: 1.1.0"
time="2020-09-23T00:44:55Z" level=info msg="Registering Components."
time="2020-09-23T00:44:55Z" level=info msg="Adding controller for kfdef."
time="2020-09-23T00:44:58Z" level=error msg="Cannot create watch for resources Deployment extensions/v1beta1: no matches for kind \"Deployment\" in version \"extensions/v1beta1\"."
time="2020-09-23T00:44:58Z" level=info msg="Controller added to watch on Kubeflow resources with known GVK."
time="2020-09-23T00:45:04Z" level=error msg="Could not create ServiceMonitor object. Error: no ServiceMonitor registered with the API."
time="2020-09-23T00:45:04Z" level=error msg="Install prometheus-operator in your cluster to create ServiceMonitor objects. Error: no ServiceMonitor registered with the API."
time="2020-09-23T00:45:04Z" level=info msg="Starting the Cmd."
time="2020-09-23T00:45:04Z" level=info msg="Got create event for ibmstack.kubeflow."
time="2020-09-23T00:45:04Z" level=info msg="Watch a change for KfDef CR: ibmstack.kubeflow."
time="2020-09-23T00:45:04Z" level=info msg="Reconciling KfDef resources. Request.Namespace: kubeflow, Request.Name: ibmstack."
time="2020-09-23T00:45:04Z" level=info msg="Deleting kfdef."
E0923 00:45:04.959894       1 runtime.go:73] Observed a panic: "close of nil channel" (close of nil channel)
goroutine 1501 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1c35e20, 0x22c7170)
	/Users/tommyli/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:69 +0x7b
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/Users/tommyli/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:51 +0x82
panic(0x1c35e20, 0x22c7170)
	/usr/local/Cellar/go/1.14.3/libexec/src/runtime/panic.go:969 +0x166
github.com/kubeflow/kfctl/v3/pkg/controller/kfdef.(*ReconcileKfDef).Reconcile(0xc000c80580, 0xc0010bd948, 0x8, 0xc0010bd940, 0x8, 0x0, 0x303a37030, 0xc0004d4000, 0xc00002a518)
	/Users/tommyli/go/src/github.com/kubeflow/kfctl/pkg/controller/kfdef/kfdef_controller.go:258 +0xcb5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00042e460, 0x1ccf160, 0xc00024c2e0, 0x0)
	/Users/tommyli/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:216 +0x161
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00042e460, 0xc000a66100)
	/Users/tommyli/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192 +0xae
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc00042e460)
	/Users/tommyli/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc0031ac0f0)
	/Users/tommyli/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152 +0x5f
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0031ac0f0, 0x3b9aca00, 0x0, 0xc000779d01, 0xc00026dc20)
	/Users/tommyli/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc0031ac0f0, 0x3b9aca00, 0xc00026dc20)
	/Users/tommyli/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
	/Users/tommyli/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:157 +0x2fd
panic: close of nil channel [recovered]
	panic: close of nil channel

goroutine 1501 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/Users/tommyli/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:58 +0x105
panic(0x1c35e20, 0x22c7170)
	/usr/local/Cellar/go/1.14.3/libexec/src/runtime/panic.go:969 +0x166
github.com/kubeflow/kfctl/v3/pkg/controller/kfdef.(*ReconcileKfDef).Reconcile(0xc000c80580, 0xc0010bd948, 0x8, 0xc0010bd940, 0x8, 0x0, 0x303a37030, 0xc0004d4000, 0xc00002a518)
	/Users/tommyli/go/src/github.com/kubeflow/kfctl/pkg/controller/kfdef/kfdef_controller.go:258 +0xcb5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00042e460, 0x1ccf160, 0xc00024c2e0, 0x0)
	/Users/tommyli/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:216 +0x161
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00042e460, 0xc000a66100)
	/Users/tommyli/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192 +0xae
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc00042e460)
	/Users/tommyli/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc0031ac0f0)
	/Users/tommyli/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152 +0x5f
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0031ac0f0, 0x3b9aca00, 0x0, 0xc000779d01, 0xc00026dc20)
	/Users/tommyli/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc0031ac0f0, 0x3b9aca00, 0xc00026dc20)
	/Users/tommyli/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
	/Users/tommyli/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:157 +0x2fd

@adrian555
Member Author

@Tomcli this doesn't seem to happen for me. I just ran this again but did not see the error. Were you deploying multiple times with the same operator? Maybe more details on the operations you have done would help us understand the root cause. Thanks.

@Tomcli
Member

Tomcli commented Sep 23, 2020

> @Tomcli this doesn't seem to happen for me. I just ran this again but did not see the error. Were you deploying multiple times with the same operator? Maybe more details on the operations you have done would help us understand the root cause. Thanks.

The extra thing I did here is that I also deleted a CRD to check the watcher.

@Tomcli
Member

Tomcli commented Sep 29, 2020

/lgtm
@adrian555 we should open an issue or leave a note so users know that kfctl delete takes longer to delete resources than the k8s garbage collector does.

Contributor

@animeshsingh animeshsingh left a comment


thanks @adrian555

A few nits, otherwise
/lgtm

@pvaneck @vpavlin @nakfour any comments?

operator.md (outdated, resolved)
operator.md (outdated, resolved)
operator.md (outdated, resolved)
pkg/controller/kfdef/kfdef_controller.go (outdated, resolved)
Co-authored-by: Animesh Singh <[email protected]>
@k8s-ci-robot k8s-ci-robot removed the lgtm label Oct 13, 2020
adrian555 and others added 3 commits October 13, 2020 09:46
Co-authored-by: Animesh Singh <[email protected]>
Co-authored-by: Animesh Singh <[email protected]>
@animeshsingh
Contributor

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: animeshsingh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 20f20a7 into kubeflow:master Oct 14, 2020
crobby pushed a commit to crobby/kfctl that referenced this pull request Feb 25, 2021
* handle generator with envs

* operator fixes

* update Dockerfile.ubi

* address comments

* address comments

* Update operator.md

Co-authored-by: Animesh Singh <[email protected]>

* Update operator.md

Co-authored-by: Animesh Singh <[email protected]>

* Update operator.md

Co-authored-by: Animesh Singh <[email protected]>

* Update pkg/controller/kfdef/kfdef_controller.go

Co-authored-by: Animesh Singh <[email protected]>

Co-authored-by: Animesh Singh <[email protected]>