[Operator] various fixes for Kubeflow Operator #411
Conversation
Thanks @adrian555 for the great work.
@adrian555 I'm testing the new operator. Creating and watching the new resources works great. However, when I try to delete the kfdef, it panics due to a nil channel. To reproduce this, run:
@Tomcli this doesn't seem to happen for me. I just ran this again but did not see the error. Were you deploying multiple times with the same operator? More details on the operations you ran would help identify the root cause. Thanks.
The extra thing I did here is that I also deleted a CRD to check the watcher.
/lgtm
Co-authored-by: Animesh Singh <[email protected]>
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: animeshsingh. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
* handle generator with envs
* operator fixes
* update Dockerfile.ubi
* address comments
* address comments
* Update operator.md
* Update operator.md
* Update operator.md
* Update pkg/controller/kfdef/kfdef_controller.go

Co-authored-by: Animesh Singh <[email protected]>
This PR addresses several issues related to the Kubeflow Operator:
First reported under Operator reconciling and redeploying the kfdef without known reason #393, the operator would redeploy Kubeflow whenever the cluster's master was restarted. This has been reproduced, and the root cause is related to garbage collection: the Kubeflow Operator set ownerReferences on the resources it created and relied on garbage collection to delete those resources when the kfdef instance was deleted. Since the garbage-collection problem will not be fixed in the Kubernetes project for a while, we take the approach other projects have adopted and use annotations instead.
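The annotation-based ownership can be sketched roughly as follows; the annotation key and helper name here are illustrative, not the operator's actual identifiers:

```go
package main

import "fmt"

// kfdefAnnotation is an illustrative key; the operator's real key may differ.
const kfdefAnnotation = "kfctl.kubeflow.io/kfdef-instance"

// tagResource records the owning kfdef's name and namespace in the
// resource's annotations instead of setting an ownerReference, so the
// operator (not the garbage collector) decides when to delete it.
func tagResource(annotations map[string]string, name, namespace string) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[kfdefAnnotation] = fmt.Sprintf("%s.%s", name, namespace)
	return annotations
}

func main() {
	a := tagResource(nil, "kubeflow", "kubeflow")
	fmt.Println(a[kfdefAnnotation]) // kubeflow.kubeflow
}
```

Because deletion is now driven by reading this annotation back, a master restart that confuses the garbage collector no longer causes spurious redeploys.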
As part of the Kubeflow deployment, many new CRDs are created that are not known to the operator's current controller. One example is the Application kind: those resources were not actually watched by the controller, so when they were deleted the operator was not notified. To fix this, the PR adds a new controller that watches these GVKs once the first kfdef instance is created successfully.

The PR also reduces the number of reconcile requests when creating and deleting a kfdef instance. Two events occur when a kfdef CR is applied: first a CREATE event, and then, when the Reconcile() function kicks in and adds the finalizer to the CR, an UPDATE event. This results in one extra Reconcile() call after Kubeflow is successfully deployed. Two events also occur when the kfdef CR is deleted: first an UPDATE event, because the kfdef resource has a finalizer and the delete action adds a deletion timestamp to the CR, and then a DELETE event once the finalizer is finally removed. This results in one extra Reconcile() call after the kfdef CR is actually deleted. This PR instead adds the finalizer in the watch handler func when the kfdef instance is created, so only UPDATE events queue a request: the UPDATE carrying the deletion timestamp is what the reconcile func handles, and the final DELETE, fired after the finalizer is removed, queues nothing.
For resources created during the Kubeflow deployment, verify that their annotations contain the kfdef instance's name and namespace, then queue a request for reconciling.
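Mapping a watched resource back to its kfdef instance might look like this sketch (annotation key and helper name are hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// kfdefAnnotation is an illustrative key; the operator's real key may differ.
const kfdefAnnotation = "kfctl.kubeflow.io/kfdef-instance"

// ownerOf returns the kfdef instance (name, namespace) that a watched
// resource belongs to; ok=false means the resource carries no operator
// annotation and should not trigger a reconcile.
func ownerOf(annotations map[string]string) (name, namespace string, ok bool) {
	v, found := annotations[kfdefAnnotation]
	if !found {
		return "", "", false
	}
	parts := strings.SplitN(v, ".", 2)
	if len(parts) != 2 {
		return "", "", false
	}
	return parts[0], parts[1], true
}

func main() {
	n, ns, ok := ownerOf(map[string]string{kfdefAnnotation: "kubeflow.kubeflow"})
	fmt.Println(n, ns, ok) // kubeflow kubeflow true
}
```

Resources without the annotation (for example, a namespace that predates the operator) are simply ignored by the watch.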
Now that the operator no longer depends on garbage collection to delete resources, this PR adds a KfDelete() function to handle deletion. It runs as part of the kfdef instance's finalizer.
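The finalizer-driven cleanup can be sketched as follows; the struct, finalizer name, and function shape are simplified stand-ins, with the cleanup callback playing the role of KfDelete():

```go
package main

import "fmt"

// kfdefFinalizer is an illustrative finalizer name.
const kfdefFinalizer = "kfdef-finalizer.kfdef.apps.kubeflow.org"

type kfDef struct {
	Finalizers        []string
	DeletionTimestamp string // non-empty once the CR is being deleted
}

// finalize models the deletion path: when the CR carries a deletion
// timestamp, run the cleanup (KfDelete in the real operator) and strip
// the finalizer so Kubernetes can actually remove the CR.
func finalize(kf *kfDef, cleanup func() error) error {
	if kf.DeletionTimestamp == "" {
		return nil // not being deleted; nothing to do
	}
	if err := cleanup(); err != nil {
		return err // keep the finalizer so deletion is retried
	}
	kept := kf.Finalizers[:0]
	for _, f := range kf.Finalizers {
		if f != kfdefFinalizer {
			kept = append(kept, f)
		}
	}
	kf.Finalizers = kept
	return nil
}

func main() {
	kf := &kfDef{Finalizers: []string{kfdefFinalizer}, DeletionTimestamp: "now"}
	_ = finalize(kf, func() error { return nil })
	fmt.Println(len(kf.Finalizers)) // 0
}
```

Keeping the finalizer on cleanup failure is what makes deletion retryable instead of leaking resources.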
Change the GenerateYamlWithOwnerReferences function to GenerateYamlWithOperatorAnnotation so that Kubeflow resources are given annotations indicating they were deployed through the Kubeflow Operator. A couple of special cases are excluded from the annotations. First, for Namespace kind resources, the function checks whether the namespace already exists; if it does, the annotations are not added. This avoids the accidental deletion of a namespace that was not created by the operator. This strategy works while the operator and Kubeflow coexist in the same cluster; if remote deployment through the operator is supported in the future, this needs revisiting. Second, Profiles kind resources never get the annotations. Part of the reason is that the profiles CRD is currently the owner of user profiles (i.e. namespaces): when the profiles CRD is deleted, the user namespaces and data are cascade-deleted. This PR therefore assumes users will prefer to keep their namespaces and data when Kubeflow is uninstalled, and it does not remove the profiles CRD when the kfdef instance is deleted.

Pass a byOperator indicator to the utils.DeleteResource function so that, when called by the operator, it deletes only the resources carrying the annotations added during the Kubeflow deployment.

Avoid processing applications with the same name in the kfdef multiple times. This also applies to the CLI install. With kustomize v3, the kfdef configuration is allowed to list the same application name (e.g.
kubeflow-apps
) pointing to different manifest paths. This is useful for use cases where a v3 full stack needs to be broken into sub-components. The tool appends all the manifests into one single kustomization.yaml file under the kustomize/kubeflow-apps directory, and we take advantage of this. At runtime, however, when the apply() function loops through the kfdef configuration, it iterates by application name and ends up applying the resources multiple times. This PR skips the apply if the application has already been applied.
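The skip-if-already-applied loop amounts to simple name de-duplication; a minimal sketch (the function name and callback shape are illustrative, not the actual kfctl code):

```go
package main

import "fmt"

// applyOnce applies each application name a single time, even when the
// kfdef lists it repeatedly: kustomize v3 has already merged all entries
// with the same name into one kustomization.yaml, so one apply suffices.
func applyOnce(appNames []string, apply func(string)) int {
	seen := map[string]bool{}
	applied := 0
	for _, name := range appNames {
		if seen[name] {
			continue // already applied under this name
		}
		seen[name] = true
		apply(name)
		applied++
	}
	return applied
}

func main() {
	n := applyOnce([]string{"kubeflow-apps", "kubeflow-apps", "istio"}, func(string) {})
	fmt.Println(n) // 2
}
```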