-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Goroutine memory leak #397
Conversation
Signed-off-by: jooho <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Jooho -- this is my third attempt to fully grasp the problem and your code changes. Without adding 1000 namespaces and a memory profiler, do you have a reasonably easy way to test the code before and after. Even a manual setup with 2 or 3 namespaces and a few kubectl
commands to verify the state of the cluster/MM deployment resources and/or log messages.
@@ -24,6 +24,10 @@ import ( | |||
|
|||
func modelMeshEnabled(n *corev1.Namespace, controllerNamespace string) bool { | |||
if v, ok := n.Labels["modelmesh-enabled"]; ok { | |||
// Returns false if the namespace state is terminating even though the namespace have the 'modelmesh-enabled=true' label. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is helpful. Could you add comments to each of the changed code blocks above to describe what the code does and why (what problem it is solving).
@@ -119,24 +119,34 @@ func (r *ServiceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct | |||
namespace = req.Name | |||
n := &corev1.Namespace{} | |||
if err := r.Client.Get(ctx, req.NamespacedName, n); err != nil { | |||
if k8serr.IsNotFound(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, controller keep checking namespaces even though the namespaces do not exist anymore.
As a result, a lot of misleading error messages showed up in the log. (I don't find the snippet of the message now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I think the ServingRuntime controller also has this problem (erroring when reconciling after a namespace is deleted). I was getting similar error logs from it. But that can be resolved in a separate PR.
I noticed that there is a difference in logic in this PR whether the namespace is terminating (cleanup resources created for the namespace) or if the namespace doesn't exist (skip reconciliation). I think there is a chance that a reconciliation might not see that a namespace is in the terminating state by the time it reconciles. If the namespace has been deleted, we'd still want to ensure disconnect/cleanup of the MMService and run ModelEventStream.RemoveWatchedService
. If the cleanup is moved to the namespace-deleted case then we would not even need the check to see if the namespace is terminating
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tjohnson31415 Thanks for the review. From my 2 cents, the operator keeps watching the namespace so I don't think there is a possibility of missing the terminated status of the Namespace object. When any namespace changes its status, this operator detects it. WDYT?
Regarding the IsNotFound
part, do you want me to update this pr without it? and send another pr to update it when this original PR is merged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one more comment about the namespace object status. The main issue was the operator does not have logic when the namespace is under Terminating
status(source). From my understanding, Terminating
is the only problematic status in this procedure
Active
+modelmesh-enabled"
label ==> modelMeshEnabled return trueTerminating
+modelmesh-enabled"
label ==> modelMeshEnabled return trueActive
+ no label ==> modelMeshEnabled return false
The other status does not impact this logic.
The most important point here is, that when any namespace's status changes, this reconclie will be called and check the above logic. When any namespace gets terminate
signal, the status will be changed to Terminating
.
please correct me if something is wrong here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the "happy path" case, I think that reconciliation should always be able to observe the terminating state. In error/edge cases (eg. many namespaces deleted in bulk, error encountered during reconciliation and a requeue happening), I'm not so sure since "Terminating" is a temporary status.
Before this PR, the reconciliation handled neither "Terminating" nor "Not Found". This PR implements clean-up if the namespace is seen as Terminating, but, If reconciliation happens for a namespace that no longer exists, should the code:
(a) Assume that in-memory resources for that namespace do not exists and just skip reconciliation (i.e. assume that if modelmesh was enabled for that namespace that it previously reconciled while it was Terminating)
(b) Check for in-memory resources that exist and clean them up if they do
With the "level-based" reconciliation that Operators are meant to implement, I think (b) makes more sense as it reconciles based on the current state, without assumption of past reconciliation.
if err := r.List(ctx, sl, client.HasLabels{"modelmesh-service"}, client.InNamespace(namespace)); err != nil { | ||
return ctrl.Result{}, err | ||
} else { | ||
for i := range sl.Items { | ||
s := &sl.Items[i] | ||
if err2 := r.Delete(ctx, s); err2 != nil && err == nil { | ||
err = err2 | ||
if err := r.Delete(ctx, s); err != nil { | ||
return ctrl.Result{}, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is the same logic but I tried to separate error handling.
The logic is
- If the namespace is not for modelmesh anymore, it will delete modelmesh Service when it exists.
- If the namespace is terminated, it does not need to delete the modelmesh Service because it will be gone with the namespace
controllers/service_controller.go
Outdated
//requeue is never expected here | ||
if _, err, _ := r.reconcileService(ctx, mms, namespace, owner); err != nil { | ||
return ctrl.Result{}, err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the namespace is not for modelmesh anymore, it should trigger reconcileService for MMService list that manages the goroutines.
controllers/service_controller.go
Outdated
|
||
n := &corev1.Namespace{} | ||
if err := r.Client.Get(ctx, types.NamespacedName{Name: namespace}, n); err != nil { | ||
return nil, err, false | ||
} | ||
|
||
if !modelMeshEnabled(n, r.ControllerDeployment.Namespace) { | ||
r.ModelEventStream.RemoveWatchedService(serviceName, namespace) | ||
r.Log.Info("Deleted Watched Service", "name", serviceName, "namespace", namespace) | ||
return nil, nil, false | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will remove the goroutine when modelmesh is not enabled for a namespace.
- when the namespace does not have the annotation
modelmesh-enabled
- when the namespace is under a
Terminating
state.
@ckadner added comments to the source. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to verify that these changes prevent a growth in the number of go routines when creating and deleting namespaces. To see the count of go routines, I used the pprof patch linked above.
The recent change to add more comments has caused the build to fail because the make fmt
check doesn't pass.
Also, I'm not sure that the reconciler will always see the namespace as terminating. The cleanup that currently happens if the namespace is terminating could be moved to happen when the namespace has been deleted. Then there wouldn't need to be an additional check for the terminating state.
@@ -119,24 +119,34 @@ func (r *ServiceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct | |||
namespace = req.Name | |||
n := &corev1.Namespace{} | |||
if err := r.Client.Get(ctx, req.NamespacedName, n); err != nil { | |||
if k8serr.IsNotFound(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I think the ServingRuntime controller also has this problem (erroring when reconciling after a namespace is deleted). I was getting similar error logs from it. But that can be resolved in a separate PR.
I noticed that there is a difference in logic in this PR whether the namespace is terminating (cleanup resources created for the namespace) or if the namespace doesn't exist (skip reconciliation). I think there is a chance that a reconciliation might not see that a namespace is in the terminating state by the time it reconciles. If the namespace has been deleted, we'd still want to ensure disconnect/cleanup of the MMService and run ModelEventStream.RemoveWatchedService
. If the cleanup is moved to the namespace-deleted case then we would not even need the check to see if the namespace is terminating
Signed-off-by: jooho <[email protected]>
3ef9bba
to
4a014c8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Jooho
/lgtm
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: ckadner, Jooho The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Motivation
The goroutines watching ETCD data are never released from memory until the process is down because the logic to stop the goroutines does not exist. It only deletes Service
Moreover, it tries to get data for a namespace even if the namespace does not exist. It causes a lot of garbage logs like the following
ISSUE: #385
Modifications
modelmesh-enabled
is removed from a namespace or the namespace is deleted.Result
When the lable
modelmesh-enabled
is added to a namespace, 7 Goroutines are created but if the label is removed or the namespace is removed, 7 Goroutines will be stopped. The Goroutines count can be checked with pprof.Unfortunately, Modelmesh does not have pprof pkg so it is not possible to check it directly.
In order to test this, pprof pkg needs to be added to main. This PR has the code example. If Modelmesh needs this part, I will contribute this with another PR.