Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.14.0+: Memory leaks on backups jobs (CSI backups + cloud plugin) #7925

Closed
jkroepke opened this issue Jun 25, 2024 · 58 comments · Fixed by #8012
Closed

v1.14.0+: Memory leaks on backups jobs (CSI backups + cloud plugin) #7925

jkroepke opened this issue Jun 25, 2024 · 58 comments · Fixed by #8012
Assignees
Milestone

Comments

@jkroepke
Copy link

jkroepke commented Jun 25, 2024

What steps did you take and what happened:

Setup scheduled Backups via CSI driver. No restic, kopia or other file based copy action are in use.

Memory leak.

image

We have an additional to the failed daily schedule a hourly schedule. In grafana, I can see a memory increase after each backup.

What did you expect to happen:

No memory leak

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle, and attach to this issue, more options please refer to velero debug --help

Does work from a pod with in-cluster authentication:

velero debug  --backup opsstack-velero-daily-opsstack-20240625003049 --namespace opsstack
2024/06/25 15:38:09 Collecting velero resources in namespace: opsstack
An error occurred: exec failed: Traceback (most recent call last):
  velero-debug-collector:23:13: in <toplevel>
  <builtin>: in kube_get
Error: could not initialize search client: stat /tmp/.kube/config: no such file or directory

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

velero.zip

Anything else you would like to add:

Environment:

  • Velero version (use velero version): v1.14.0
  • Velero features (use velero client config get features): <NOT SET>
  • Kubernetes version (use kubectl version): v1.29.4
  • Kubernetes installer & version: AKS
  • Cloud provider or hardware configuration: AKS
  • OS (e.g. from /etc/os-release): Azure Linux

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@blackpiglet blackpiglet added Area/CSI Related to Container Storage Interface support Performance labels Jun 26, 2024
@blackpiglet
Copy link
Contributor

Could you give more information about the two metrics velero REQESTS and velero LIMITS?
Are those metrics for the Velero server pods?

They seems have default values.

	// default values for Velero server pod resource request/limit
	DefaultVeleroPodCPURequest = "500m"
	DefaultVeleroPodMemRequest = "128Mi"
	DefaultVeleroPodCPULimit   = "1000m"
	DefaultVeleroPodMemLimit   = "512Mi"

@blackpiglet blackpiglet self-assigned this Jun 26, 2024
@jkroepke
Copy link
Author

Yes, thats the Velero Pod:

Mem Request: 512.00 MiB
Mem Limit: 1.00 GiB

CPU Req: 0.1000
CPU Limit: 1.0000

Memory Usage:

image

After each Peak, velero was OOMed

History Memory Usage (we do updates each week):

image

Last 3 hours (mention we have one hourly schedule)

image

@blackpiglet
Copy link
Contributor

blackpiglet commented Jun 26, 2024

Could you share more details on the backup workload?
How many volumes are involved?

@jkroepke
Copy link
Author

jkroepke commented Jun 26, 2024

10 Volumes via CSI Driver.

A working backups typically saves 1513 Kubernetes resources.

We experience this issue in approx. 10 Kubernetes clusters. All 10 clusters configured identically (10 volumes, ~1500 resources)

@jkroepke
Copy link
Author

I guess, I found the root cause.

For each backup, the azure plugin is spawned, but not terminated.

root     3353897  0.0  0.0 1236380 11452 ?       Sl   Jun19   3:56 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 98762dde1ae7a3f1a11e0894ba14c8b4c1f596017280b388b97a104816907683 -address /run/containerd/containerd.sock
65527    3353931  0.0  0.0    972     4 ?        Ss   Jun19   0:00  \_ /pause
65527    4167441  0.3  0.6 5561796 107432 ?      Ssl  Jun25   6:59  \_ /velero server --uploader-type=kopia --default-backup-storage-location=opsstack --default-volume-snapshot-locations=velero.io/azure:opsstack --log-format=json --features=EnableCSI --namespace=opsstack
65527       9334  0.0  0.2 1395916 38048 ?       Sl   Jun25   0:02      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527      78788  0.0  0.2 1395916 37500 ?       Sl   Jun25   0:02      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     148224  0.0  0.2 1395916 36780 ?       Sl   Jun25   0:02      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     217993  0.0  0.2 1395916 38040 ?       Sl   Jun25   0:02      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     287368  0.0  0.2 1395916 36944 ?       Sl   Jun25   0:02      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     356756  0.0  0.2 1395916 36800 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     426057  0.0  0.2 1395916 37744 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     495363  0.0  0.2 1395916 37848 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     564793  0.0  0.2 1395916 36780 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     633070  0.0  0.2 1395916 37848 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     702396  0.0  0.2 1395916 38216 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     771675  0.0  0.2 1395916 37144 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     840974  0.0  0.2 1395916 36844 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     910286  0.0  0.2 1395916 36400 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527     979589  0.0  0.2 1395916 37984 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1048695  0.0  0.2 1395916 37004 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1118222  0.0  0.2 1395916 36428 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1187685  0.0  0.2 1395916 37888 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1257170  0.0  0.2 1395916 37488 ?       Sl   Jun25   0:01      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1326865  0.0  0.2 1395916 36652 ?       Sl   Jun25   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1397481  0.0  0.2 1395916 38296 ?       Sl   Jun25   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1465597  0.0  0.2 1395916 36980 ?       Sl   Jun25   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1535111  0.0  0.2 1395916 37020 ?       Sl   Jun25   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1604535  0.0  0.2 1395916 37452 ?       Sl   00:01   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1638944  0.0  0.2 1395916 37176 ?       Sl   00:31   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1674221  0.0  0.2 1395916 40172 ?       Sl   01:01   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1743626  0.0  0.2 1395916 35636 ?       Sl   02:01   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1812856  0.0  0.2 1395916 37392 ?       Sl   03:01   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1882257  0.0  0.2 1396172 37320 ?       Sl   04:01   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    1951690  0.0  0.2 1395916 36572 ?       Sl   05:01   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    2021143  0.0  0.2 1396172 38152 ?       Sl   06:01   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    2090248  0.0  0.2 1395916 37876 ?       Sl   07:01   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527    2159473  0.0  0.2 1395916 37324 ?       Sl   08:01   0:00      \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI

@blackpiglet
Copy link
Contributor

I see.
The Azure plugin allocates 100M on startup, so if enough Azure plugin instances pile up, it could cause the OOM.

@jkroepke
Copy link
Author

In our case, we will remove the Azure Plugin in for now. It was a left over from the migration to CSI based backups and all old azure plugin based Backups are out of the rotation.

@blackpiglet
Copy link
Contributor

velero-plugin-for-microsoft-azure is needed for communicating with Azure Object Storage to store the backup and restore metadata.
So maybe it cannot be removed for the AKS environment.

@jkroepke
Copy link
Author

Azure Object Storage to store the backup and restore metadata.

I though, that part is bundled into velero?

https://github.com/vmware-tanzu/velero/blob/main/pkg/util/azure/storage.go

@blackpiglet
Copy link
Contributor

That part of code is used for filesystem backup to interact with cloud providers.

@jkroepke
Copy link
Author

okay ,thanks for confirmation. I guess then we may stuck

@sseago
Copy link
Collaborator

sseago commented Jun 26, 2024

@blackpiglet Hmm. In my experience, the plugin process typically gets terminated after each backup -- or more specifically, after the controller reconcile exits, since the GRPC connection is terminated. Maybe we should investigate whether 1) something has changed in 1.14 or 2) some issue with one or more plugins is preventing their exit?

@jkroepke
Copy link
Author

We have only Azure Plugin active. We had CSI Plugin in history, but its now merged in 1.14

@blackpiglet
Copy link
Contributor

@sseago
Agree. I will investigate whether something has changed in the plugin workflow, but we had met a similar situation in the E2E test.
https://github.com/vmware-tanzu/velero-plugin-for-microsoft-azure/pull/210/files
This PR was created to mitigate the OOM issue met in the AKS environment, but it's still possible to hit this issue when object storage consumes a significant amount of memory.

So it seems the plugins do not always exist right after it completes, and that behavior is not introduced recently.

@jkroepke
Copy link
Author

Our current workaround is now a cronjob that restart velero all 2 hours.

@jkroepke
Copy link
Author

So it seems the plugins do not always exist right after it completes, and that behavior is not introduced recently

We check a cluster running with 1.13 velero and we can observe that no plugin instances are left.

@sseago
Copy link
Collaborator

sseago commented Jun 27, 2024

Our current workaround is now a cronjob that restart velero all 2 hours.

@jkroepke The risk here is that if you restart velero while a backup or restore is running, that backup/restore will be marked failed since an in-progress action can't be resumed after restart/crash.

@jkroepke
Copy link
Author

Yep, but without that cronjob, it's guaranteed that Velero get OOMed on backup.

Does velero not support graceful shutdown? If a backup/restore is running, I would expect that SIGTERM is delayed until the backup/restore process is finished.

@blackpiglet
Copy link
Contributor

@jkroepke
Could you please check the version of the Velero Azure plugin?

@jkroepke
Copy link
Author

jkroepke commented Jul 3, 2024

@blackpiglet velero/velero-plugin-for-microsoft-azure:v1.10.0 - on initial report, we used 1.9 but then we also upgrade to 1.10 without any difference

@anshulahuja98
Copy link
Collaborator

We have been seeing frequent OOMKilled in restores, where we have Limit set as 1Gi.
Tracking this if we figure out some pattern.
We use 1.13 in downstream

@jkroepke
Copy link
Author

jkroepke commented Jul 5, 2024

We have been seeing frequent OOMKilled in restores, where we have Limit set as 1Gi. Tracking this if we figure out some pattern. We use 1.13 in downstream

Please maybe create an dedicated issue for this.

@jkroepke jkroepke changed the title v1.14.0: Memory leaks v1.14.0: Memory leaks on backups jobs (CSI backups + azure plugin) Jul 5, 2024
@jkroepke jkroepke changed the title v1.14.0: Memory leaks on backups jobs (CSI backups + azure plugin) v1.14.0+: Memory leaks on backups jobs (CSI backups + azure plugin) Jul 5, 2024
@anshulahuja98
Copy link
Collaborator

Sure, will avoid any confusion here.

@duduvaa
Copy link

duduvaa commented Jul 9, 2024

The issue isn't related only to Azure plugin.
I ran CSI backups, using Velero scheduling. All backups were completed OK.

Total Backup: 21
Total plugins instances:21

$ ps -ef | grep velero
root      137860       1  0 09:02 ?        00:00:03 /usr/bin/conmon -b /run/containers/storage/overlay-containers/658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b/userdata -c 658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-adp_velero-77b95849d5-44cjw_2cf677d9-b902-47ae-9c06-42971ac78ab3/velero/0.log --log-level info -n k8s_velero_velero-77b95849d5-44cjw_openshift-adp_2cf677d9-b902-47ae-9c06-42971ac78ab3_0 -P /run/containers/storage/overlay-containers/658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b/userdata/conmon-pidfile -p /run/containers/storage/overlay-containers/658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio --syslog -u 658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b -s
1000830+  137872  137860  1 09:02 ?        00:01:13 /velero server -features=EnableCSI --uploader-type=kopia --fs-backup-timeout=2880m --restore-resource-priorities=securitycontextconstraints,customresourcedefinitions,klusterletconfigs.config.open-cluster-management.io,managedcluster.cluster.open-cluster-management.io,namespaces,roles,rolebindings,clusterrolebindings,klusterletaddonconfig.agent.open-cluster-management.io,managedclusteraddon.addon.open-cluster-management.io,storageclasses,volumesnapshotclass.snapshot.storage.k8s.io,volumesnapshotcontents.snapshot.storage.k8s.io,volumesnapshots.snapshot.storage.k8s.io,datauploads.velero.io,persistentvolumes,persistentvolumeclaims,serviceaccounts,secrets,configmaps,limitranges,pods,replicasets.apps,clusterclasses.cluster.x-k8s.io,endpoints,services,,clusterbootstraps.run.tanzu.vmware.com,clusters.cluster.x-k8s.io,clusterresourcesets.addons.cluster.x-k8s.io --log-level debug --disable-informer-cache=false
1000830+  180946  137872  0 09:21 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  192363  137872  0 09:26 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  203868  137872  0 09:31 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  215300  137872  0 09:36 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  226758  137872  0 09:41 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  238226  137872  0 09:46 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  249688  137872  0 09:51 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  261103  137872  0 09:56 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  272539  137872  0 10:01 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  283993  137872  0 10:06 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  295429  137872  0 10:11 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  306872  137872  0 10:16 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  318247  137872  0 10:21 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  329665  137872  0 10:26 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  341128  137872  0 10:31 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  352609  137872  0 10:36 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  364052  137872  0 10:41 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  375590  137872  0 10:46 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  387024  137872  0 10:51 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  398526  137872  0 10:56 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+  409917  137872  0 11:01 ?        00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
core      418429  139617  0 11:05 pts/0    00:00:00 grep --color=auto velero
root      751578  751549  0 Jun24 ?        01:21:51 /velero node-agent server

@jkroepke
Copy link
Author

Did you see, that I attached a zip file with all debug logs? According to the issue template, I do all the things manually, what velero debug does.

@duduvaa
Copy link

duduvaa commented Jul 11, 2024

@jkroepke,

Can you pls share the commands?

thanks

@jkroepke
Copy link
Author

_If you are using earlier versions:_
Please provide the output of the following commands (Pasting long output into a [GitHub gist](https://gist.github.com) or other pastebin is fine.)
- `kubectl logs deployment/velero -n velero`
- `velero backup describe <backupname>` or `kubectl get backup/<backupname> -n velero -o yaml`
- `velero backup logs <backupname>`
- `velero restore describe <restorename>` or `kubectl get restore/<restorename> -n velero -o yaml`
- `velero restore logs <restorename>`

@sseago
Copy link
Collaborator

sseago commented Jul 11, 2024

@jkroepke @duduvaa Since both of you have seen this -- are there specific circumstances that reproduce it? Only with datamover? Or CSI-no-datamover? Or fs-backup? With all workloads or only some?

@duduvaa
Copy link

duduvaa commented Jul 11, 2024

it reproduced with:
CSI
CSI+DataMover

Kopia Backup - working fine

@jkroepke
Copy link
Author

We have 2 scheudles with only CSI backups enabled. + Azure Blob Storage for manifests.

@sseago
Copy link
Collaborator

sseago commented Jul 11, 2024

@jkroepke @duduvaa That's good information. If kopia backup doesn't do this, this suggests that we're closing the object store plugin processes properly in the regular backup controller. The backup operation controller only reconciles when we have async operations -- which only happens for CSI and Datamover backups, so it may be in that code where this is occurring. It gives us a more focused area to dig into this.

@jkroepke
Copy link
Author

In our case, we do Azure File snapshots vis csi. That might be special.

@sseago
Copy link
Collaborator

sseago commented Jul 11, 2024

@jkroepke I'm seeing it in my dev cluster with aws object store plugin and CSI snapshots (no datamover). If I use fs-backup, I don't see it.

@sseago
Copy link
Collaborator

sseago commented Jul 11, 2024

Some debug progress, but still more to do. We know this happens only for CSI or CSI+datamover backups. I've isolated this to the backup finalizer controller. From the added logging, we see in the other controllers we initialize the needed plugins, then at end of reconcile, the relevant plugins are killed. In the finalizer controller case, though it's a bit different. In my cluster, 3 plugins were initialized (internal, object store, and the OADP openshift BIA/RIA plugin) -- then all 3 of these were killed. However, immediately after killing the aws plugin (again this is due to the defer pluginManager.CleanupClients() call at end of finalizer reconcile), a new plugin process starts up. This is triggered by a restartableProcess.ResetIfNeeded() call -- I still need to dig into why this is happening at the end of the reconcile. Maybe some embedded defer call somewhere I'm not seeing? I'm not sure yet. I'll dig into it again tomorrow.

@jkroepke
Copy link
Author

If it helps, you can provide an extensive log build that I will run on my cluster.

@xlgao-zju
Copy link

image

We met the same issue here.

@yuanqijing
Copy link

yuanqijing commented Jul 12, 2024

@jkroepke @sseago hi, i have the same issue, i suspect the codes here introduce the processes leak.

velero/pkg/backup/backup.go

Lines 729 to 743 in a91d2cb

backupStore, volumeInfos, err := kb.getVolumeInfos(*backupRequest.Backup, log)
if err != nil {
log.WithError(err).Errorf("fail to get the backup VolumeInfos for backup %s", backupRequest.Name)
return err
}
if err := updateVolumeInfos(volumeInfos, unstructuredDataUploads, asyncBIAOperations, log); err != nil {
log.WithError(err).Errorf("fail to update VolumeInfos for backup %s", backupRequest.Name)
return err
}
if err := putVolumeInfos(backupRequest.Name, volumeInfos, backupStore); err != nil {
log.WithError(err).Errorf("fail to put the VolumeInfos for backup %s", backupRequest.Name)
return err
}

On my machine, I deleted this section of the code, and the issue of leaks did not occur. but, the exact reason why this code was causing a process leak is unclear.

@yuanqijing
Copy link

I think i found the reason. The issue appears to be that each FinalizeBackup call invokes getVolumeInfos to obtain a storage client and refresh volume information. The getVolumeInfos function creates a new pluginManager and uses defer CleanupClients to clean up the process. However, after obtaining the client, the caller proceeds to upload the backup data, during which a new process is created but not cleaned up. Consequently, this newly created process remains persistent.

@anshulahuja98
Copy link
Collaborator

I think i found the reason. The issue appears to be that each FinalizeBackup call invokes getVolumeInfos to obtain a storage client and refresh volume information. The getVolumeInfos function creates a new pluginManager and uses defer CleanupClients to clean up the process. However, after obtaining the client, the caller proceeds to upload the backup data, during which a new process is created but not cleaned up. Consequently, this newly created process remains persistent.

Which all versions are impacted due to this?

@yuanqijing
Copy link

I think i found the reason. The issue appears to be that each FinalizeBackup call invokes getVolumeInfos to obtain a storage client and refresh volume information. The getVolumeInfos function creates a new pluginManager and uses defer CleanupClients to clean up the process. However, after obtaining the client, the caller proceeds to upload the backup data, during which a new process is created but not cleaned up. Consequently, this newly created process remains persistent.

Which all versions are impacted due to this?

The problem is introduced in #7554 , i think all verstions after that commit are impacted.

@sseago
Copy link
Collaborator

sseago commented Jul 12, 2024

@anshulahuja98 @yuanqijing Yes, I see the problem. The pattern of getting a plugin manager and deferring cleanup clients is used elsewhere to guarantee that the plugin process exits. However, in this case we're killing the plugin process while at the same time returning the backup store -- so the next time that someone uses that backup store, it re-initializes the plugin, but at this point it's no longer connected to a plugin manager, so it's never cleaned up.

	pluginManager := kb.pluginManager(log)
	defer pluginManager.CleanupClients()

	backupStore, storeErr := kb.backupStoreGetter.Get(location, pluginManager, log)
	if storeErr != nil {
		return nil, nil, storeErr
	}

	volumeInfos, err := backupStore.GetBackupVolumeInfos(backup.Name)
	if err != nil {
		return nil, nil, err
	}

	return backupStore, volumeInfos, nil
}

We need to get the plugin manager at a higher level in the call stack so we clean up after we're done with the backup store.

@sseago
Copy link
Collaborator

sseago commented Jul 12, 2024

I've submitted #8012 which passes in the already-initialized plugin manager from the finalizer controller, so we don't need to spawn a new plugin process for get/put volume info. I've tested this locally and it resolves the leak.

@sseago
Copy link
Collaborator

sseago commented Jul 12, 2024

Once #8012 merges we should CP to release-1.14 for 1.14.1.

@Lyndon-Li Lyndon-Li added this to the v1.15 milestone Jul 15, 2024
@Lyndon-Li
Copy link
Contributor

Reopen for cherry-pick.

@Lyndon-Li
Copy link
Contributor

Close as completed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment