
Make kopia repo cache policy configurable #7620

Closed
Lyndon-Li opened this issue Apr 3, 2024 · 8 comments
Comments

@Lyndon-Li
Contributor

Lyndon-Li commented Apr 3, 2024

Related to issue #7499.
The cache policy determines the root file system disk usage in the pod where the data movement runs, and it also significantly impacts restore performance.
Therefore, instead of relying only on the hard-coded policy, we should expose some of the cache settings to users so that they can control the cache disk usage.
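For reference, standalone kopia already exposes knobs of this kind through its CLI. Velero embeds kopia as a library, so these commands don't apply to the node-agent directly, but they show roughly what could be surfaced (exact flags vary by kopia version):

# Standalone kopia CLI cache settings (illustrative only):
kopia cache info      # show cache location and current sizes
kopia cache set --content-cache-size-mb=2000 --metadata-cache-size-mb=2000
kopia cache clear     # drop the local cache entirely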

@u3813

u3813 commented Apr 25, 2024

Since #7499 has been closed, I'm going to comment here. I am experiencing the same behaviour as described in #7499 (comment) and large Kopia restores are not working for me currently. As a result, I had to switch back to Restic for the moment.

I'm using Velero v1.13.0 (vmware-tanzu/velero 6.0.0) to restore a PV that has around 600 GB worth of data in a single namespace. My nodes are running out of ephemeral storage during the restore:

# node-agent Pod
  Normal   Scheduled  52s   default-scheduler  Successfully assigned velero/node-agent-kmrpk to affeccf5-d3c9-402b-a739-13021a4b0668
  Warning  Evicted    53s   kubelet            The node had condition: [DiskPressure].

Using the command kubectl get --raw "/api/v1/nodes/affeccf5-d3c9-402b-a739-13021a4b0668/proxy/stats/summary" I was able to tell that the node-agent's ephemeral-storage usedBytes kept growing until my node's total allocatable ephemeral storage was reached (60584402435 allocatable). At this point the assigned node-agent Pod was killed, a new one was spawned and the restore failed.
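For anyone debugging the same thing, the per-pod usage can be pulled out of that same kubelet stats endpoint (assuming the usual summary layout and that jq is available):

# Per-pod ephemeral-storage usage of the node-agent from the kubelet stats summary:
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" \
  | jq '.pods[] | select(.podRef.name | startswith("node-agent"))
        | {pod: .podRef.name, usedBytes: .["ephemeral-storage"].usedBytes}'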

As mentioned in #7499 (comment), restore cache data will usually be deleted once it's older than 10 minutes. But in my case, too much data seems to be restored within those 10 minutes, so my nodes run out of ephemeral storage before the 10 minutes are up and the cache can be cleared. This behaviour does not occur when restoring smaller PVs containing around 10 GB worth of data (also Kopia).

I could increase my node's ephemeral storage, but I don't even know how much storage is required for my restore to succeed. So this is not a favourable method since a lot of trial and error would be involved and restore sizes can vary (depending on the PV) and/or increase over time. I would rather have more options when it comes to configuring the cache. It should be possible to either set a lower value than 10 minutes or to set a hard GB limit after which the cache is cleared. I wouldn't mind if this resulted in lower restore performance, since Kopia restores for larger PVs don't work for me at all right now.

I would greatly appreciate it if you could look into this!

@Lyndon-Li
Contributor Author

@u3813

But in my case, it seems that too much data is being restored within these 10 minutes

Under the current cache management policy, we only control how long cache data lives; we don't periodically check the size of the cache. How much data is cached therefore depends on the total size to be restored and the parallelism of the restore: the larger the restore and the higher the parallelism (by default decided by the number of CPU cores in the node), the more data will be cached.
Therefore, the final solution is to control the cache by its total size: if it exceeds a defined limit, delete the oldest data.
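As a rough sketch of that policy (purely illustrative, not Velero's implementation), with a hypothetical cache path and size budget:

# Illustrative size-based sweep: delete the oldest cache files until the
# directory fits under a budget. Path and limit are placeholders.
CACHE_DIR=/path/to/kopia/cache
LIMIT_BYTES=$((5 * 1024 * 1024 * 1024))   # 5 GiB
while [ "$(du -sb "$CACHE_DIR" | cut -f1)" -gt "$LIMIT_BYTES" ]; do
  oldest=$(find "$CACHE_DIR" -type f -printf '%T@ %p\n' | sort -n | head -n1 | cut -d' ' -f2-)
  [ -n "$oldest" ] || break
  rm -f "$oldest"
done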

I would rather have more options when it comes to configuring the cache. It should be possible to either set a lower value than 10 minutes or to set a hard GB limit after which the cache is cleared

This will be done in the fix for this issue.
Additionally, issue #7725 gives users another option: configuring a dedicated cache volume, so that restore performance will not be compromised.

@janpipan

We also had a problem with disk pressure, which we resolved by adding additional disk capacity, but this is not desirable since the ephemeral storage consumption is basically the same size as the PV being restored.

@Lyndon-Li

As the current cache management policy, we only control the living time of the cache data, we don't check the size of the cache in a period of time.

Was this sufficiently tested? I believe there is also a bug in the cache cleanup, because in our case it didn't execute as expected.

When we restored a PV with around 50 GB for the first time, the cache got cleaned up after around 10 minutes, but after the second restore the cleanup did not happen automatically. I left the node-agent Pod running for over a day, but the storage consumption on the worker node did not drop as expected, so I deleted the Pod manually today, which resolved the issue.
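For reference, that manual workaround boils down to recreating the Pod so its writable layer (which holds the cache by default) is discarded:

# Find the node-agent Pod on the affected node and delete it; the DaemonSet recreates it.
kubectl -n velero get pods -o wide | grep node-agent
kubectl -n velero delete pod <node-agent-pod-name>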

@reasonerjt
Contributor

Combining this with #7301, we should probably design a consistent way to let users configure the advanced settings for kopia.

But before that, we should consider adjusting the default policy to mitigate the risk of restore failure.

@mpryc
Contributor

mpryc commented Jun 25, 2024

If possible, could we start the design for this issue together with #7301? We have hit this problem, and it would be very helpful if we had the opportunity to configure kopia.

@Lyndon-Li
Contributor Author

@mpryc Could you comment on #7301 about why compression is important to your case?

@mpryc
Contributor

mpryc commented Jun 26, 2024

@Lyndon-Li For me compression is not important, but the design should cover generic advanced settings for kopia, as pointed out in #7620 (comment) above. That is why I mentioned it.

@Lyndon-Li
Contributor Author

@mpryc The design PR has been submitted; feel free to add comments on it.
