
Make kopia repo cache policy configurable #7620

Closed
Lyndon-Li opened this issue Apr 3, 2024 · 8 comments
Comments

@Lyndon-Li
Contributor

Lyndon-Li commented Apr 3, 2024

Related to issue #7499.
The cache policy determines the root file system disk usage in the pod where the data movement runs, and it also significantly impacts restore performance.
Therefore, instead of relying only on the hard-coded policy, we should expose some of the cache settings to users so that they can control the cache disk usage.
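For reference, standalone kopia already exposes knobs of this kind through its CLI. Velero embeds kopia as a library, so these commands don't apply to the node-agent directly, but they show roughly what could be surfaced (exact flags vary by kopia version):

# Standalone kopia CLI cache settings (illustrative only):
kopia cache info      # show cache location and current sizes
kopia cache set --content-cache-size-mb=2000 --metadata-cache-size-mb=2000
kopia cache clear     # drop the local cache entirely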

@u3813

u3813 commented Apr 25, 2024

Since #7499 has been closed, I'm going to comment here. I am experiencing the same behaviour as described in #7499 (comment) and large Kopia restores are not working for me currently. As a result, I had to switch back to Restic for the moment.

I'm using Velero v1.13.0 (vmware-tanzu/velero 6.0.0) to restore a PV that has around 600 GB worth of data in a single namespace. My nodes are running out of ephemeral storage during the restore:

# node-agent Pod
  Normal   Scheduled  52s   default-scheduler  Successfully assigned velero/node-agent-kmrpk to affeccf5-d3c9-402b-a739-13021a4b0668
  Warning  Evicted    53s   kubelet            The node had condition: [DiskPressure].

Using the command kubectl get --raw "/api/v1/nodes/affeccf5-d3c9-402b-a739-13021a4b0668/proxy/stats/summary" I was able to tell that the node-agent's ephemeral-storage usedBytes kept growing until my node's total allocatable ephemeral storage was reached (60584402435 allocatable). At this point the assigned node-agent Pod was killed, a new one was spawned and the restore failed.
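For anyone debugging the same thing, the per-pod usage can be pulled out of that same kubelet stats endpoint (assuming the usual summary layout and that jq is available):

# Per-pod ephemeral-storage usage of the node-agent from the kubelet stats summary:
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" \
  | jq '.pods[] | select(.podRef.name | startswith("node-agent"))
        | {pod: .podRef.name, usedBytes: .["ephemeral-storage"].usedBytes}'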

As mentioned in #7499 (comment), restore cache data will usually be deleted once it's older than 10 minutes. But in my case, too much data seems to be restored within those 10 minutes, so my nodes run out of ephemeral storage before the 10 minutes are up and the cache can be cleared. This behaviour does not occur when restoring smaller PVs containing around 10 GB worth of data (also Kopia).

I could increase my node's ephemeral storage, but I don't even know how much storage is required for my restore to succeed. So this is not a favourable method since a lot of trial and error would be involved and restore sizes can vary (depending on the PV) and/or increase over time. I would rather have more options when it comes to configuring the cache. It should be possible to either set a lower value than 10 minutes or to set a hard GB limit after which the cache is cleared. I wouldn't mind if this resulted in lower restore performance, since Kopia restores for larger PVs don't work for me at all right now.

I would greatly appreciate it if you could look into this!

@Lyndon-Li
Contributor Author

@u3813

But in my case, it seems that too much data is being restored within these 10 minutes

Under the current cache management policy, we only control how long cache data lives; we don't periodically check the size of the cache. How much data is cached therefore depends on the total size to be restored and the parallelism of the restore: the larger the restore and the higher the parallelism (by default decided by the number of CPU cores in the node), the more data will be cached.
Therefore, the final solution is to control the cache by its total size: if it exceeds a defined limit, delete the oldest data.
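As a rough sketch of that policy (purely illustrative, not Velero's implementation), with a hypothetical cache path and size budget:

# Illustrative size-based sweep: delete the oldest cache files until the
# directory fits under a budget. Path and limit are placeholders.
CACHE_DIR=/path/to/kopia/cache
LIMIT_BYTES=$((5 * 1024 * 1024 * 1024))   # 5 GiB
while [ "$(du -sb "$CACHE_DIR" | cut -f1)" -gt "$LIMIT_BYTES" ]; do
  oldest=$(find "$CACHE_DIR" -type f -printf '%T@ %p\n' | sort -n | head -n1 | cut -d' ' -f2-)
  [ -n "$oldest" ] || break
  rm -f "$oldest"
done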

I would rather have more options when it comes to configuring the cache. It should be possible to either set a lower value than 10 minutes or to set a hard GB limit after which the cache is cleared

This will be done in the fix for this issue.
Additionally, issue #7725 gives users another option: configuring a dedicated cache volume, so that restore performance will not be compromised.

@janpipan

We also had a problem with disk pressure, which we resolved by adding additional disk capacity, but this is not desirable since the ephemeral storage consumption is basically the same size as the PV being restored.

@Lyndon-Li

As the current cache management policy, we only control the living time of the cache data, we don't check the size of the cache in a period of time.

Was this sufficiently tested? I believe there is also a bug in the cache cleanup, because in our case it didn't execute as expected.

When we restored a PV with around 50 GB for the first time, the cache got cleaned up after around 10 minutes, but after the second restore the cleanup did not happen automatically. I left the node-agent Pod running for over a day, but the storage consumption on the worker node did not drop as expected, so I deleted the Pod manually today, which resolved the issue.
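For reference, that manual workaround boils down to recreating the Pod so its writable layer (which holds the cache by default) is discarded:

# Find the node-agent Pod on the affected node and delete it; the DaemonSet recreates it.
kubectl -n velero get pods -o wide | grep node-agent
kubectl -n velero delete pod <node-agent-pod-name>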

@reasonerjt
Contributor

Combining this with #7301, we should probably design a consistent way to let users configure the advanced settings for kopia.

But before that, we should consider adjusting the default policy to mitigate the risk of restore failure.

@mpryc
Contributor

mpryc commented Jun 25, 2024

If possible, could we start the design for this issue together with #7301? We have hit this problem, and it would be very helpful if we had the opportunity to configure kopia.

@Lyndon-Li
Contributor Author

@mpryc Could you comment on #7301 about why compression is important to your case?

@mpryc
Contributor

mpryc commented Jun 26, 2024

@Lyndon-Li For me compression is not important, but the design should cover generic advanced settings for kopia, as pointed out in #7620 (comment) above. That is why I mentioned it.

@Lyndon-Li
Contributor Author

@mpryc The design PR has been submitted; feel free to add comments on it.
