Make kopia repo cache policy configurable #7620
Since #7499 has been closed, I'm going to comment here. I am experiencing the same behaviour as described in #7499 (comment) and large Kopia restores are not working for me currently. As a result, I had to switch back to Restic for the moment. I'm using Velero
Using the command

As mentioned in #7499 (comment), restore cache data will usually be deleted once it is older than 10 minutes. In my case, however, it seems that too much data is being restored within those 10 minutes, so my nodes run out of ephemeral storage before the 10 minutes are reached and the cache can be cleared. This behaviour does not occur when restoring smaller PVs containing around 10 GB of data (also with Kopia).

I could increase my nodes' ephemeral storage, but I don't even know how much storage my restore needs to succeed. That is not a favourable approach, since a lot of trial and error would be involved, and restore sizes vary per PV and/or can increase over time. I would rather have more options for configuring the cache: it should be possible to either set a value lower than 10 minutes or to set a hard GB limit after which the cache is cleared. I wouldn't mind if this resulted in lower restore performance, since Kopia restores for larger PVs don't work for me at all right now.

I would greatly appreciate it if you could look into this!
Under the current cache management policy, we only control the lifetime of the cache data; we don't periodically check the cache size. So how much data is cached depends on the total size to be restored and the parallelism of the restore: the larger the restore size and the higher the parallelism (by default decided by the number of CPU cores on the node), the more data will be cached.

This will be addressed in the fix for this issue.
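For illustration, here is a minimal sketch in Go of what a sweep combining the existing 10-minute age rule with the hard size limit requested above could look like. This is not Velero or Kopia code; the function names, directory path, and limits are hypothetical.

```go
// Minimal sketch: sweep a cache directory by age and by a hard size cap.
package main

import (
	"os"
	"path/filepath"
	"sort"
	"time"
)

type cacheEntry struct {
	path    string
	size    int64
	modTime time.Time
}

// sweepCache removes files older than maxAge, and if the remaining cache is
// still larger than maxTotalBytes, keeps deleting oldest-first until it fits.
func sweepCache(cacheDir string, maxAge time.Duration, maxTotalBytes int64) error {
	var entries []cacheEntry
	var total int64

	err := filepath.Walk(cacheDir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		entries = append(entries, cacheEntry{path, info.Size(), info.ModTime()})
		total += info.Size()
		return nil
	})
	if err != nil {
		return err
	}

	// Oldest entries first, so both rules drop the least recently written data.
	sort.Slice(entries, func(i, j int) bool { return entries[i].modTime.Before(entries[j].modTime) })

	cutoff := time.Now().Add(-maxAge)
	for _, e := range entries {
		expired := e.modTime.Before(cutoff)
		overLimit := total > maxTotalBytes
		if !expired && !overLimit {
			break // remaining entries are newer and the cache already fits
		}
		if err := os.Remove(e.path); err == nil {
			total -= e.size
		}
	}
	return nil
}

func main() {
	// Example: 10-minute TTL plus a 20 GiB hard cap on the restore cache.
	_ = sweepCache("/tmp/kopia-cache", 10*time.Minute, 20<<30)
}
```

Because deletion is oldest-first, a hard cap would degrade restore performance gracefully rather than letting the node run out of ephemeral storage.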
We also had a problem with disk pressure, which we resolved by adding disk capacity, but I don't think that is desirable, as the ephemeral storage consumption is basically the same as the size of the PV being restored.
Was this sufficiently tested? I believe there is also a bug in the cache cleanup, because in our case it didn't run as expected. When we restored a PV of around 50 GB for the first time, the cache got cleaned up after around 10 minutes, but after the second restore the cleanup did not happen automatically. I left the node-agent Pod running for over a day, but the storage consumption on the worker node did not drop as expected, so I deleted the Pod manually today, which resolved the issue.
Together with #7301, we should probably design a consistent way to let users configure the advanced settings for Kopia. But before that, we should consider adjusting the default policy to mitigate the risk of restore failure.
If possible, could we start the design for this issue together with #7301? We have hit this problem, and it would be very helpful if we had the opportunity to configure Kopia.
@Lyndon-Li For me compression is not important, but the design should cover generic advanced settings for Kopia, as pointed out in the #7620 (comment) comment. That is why I mentioned it.
@mpryc The design PR is submitted; feel free to add comments on it.
Related to issue #7499.
The cache policy determines the root file system disk usage in the pod where the data movement is running, and it also significantly impacts restore performance.
Therefore, besides the hard-coded policy, we should expose some of these settings to users so that users can decide the cache disk usage.
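As a rough illustration of the kind of configuration surface being discussed, the sketch below shows how a node-agent could overlay user-provided values on top of the current defaults. The keys (`cacheTTL`, `cacheLimitMB`), struct fields, and defaults are hypothetical, not an actual Velero API; the real option names would come from the design PR mentioned above.

```go
// Sketch of a hypothetical user-facing cache policy read from ConfigMap-style data.
package main

import (
	"fmt"
	"strconv"
	"time"
)

type cachePolicy struct {
	// How long restored cache data may live before it is swept.
	DataTTL time.Duration
	// Hard upper bound on the cache size; 0 means no size limit.
	LimitBytes int64
}

func defaultCachePolicy() cachePolicy {
	return cachePolicy{DataTTL: 10 * time.Minute, LimitBytes: 0}
}

// cachePolicyFromConfig overlays user-provided values on top of the defaults.
func cachePolicyFromConfig(data map[string]string) (cachePolicy, error) {
	p := defaultCachePolicy()
	if v, ok := data["cacheTTL"]; ok {
		d, err := time.ParseDuration(v)
		if err != nil {
			return p, fmt.Errorf("invalid cacheTTL %q: %w", v, err)
		}
		p.DataTTL = d
	}
	if v, ok := data["cacheLimitMB"]; ok {
		mb, err := strconv.ParseInt(v, 10, 64)
		if err != nil {
			return p, fmt.Errorf("invalid cacheLimitMB %q: %w", v, err)
		}
		p.LimitBytes = mb << 20
	}
	return p, nil
}

func main() {
	// Example: a 5-minute TTL and a 20 GiB hard limit supplied by the user.
	p, err := cachePolicyFromConfig(map[string]string{"cacheTTL": "5m", "cacheLimitMB": "20480"})
	fmt.Println(p, err)
}
```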