Do something about cache housekeeping? #6956

Open
pfmoore opened this issue Aug 31, 2019 · 10 comments
Labels
C: cache Dealing with cache and files in it state: needs discussion This needs some more discussion type: feature request Request for a new feature

Comments

@pfmoore
Member

pfmoore commented Aug 31, 2019

What's the problem this feature will solve?
Pip stores data in its cache, but never clears out obsolete data.

Describe the solution you'd like
Some means (automated or manual) for no-longer-needed cache entries to be cleared out.

Alternative Solutions
It's possible to just delete the cache altogether, as there is nothing in there that won't be recreated as needed, but this is an "all or nothing" solution.

Additional context
Prompted by the discussion here

@chrahunt
Member

chrahunt commented Sep 2, 2019

Related to #3138.

@pradyunsg
Member

Also related to #4685, since that could be how you perform this.

@pfmoore pfmoore changed the title Do something about cache hosekeeping? Do something about cache housekeeping? Sep 2, 2019
@chrahunt chrahunt added C: cache Dealing with cache and files in it state: needs discussion This needs some more discussion type: feature request Request for a new feature labels Sep 2, 2019
@triage-new-issues triage-new-issues bot removed the S: needs triage Issues/PRs that need to be triaged label Sep 2, 2019
@gutsytechster
Contributor

Is this still relevant now that we finally have a pip cache command with some useful subcommands like purge and remove?

@duckinator
Contributor

Given pip cache as it is and #7372 (having it handle more than wheels), I think the main thing left for this issue is the possibility of having automatic cache cleanup. Is that correct, @pfmoore, or do you have other ideas?

@pfmoore
Member Author

pfmoore commented Oct 18, 2020

Agreed, and as discussed in #8474, I'm ambivalent at best over the idea of an automatic cleanup. So I'm going to close this as complete, and leave automatic cleanup as something for someone else to raise if they feel like it.

Thanks for your work in making this happen @duckinator!

@pfmoore
Member Author

pfmoore commented Oct 18, 2020

Reopening, as I realised via this discussion that this issue was originally triggered by the question of tidying up outdated selfcheck files.

IMO we still need an automated solution for clearing up obsolete selfcheck files. I'd expect that to be something along the lines of whenever we do a selfcheck, we check all the other selfcheck files and delete any that refer to directories that no longer exist. I don't think it should be down to the user to run a purge, nor do I think we should leave files for non-existent environments indefinitely.
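
A minimal sketch of the cleanup described above. It assumes each selfcheck file is a JSON object whose `key` field records the environment prefix it was written for; that layout and the function name `prune_stale_selfcheck_files` are illustrative assumptions, not a statement of how pip implements it.

```python
import json
from pathlib import Path
from typing import List


def prune_stale_selfcheck_files(selfcheck_dir: Path) -> List[Path]:
    """Delete selfcheck files whose recorded environment no longer exists.

    Assumption: every file is a JSON object with a "key" field holding the
    environment prefix. Files that cannot be read or parsed are left alone.
    """
    removed: List[Path] = []
    if not selfcheck_dir.is_dir():
        return removed
    for entry in selfcheck_dir.iterdir():
        if not entry.is_file():
            continue
        try:
            state = json.loads(entry.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            # Unreadable or non-JSON file: do not guess, keep it.
            continue
        prefix = state.get("key") if isinstance(state, dict) else None
        if isinstance(prefix, str) and not Path(prefix).exists():
            try:
                entry.unlink()
            except OSError:
                continue
            removed.append(entry)
    return removed
```

Calling this at the point where a selfcheck is already being performed would keep the cost to one small directory scan, and it never touches files for environments that still exist.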

@hugovk

This comment has been minimized.

@pfmoore pfmoore reopened this Oct 18, 2020
@pfmoore
Member Author

pfmoore commented Oct 18, 2020

Bah. Long day. Thanks for letting me know!

@itamarst
Contributor

itamarst commented May 2, 2022

#2984 is going to make this worse, by potentially having two copies of the same file in the cache.

Here is a proposal for a solution:

No more than once a day (to prevent performance impacts), after installing a package (so cache information is accurate), clear old entries from the cache.

Two potential strategies for deciding which entries to delete and how many:

  1. Date-based: Find all cached entries with access time (atime on Unix) more than 90 days in the past, delete them.
  2. Size-based: Try to keep the cache below e.g. 5GB. Delete entries until that size is reached, from oldest access time to newest.

Given that data science packages can be huge, and that disk space is cheap, I would suggest that the date-based strategy is probably better.
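
To make the proposal concrete, here is a rough sketch of both strategies plus the once-a-day throttle. All names (`evict_by_age`, `evict_by_size`, `should_run`, the stamp file) are hypothetical and nothing here is existing pip behaviour. It treats every file under the cache directory as one entry, which is a simplification. Note also that access times are not reliably updated on every system (for example on `noatime` mounts), so a real implementation might need a different recency signal.

```python
import os
import time
from pathlib import Path
from typing import Iterator, List, Optional, Tuple

DAY = 24 * 60 * 60  # seconds


def _cached_files(cache_dir: Path) -> Iterator[Tuple[Path, os.stat_result]]:
    """Yield (path, stat) for every regular file under cache_dir."""
    for root, _dirs, names in os.walk(cache_dir):
        for name in names:
            path = Path(root) / name
            try:
                yield path, path.stat()
            except OSError:
                continue


def should_run(stamp: Path, now: Optional[float] = None,
               min_interval: float = DAY) -> bool:
    """Return True at most once per min_interval, using a stamp file's mtime."""
    now = time.time() if now is None else now
    try:
        last = stamp.stat().st_mtime
    except OSError:
        last = 0.0  # no stamp yet: cleanup has never run
    if now - last < min_interval:
        return False
    stamp.parent.mkdir(parents=True, exist_ok=True)
    stamp.touch()
    os.utime(stamp, (now, now))
    return True


def evict_by_age(cache_dir: Path, max_age_days: float = 90,
                 now: Optional[float] = None) -> List[Path]:
    """Date-based: delete files last accessed more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * DAY
    removed = []
    for path, st in list(_cached_files(cache_dir)):
        if st.st_atime < cutoff:
            path.unlink(missing_ok=True)
            removed.append(path)
    return removed


def evict_by_size(cache_dir: Path, max_bytes: int = 5 * 1024**3) -> List[Path]:
    """Size-based: delete oldest-accessed files until total size <= max_bytes."""
    files = sorted(_cached_files(cache_dir), key=lambda item: item[1].st_atime)
    total = sum(st.st_size for _path, st in files)
    removed = []
    for path, st in files:
        if total <= max_bytes:
            break
        path.unlink(missing_ok=True)
        total -= st.st_size
        removed.append(path)
    return removed
```

After a successful install, the caller would do something like `if should_run(stamp): evict_by_age(cache_dir)`. The stamp file should live outside the tree being cleaned so that it is not counted as a cache entry.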

@pooh22

pooh22 commented Sep 11, 2023

As a sysadmin at a university, I regularly get emails from students who can't figure out why they are receiving over-quota emails. Most of the time it's their pip cache filling up the initial 5 GB of storage they get. It would be very nice if pip respected a (global) setting that caps the cache size. It should be checked every time the pip command runs and enforced either automatically or via a clear suggestion to the user. The all-or-nothing option of purge seems like a poor way of managing the cache.
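
A sketch of the "clear suggestion" variant of that check, under the assumption that the limit arrives as a plain byte count. The `max_bytes` parameter and both function names are hypothetical; pip has no such setting today as far as this thread shows.

```python
import os
from pathlib import Path
from typing import Optional


def cache_size_bytes(cache_dir: Path) -> int:
    """Sum the sizes of all files under cache_dir, skipping unreadable ones."""
    total = 0
    for root, _dirs, names in os.walk(cache_dir):
        for name in names:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                continue
    return total


def cache_limit_warning(cache_dir: Path, max_bytes: int) -> Optional[str]:
    """Return a warning string if the cache exceeds max_bytes, else None."""
    used = cache_size_bytes(cache_dir)
    if used <= max_bytes:
        return None
    mib = 1024**2
    return (
        f"pip cache at {cache_dir} uses {used / mib:.1f} MiB, "
        f"above the configured limit of {max_bytes / mib:.1f} MiB; "
        "consider running 'pip cache purge'."
    )
```

Walking the whole cache on every invocation may be too slow for large caches, so a real implementation would probably want to throttle or memoize the size computation.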
