pip install does not support 303 (See Other) as valid cacheable response #8489

Closed
shadargee1982 opened this issue Jun 23, 2020 · 9 comments
Labels
C: cache Dealing with cache and files in it

Comments

@shadargee1982

Environment

  • pip version: 20.1.1
  • Python version: Python 3.6.10 x64
  • OS: Ubuntu 18.04.4 LTS

Description
We have a private package host that provides the capability to install Python packages. Along the way, the download URL returns a 303 to the client with a signed URL, which is then used as the URL for the download. Because pip does not cache against the original download URL, there is always a cache miss in pip, resulting in longer build times.

Expected behavior
pip's caching happens at a per-http call level. In other words, pip's cache is Dictionary<Uri,Response>. If pip's cache were a Dictionary<PackageIdentity,byte[]>, then this problem would not occur.
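To illustrate the difference between the two cache shapes described above, here is a minimal sketch. It is not pip's code: `PackageIdentity`, `install`, and the example URLs are hypothetical, and it assumes the signed URL differs on every request.

```python
from typing import Dict, NamedTuple, Tuple


class PackageIdentity(NamedTuple):
    # Hypothetical key describing what the file is, not where it came from.
    name: str
    version: str
    filename: str


# Shape 1: keyed by the URL that served the bytes (Dictionary<Uri,Response>).
url_cache: Dict[str, bytes] = {}
# Shape 2: keyed by package identity (Dictionary<PackageIdentity,byte[]>).
identity_cache: Dict[PackageIdentity, bytes] = {}


def install(signed_url: str, identity: PackageIdentity, payload: bytes) -> Tuple[bool, bool]:
    """Record one download in both caches and report (url_hit, identity_hit)."""
    url_hit = signed_url in url_cache
    identity_hit = identity in identity_cache
    url_cache[signed_url] = payload
    identity_cache[identity] = payload
    return url_hit, identity_hit


numpy_wheel = PackageIdentity(
    "numpy", "1.18.1", "numpy-1.18.1-cp36-cp36m-manylinux1_x86_64.whl"
)
# Assumption: each request is redirected to a freshly signed URL.
first = install("https://redirected.example/blob?sig=aaa", numpy_wheel, b"wheel-bytes")
second = install("https://redirected.example/blob?sig=bbb", numpy_wheel, b"wheel-bytes")
print(first, second)
```

Under that assumption, the URL-keyed cache misses on both installs, while the identity-keyed cache hits on the second one.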

How to Reproduce
requirements.txt contains a single entry, numpy==1.18.1
python -m pip install --upgrade pip wheel && pip install -v -r requirements.txt

Output

Created temporary directory: /tmp/pip-unpack-7o8lnaf3 

 Looking up "https://<host>/<downloadURL>/pypi/download/numpy/1.18.1/numpy-1.18.1-cp36-cp36m-manylinux1_x86_64.whl" in the cache 

 No cache entry available 

 https://<host> "GET /<downloadURL>/pypi/download/numpy/1.18.1/numpy-1.18.1-cp36-cp36m-manylinux1_x86_64.whl HTTP/1.1" 303 0 

 Status code 303 not in (200, 203, 300, 301) 

 Looking up "https://<redirectedHost>/<signedURL>" in the cache 

 No cache entry available 

 Starting new HTTPS connection (1): <redirectedHost> 

 https://<redirectedHost> "GET /<signedURL> HTTP/1.1" 200 20143300 

 Downloading https://<host>/<downloadURL>/pypi/download/numpy/1.18.1/numpy-1.18.1-cp36-cp36m-manylinux1_x86_64.whl (20.1 MB) 

 Updating cache with response from "https://<redirectedHost>/<downloadURL>" 

 Caching due to etag 

 Added numpy==1.18.1 from https://<host>/<downloadURL>/pypi/download/numpy/1.18.1/numpy-1.18.1-cp36-cp36m-manylinux1_x86_64.whl#sha256=b765ed3930b92812aa698a455847141869ef755a87e099fddd4ccf9d81fffb57 (from -r requirements.txt (line 1)) to build tracker '/tmp/pip-req-tracker-p3qwvlzb' 

 Removed numpy==1.18.1 from https://<host>/<downloadURL>/pypi/download/numpy/1.18.1/numpy-1.18.1-cp36-cp36m-manylinux1_x86_64.whl#sha256=b765ed3930b92812aa698a455847141869ef755a87e099fddd4ccf9d81fffb57 (from -r requirements.txt (line 1)) from build tracker '/tmp/pip-req-tracker-p3qwvlzb' 

Installing collected packages: numpy
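The log line "Status code 303 not in (200, 203, 300, 301)" suggests that a response is only considered for caching when its status code is in a fixed list. The following is a simplified sketch of that kind of gate, not the actual caching code: the tuple is copied from the log, while `is_status_eligible`, `trace`, and the example URLs are hypothetical.

```python
from typing import List, Tuple

# Copied from the debug log above; treated here as a plain constant.
CACHEABLE_STATUS_CODES = (200, 203, 300, 301)


def is_status_eligible(status_code: int) -> bool:
    """Return True if a response with this status may be considered for caching."""
    return status_code in CACHEABLE_STATUS_CODES


def trace(hops: List[Tuple[str, int]]) -> List[Tuple[str, int, bool]]:
    """Annotate each (url, status) hop with whether it passes the status gate."""
    return [(url, status, is_status_eligible(status)) for url, status in hops]


hops = trace([
    ("https://host.example/download/numpy-1.18.1.whl", 303),  # redirect hop
    ("https://redirected.example/signed", 200),               # file hop
])
for url, status, eligible in hops:
    print(status, eligible, url)
```

In this sketch the 303 hop is never stored, so only the redirect target can be cached, and it is stored under the target's own URL.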
@triage-new-issues triage-new-issues bot added the S: needs triage Issues/PRs that need to be triaged label Jun 23, 2020
@SailingYYC

This is a significant blocker for our teams. What would be required to get this pushed through?

@uranusjr
Member

AFAIK, 303 is generally treated as a temporary redirect in practice, and the client is supposed to query the original URL again next time. So IMO pip is correct not to cache 303.

@SailingYYC

From a purely technical perspective, yes, but I believe in this instance their implementation uses the 303 to handle authentication. The authenticated URL is 303-redirected to the URL that serves the actual package, so a temporary redirect is a viable use of a 303 that ultimately returns the exact same package. Each user requesting the package would be doing so under a different set of credentials.

@uranusjr
Member

I’m likely missing something here. So the authentication system returns a 303 that points to the actual package, which pip would then request for download. Wouldn’t pip be able to cache that request to the actual package instead, assuming it returns a cacheable response?

@SailingYYC

For context, this relates to Azure DevOps' implementation of a Python artifact (PyPI-style) feed.

I'm hoping the individual who debugged the code replies, but it is my understanding that the initial request is performed with authentication parameters, and the 303 is then issued to point to a URL where the actual package resides.

Thus, once pip has pulled all the packages from the Artifacts feed into a local cache dir, a later pull of the same package from the same feed always re-downloads it, even though it is present on disk, because the pip code sees a 303 redirect as an immediate cache miss.

@uranusjr
Member

> because the pip code sees a 303 redirect as an immediate cache miss.

Ah, that would make sense, thanks. Let’s hope @shadargee1982 replies. In the meantime I’ll put this on my backlog and look into it when I get the time.

@dstufft
Member

dstufft commented Jul 30, 2020

pip's caching works at the HTTP layer, so it should not cache the 303 response, but it should still be able to cache the URL that the 303 redirects to, assuming that URL is cacheable at all at the HTTP layer. It would be useful if someone could do a HEAD request against the actual file URL and see if there are any cache control headers.
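One way to run the suggested check is sketched below using only the Python standard library. The helper names `pick_cache_headers` and `head_cache_headers` are hypothetical, the list of headers worth inspecting is an assumption, and the sample header values are made up for illustration. The URL passed in should be the final file URL (the redirect target), which this thread does not show.

```python
import urllib.request
from typing import Dict, Mapping

# Assumption: these are the response headers most relevant to HTTP caching.
CACHE_HEADERS = ("cache-control", "etag", "expires", "last-modified", "date")


def pick_cache_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Return only the caching-related headers, with lower-cased names."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {name: lowered[name] for name in CACHE_HEADERS if name in lowered}


def head_cache_headers(url: str, timeout: float = 10.0) -> Dict[str, str]:
    """Send a HEAD request to url and return its caching-related headers."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return pick_cache_headers(dict(response.headers.items()))


# Offline example with made-up header values (no network access needed).
sample = {
    "ETag": '"abc123"',
    "Content-Length": "20143300",
    "Cache-Control": "max-age=3600",
}
print(pick_cache_headers(sample))
```

Calling `head_cache_headers("<final file URL>")` against the real feed would show whether the file response carries `Cache-Control`, `ETag`, or similar headers.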

@v-kyljon

@shadargee1982 any thoughts on the conversations above?

@pradyunsg pradyunsg added C: cache Dealing with cache and files in it S: awaiting response Waiting for a response/more information labels Jul 30, 2020
@triage-new-issues triage-new-issues bot removed the S: needs triage Issues/PRs that need to be triaged label Jul 30, 2020
@no-response

no-response bot commented Aug 14, 2020

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.

@no-response no-response bot closed this as completed Aug 14, 2020
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 12, 2021
@pradyunsg pradyunsg removed the S: awaiting response Waiting for a response/more information label Mar 17, 2023
6 participants