Cache responses on the "other side" of redirects with status code 303 #10694

Open
1 task done
zooba opened this issue Nov 30, 2021 · 30 comments
Labels
C: cache (Dealing with cache and files in it) · help wanted (For requesting inputs from other members of the community) · type: enhancement (Improvements to functionality)

Comments

@zooba
Contributor

zooba commented Nov 30, 2021

Description

See #8489 (cannot comment there because the bot locked it)

Expected behavior

See #8489

pip version

Latest

Python version

any

OS

any

How to Reproduce

See #8489

Output

No response

Code of Conduct

@zooba zooba added S: needs triage Issues/PRs that need to be triaged type: bug A confirmed bug or unintended behavior labels Nov 30, 2021
@zooba
Contributor Author

zooba commented Nov 30, 2021

Okay, now this is effectively "reopen" (feel free to transfer this comment to #8489 and just reopen it if you like):

I have a private conversation going on with the Azure DevOps team about this issue, and they're using the 303 response to hide the fact that the target URL is not cacheable (because of query parameters). However, it is intentional that subsequent requests should come to the original URL and treat it as cached [1].

The intent on their side is that the content should be cached, but not the redirect. That is, it'll always be the same file at the end, it'll just come from another source/mirror. There don't appear to be any 30x codes that specifically represent this, but maybe someone is aware of some other precedent?

(FWIW, the Azure DevOps implementation here probably uses the same headers for their other package manager implementations, and may actually be the best precedent there is. But if we can at least agree on a code/header to indicate that pip can cache the resulting data against that URL, even though the final URL may change over time, we can make changes upstream to work with that.)

Footnotes

  1. Apart from it having a bad Expires header right now... so it can cache once that is fixed ;)

@zooba
Contributor Author

zooba commented Nov 30, 2021

Also somewhat related is #10075, which it turns out only mattered because the 303 was being followed and the final URL used for caching instead of the original one.

If I hack in caching for 303s (and bypass the currently not-okay headers), then that issue becomes redundant.

@Ivoz
Contributor

Ivoz commented Jan 4, 2022

I think dstufft's comment that it would be useful to see the headers for both/all responses from the server is still pertinent.

@pradyunsg
Member

This needs the following information: #8489 (comment)

@pradyunsg pradyunsg added S: awaiting response Waiting for a response/more information C: cache Dealing with cache and files in it and removed S: needs triage Issues/PRs that need to be triaged type: bug A confirmed bug or unintended behavior labels Jan 4, 2022
@zooba
Contributor Author

zooba commented Jan 8, 2022

I can get the headers, but they don't really matter. The response currently being checked is the one from the redirected URL, which expires within minutes and is not going to be returned by subsequent 303s.

Basically, pip ought to cache packages against the URL returned by the index, not by where that URL may eventually direct to (e.g. a local CDN endpoint). That way an index can consistently return the same URL when it knows the package is the same, rather than having that control taken away.

If there's a particular header that could be used to signal that this is a canonical URL, we can add it. However, the existing Cache-Control options seem to sufficiently cover the case where an index wants to return a stable URL that should not be cached, because they want to return different files at different times from the same URL. There don't seem to be any good options for indicating that the redirecting URL is canonical for caching content (as opposed to caching the redirect location), so we're a little bit stuck.
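To make the proposed keying concrete, here is a minimal sketch, not pip's actual code. The cache is indexed by the URL the index handed out, and the redirect is only followed on a miss. `OriginKeyedCache`, the injected `fetch` callable, and the `example.invalid` URLs are all hypothetical names used for illustration.

```python
from typing import Callable, Dict, Tuple

# Hypothetical fetcher: takes the index-provided URL, follows any redirects,
# and returns (final_url, body). A real client would do HTTP here.
Fetch = Callable[[str], Tuple[str, bytes]]


class OriginKeyedCache:
    """Cache bodies under the URL the index returned, not the redirect target."""

    def __init__(self, fetch: Fetch) -> None:
        self._fetch = fetch
        self._store: Dict[str, bytes] = {}
        self.network_calls = 0

    def get(self, index_url: str) -> bytes:
        if index_url in self._store:
            return self._store[index_url]
        self.network_calls += 1
        _final_url, body = self._fetch(index_url)
        # The redirect itself is not stored, only the content it led to.
        self._store[index_url] = body
        return body


def fake_fetch_factory() -> Fetch:
    """Simulate an index whose 303 points at a fresh signed URL every time."""
    counter = {"n": 0}

    def fetch(index_url: str) -> Tuple[str, bytes]:
        counter["n"] += 1
        signed = f"https://blobs.example.invalid/file?sig={counter['n']}"
        return signed, b"wheel-bytes"

    return fetch


cache = OriginKeyedCache(fake_fetch_factory())
stable_url = "https://index.example.invalid/download/pip-21.3.1-py3-none-any.whl"
first = cache.get(stable_url)
second = cache.get(stable_url)  # served from the cache, no second fetch
```

The point of the sketch is only that the final URL never becomes part of the key, so it can change between requests without invalidating the entry.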

@no-response no-response bot removed the S: awaiting response Waiting for a response/more information label Jan 8, 2022
@uranusjr
Member

uranusjr commented Jan 8, 2022

IIRC the cache behaviour is inherited from cachecontrol so it may make sense to loop in its maintainers. (First someone needs to check and make sure it is indeed the case, of course.)

@pradyunsg
Member

FWIW, it would also be useful for whomever maintains these bits of software, to have a public reproducer for this.

@zooba
Contributor Author

zooba commented Jan 10, 2022

It definitely comes from cachecontrol, but the option to cache redirects is part of their public API. And I'm proposing that pip use its additional knowledge about the semantics of the requests it is making to change its behaviour, not proposing that the behaviour should change generally for all 303 redirects (though I'm not opposed to a global status code that works this way).

it would also be useful for whomever maintains these bits of software, to have a public reproducer for this.

You mean like a test index that 303 redirects file links? This one will do it (it only has a pip package on there right now): https://pkgs.dev.azure.com/Python/cpython/_packaging/TestFeed%40Release/pypi/simple/

Append pip to the URL and you get this response (newlines added):

<html>
<head><title>Links for pip</title></head>
<body>
<h1>Links for pip</h1>
<a href="https://pkgs.dev.azure.com/Python/8e426817-76c0-4b99-ba9e-a48a1e4bd5db/_packaging/ad28b313-ee18-4ca2-912f-58714a0d2a78@c804ff44-3b35-4e5d-b661-69d5809c7788/pypi/download/pip/21.3.1/pip-21.3.1-py3-none-any.whl#sha256=deaf32dcd9ab821e359cd8330786bcd077604b5c5730c0b096eda46f95c24a2d" data-requires-python="&gt;=3.6">pip-21.3.1-py3-none-any.whl</a>
<br/>
</body>
</html>

The file URL here is perfectly static, but it returns 303 to point to a temporary, authenticated download link. The 303 response itself should not be cached, because it only contains a temporary URL. The content at that URL is cacheable; it just won't be available from the redirected URL in the future (it remains accessible at the URL provided in the index data). There's no existing status code or cache control directive that adequately captures these semantics, but I also see no reason why pip can't cache the final response at the initial URL (as well as at the final one, if you want).
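As an illustration of what a client sees in such a page, the following sketch pulls the stable file URL and the `sha256` fragment out of each link using only the standard library. The `LinkCollector` class and the shortened `example.invalid` page are assumptions made for the example; pip's real link parsing is more involved.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple
from urllib.parse import urldefrag


class LinkCollector(HTMLParser):
    """Collect (stable_url, sha256_or_None) pairs from a project page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, Optional[str]]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Split "url#sha256=..." into the URL and its fragment.
        url, fragment = urldefrag(href)
        prefix = "sha256="
        digest = fragment[len(prefix):] if fragment.startswith(prefix) else None
        self.links.append((url, digest))


# Made-up page in the same shape as the response above.
PAGE = """
<html><body>
<a href="https://index.example.invalid/pypi/download/pip/21.3.1/pip-21.3.1-py3-none-any.whl#sha256=deaf32dcd9ab821e359cd8330786bcd077604b5c5730c0b096eda46f95c24a2d"
   data-requires-python="&gt;=3.6">pip-21.3.1-py3-none-any.whl</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(PAGE)
stable_url, sha256 = collector.links[0]
print(stable_url)
print(sha256)
```

Both values are known before any download starts, which is why either of them could in principle serve as a cache key that survives the redirect.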

@pfmoore
Member

pfmoore commented Jan 10, 2022

I tried that URL (with curl -I, not with pip) and I got a 405 response, not a 303...

@zooba
Contributor Author

zooba commented Jan 10, 2022

The URL that returns the 303 is https://pkgs.dev.azure.com/Python/8e426817-76c0-4b99-ba9e-a48a1e4bd5db/_packaging/ad28b313-ee18-4ca2-912f-58714a0d2a78@c804ff44-3b35-4e5d-b661-69d5809c7788/pypi/download/pip/21.3.1/pip-21.3.1-py3-none-any.whl#sha256=deaf32dcd9ab821e359cd8330786bcd077604b5c5730c0b096eda46f95c24a2d

The index URL doesn't return anything until you specify a package. Try https://pkgs.dev.azure.com/Python/cpython/_packaging/TestFeed%40Release/pypi/simple/pip/

@pfmoore
Member

pfmoore commented Jan 10, 2022

curl -I https://pkgs.dev.azure.com/Python/8e426817-76c0-4b99-ba9e-a48a1e4bd5db/_packaging/ad28b313-ee18-4ca2-912f-58714a0d2a78@c804ff44-3b35-4e5d-b661-69d5809c7788/pypi/download/pip/21.3.1/pip-21.3.1-py3-none-any.whl#sha256=deaf32dcd9ab821e359cd8330786bcd077604b5c5730c0b096eda46f95c24a2d
HTTP/1.1 200 Connection Established
Proxy-Agent: Zscaler/6.1

HTTP/1.1 405 Method Not Allowed
Cache-Control: no-cache
Pragma: no-cache
Allow: GET
Content-Length: 93
Content-Type: application/json; charset=utf-8
Expires: -1
P3P: CP="CAO DSP COR ADMa DEV CONo TELo CUR PSA PSD TAI IVDo OUR SAMi BUS DEM NAV STA UNI COM INT PHY ONL FIN PUR LOC CNT"
X-TFS-ProcessId: f727d422-f9f8-4929-9961-1992fef68238
Strict-Transport-Security: max-age=31536000; includeSubDomains
ActivityId: c422ece6-0e95-44ab-8dea-d8852b8cb61a
X-TFS-Session: c422ece6-0e95-44ab-8dea-d8852b8cb61a
X-VSS-E2EID: c422ece6-0e95-44ab-8dea-d8852b8cb61a
X-VSS-SenderDeploymentId: c9342659-5d46-33a5-295b-de367e0464e7
X-TFS-FedAuthRealm: https://pkgsprodcus1.pkgs.visualstudio.com/
X-TFS-FedAuthIssuer: https://www.visualstudio.com/
X-VSS-AuthorizationEndpoint: https://vssps.dev.azure.com/Python/
X-VSS-ResourceTenant: 00000000-0000-0000-0000-000000000000
X-FRAME-OPTIONS: SAMEORIGIN
Request-Context: appId=cid-v1:540f64bd-7388-47ab-bdf2-a94451f9dd55
Access-Control-Expose-Headers: Request-Context
X-Content-Type-Options: nosniff
X-Cache: CONFIG_NOCACHE
X-MSEdge-Ref: Ref A: 90B74EBCC16B4D2396A7520AB5D51DB6 Ref B: MAN30EDGE0414 Ref C: 2022-01-10T16:55:53Z
Date: Mon, 10 Jan 2022 16:55:54 GMT

When @pradyunsg said there should be a public reproducer for this, I think what we need is a script that can be run by anyone with a Python installation, that clearly demonstrates the issue without needing any external setup/knowledge. If I can't get the URLs you're providing to work just for a sanity check, they won't be much help to whoever tries to work out what needs to be done about this request.

The URL you give that you describe as a test index doesn't appear to even conform to PEP 503 for me:

❯ curl https://pkgs.dev.azure.com/Python/cpython/_packaging/TestFeed%40Release/pypi/simple/
{"$id":"1","innerException":null,"message":"TF400813: The user '' is not authorized to access this resource.","typeName":"Microsoft.TeamFoundation.Framework.Server.UnauthorizedRequestException, Microsoft.TeamFoundation.Framework.Server","typeKey":"UnauthorizedRequestException","errorCode":0,"eventId":3000}

@zooba
Contributor Author

zooba commented Jan 10, 2022

pip wheel pip -i https://pkgs.dev.azure.com/Python/cpython/_packaging/TestFeed%40Release/pypi/simple/

Run that twice (deleting the file in between) and you'll see it download the file twice. It should have been in the local cache.

On the results from your tests:

  • the file store doesn't support HEAD (which is weird, but the headers clearly show it), you should do a GET
  • the index doesn't support the top-level page (I know, I've reported it, they don't like it...) but it follows all the rest of PEP 503 just fine. Specify a package (in this case, pip, because that's all I put on the feed) and you'll get the right page
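For anyone who cannot reach the Azure feed, the redirect pattern can be imitated locally with nothing but the standard library. This is a rough stand-in, not the real service: the `/download/demo.whl` and `/blob?sig=N` paths, the counter-based "signature", and the payload are all invented for the demo.

```python
import itertools
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

WHEEL_BYTES = b"pretend this is a wheel"
_signatures = itertools.count(1)


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/download/demo.whl":
            # Stable URL: answer 303 with a new "signed" location each time.
            self.send_response(303)
            self.send_header("Location", f"/blob?sig={next(_signatures)}")
            self.send_header("Content-Length", "0")
            self.end_headers()
        elif self.path.startswith("/blob?sig="):
            self.send_response(200)
            self.send_header("Content-Length", str(len(WHEEL_BYTES)))
            self.end_headers()
            self.wfile.write(WHEEL_BYTES)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo quiet


# Ignore any proxy settings from the environment; we only talk to localhost.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch(url):
    """GET a URL, following redirects; return (final_url, body)."""
    with opener.open(url, timeout=10) as resp:
        return resp.geturl(), resp.read()


server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    stable = f"http://127.0.0.1:{server.server_port}/download/demo.whl"
    final_1, body_1 = fetch(stable)
    final_2, body_2 = fetch(stable)
finally:
    server.shutdown()
    server.server_close()

print(final_1)
print(final_2)
```

The two printed URLs differ while the bodies are identical, so any cache keyed on the final URL would store the file twice and never get a hit.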

@pfmoore
Member

pfmoore commented Jan 10, 2022

Thanks, but I won't bother experimenting further. I'll leave that to someone who's inclined to actually look at the issue itself.

@pradyunsg
Member

Getting the index page:

❯ http GET https://pkgs.dev.azure.com/Python/cpython/_packaging/TestFeed%40Release/pypi/simple/pip/       
HTTP/1.1 200 OK
Access-Control-Expose-Headers: Request-Context
ActivityId: 7cc2c8b3-6612-4dd0-a6e2-2540bdfa2ab0
Cache-Control: no-cache
Content-Encoding: gzip
Content-Type: text/html
Date: Thu, 13 Jan 2022 04:41:31 GMT
Expires: -1
P3P: CP="CAO DSP COR ADMa DEV CONo TELo CUR PSA PSD TAI IVDo OUR SAMi BUS DEM NAV STA UNI COM INT PHY ONL FIN PUR LOC CNT"
Pragma: no-cache
Request-Context: appId=cid-v1:540f64bd-7388-47ab-bdf2-a94451f9dd55
Strict-Transport-Security: max-age=31536000; includeSubDomains
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cache: CONFIG_NOCACHE
X-Content-Type-Options: nosniff
X-FRAME-OPTIONS: SAMEORIGIN
X-MSEdge-Ref: Ref A: 6F245469BF604DD0A739EF7C259676C4 Ref B: LTSEDGE1015 Ref C: 2022-01-13T04:41:31Z
X-Packaging-Migration: PyPiBlobMetadataV2
X-TFS-ProcessId: 9b82894a-06f2-42f4-862e-33046ac75451
X-TFS-Session: 7cc2c8b3-6612-4dd0-a6e2-2540bdfa2ab0
X-VSS-E2EID: 7cc2c8b3-6612-4dd0-a6e2-2540bdfa2ab0
X-VSS-SenderDeploymentId: c9342659-5d46-33a5-295b-de367e0464e7

<html><head><title>Links for pip</title></head><body><h1>Links for pip</h1><a href="https://pkgs.dev.azure.com/Python/8e426817-76c0-4b99-ba9e-a48a1e4bd5db/_packaging/ad28b313-ee18-4ca2-912f-58714a0d2a78@c804ff44-3b35-4e5d-b661-69d5809c7788/pypi/download/pip/21.3.1/pip-21.3.1-py3-none-any.whl#sha256=deaf32dcd9ab821e359cd8330786bcd077604b5c5730c0b096eda46f95c24a2d" data-requires-python="&gt;=3.6">pip-21.3.1-py3-none-any.whl</a><br/></body></html>

Getting the actual artifact:

❯ http GET https://pkgs.dev.azure.com/Python/8e426817-76c0-4b99-ba9e-a48a1e4bd5db/_packaging/ad28b313-ee18-4ca2-912f-58714a0d2a78@c804ff44-3b35-4e5d-b661-69d5809c7788/pypi/download/pip/21.3.1/pip-21.3.1-py3-none-any.whl#sha256=deaf32dcd9ab821e359cd8330786bcd077604b5c5730c0b096eda46f95c24a2d
HTTP/1.1 303 See Other
Access-Control-Expose-Headers: Request-Context
ActivityId: 624814ce-c012-4121-925f-0fe23740ba0f
Cache-Control: no-cache
Content-Length: 0
Date: Thu, 13 Jan 2022 04:41:44 GMT
Expires: -1
Location: https://m6xvsblobprodcus342.blob.core.windows.net/b-c0fc90aaa9034cf78191b925daa75b5c/B6FA6804C3FEAE9F9C9CA911355EA45FDBC9C8835A6F77BB272289A7B6E44AA900.blob?sv=2019-07-07&sr=b&si=1&sig=ckLD9SuztSbKgccJH0iiIA3ExspVOyJFDNshXnj6V64%3D&spr=https&se=2022-01-14T04%3A41%3A44Z&rscl=x-e2eid-624814ce-c0124121-925f0fe2-3740ba0f-session-624814ce-c0124121-925f0fe2-3740ba0f&rscd=attachment%3B%20filename%3D%22pip-21.3.1-py3-none-any.whl%22
P3P: CP="CAO DSP COR ADMa DEV CONo TELo CUR PSA PSD TAI IVDo OUR SAMi BUS DEM NAV STA UNI COM INT PHY ONL FIN PUR LOC CNT"
Pragma: no-cache
Request-Context: appId=cid-v1:540f64bd-7388-47ab-bdf2-a94451f9dd55
Strict-Transport-Security: max-age=31536000; includeSubDomains
X-Cache: CONFIG_NOCACHE
X-Content-Type-Options: nosniff
X-FRAME-OPTIONS: SAMEORIGIN
X-MSEdge-Ref: Ref A: 972FC9B1AC9348008D3FCA759BC227E9 Ref B: LTSEDGE0920 Ref C: 2022-01-13T04:41:44Z
X-Packaging-Migration: PyPiBlobMetadataV2
X-TFS-ProcessId: 015cce23-7aba-49b1-b1ee-5e330c46ccee
X-TFS-Session: 624814ce-c012-4121-925f-0fe23740ba0f
X-VSS-E2EID: 624814ce-c012-4121-925f-0fe23740ba0f
X-VSS-SenderDeploymentId: c9342659-5d46-33a5-295b-de367e0464e7

There's Cache-Control: no-cache on both. 🤷🏽

@pradyunsg
Member

The final artifact provided:

❯ http GET "https://m6xvsblobprodcus342.blob.core.windows.net/b-c0fc90aaa9034cf78191b925daa75b5c/B6FA6804C3FEAE9F9C9CA911355EA45FDBC9C8835A6F77BB272289A7B6E44AA900.blob?sv=2019-07-07&sr=b&si=1&sig=ckLD9SuztSbKgccJH0iiIA3ExspVOyJFDNshXnj6V64%3D&spr=https&se=2022-01-14T04%3A41%3A44Z&rscl=x-e2eid-624814ce-c0124121-925f0fe2-3740ba0f-session-624814ce-c0124121-925f0fe2-3740ba0f&rscd=attachment%3B%20filename%3D%22pip-21.3.1-py3-none-any.whl%22"
HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Disposition: attachment; filename="pip-21.3.1-py3-none-any.whl"
Content-Language: x-e2eid-624814ce-c0124121-925f0fe2-3740ba0f-session-624814ce-c0124121-925f0fe2-3740ba0f
Content-Length: 1723581
Content-MD5: yEmkQSH4I8gG9gTWVo2eiQ==
Content-Type: application/octet-stream
Date: Thu, 13 Jan 2022 04:44:01 GMT
ETag: "0x8D9D453B5EF4F84"
Last-Modified: Mon, 10 Jan 2022 16:10:24 GMT
Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
x-ms-blob-type: BlockBlob
x-ms-creation-time: Mon, 10 Jan 2022 16:10:24 GMT
x-ms-lease-state: available
x-ms-lease-status: unlocked
x-ms-request-id: 1662baeb-f01e-0049-1538-08f27c000000
x-ms-server-encrypted: true
x-ms-version: 2019-07-07


+-----------------------------------------+
| NOTE: binary data not shown in terminal |
+-----------------------------------------+

This seems to be what @zooba wants to have stored in the cache.

@pradyunsg
Member

pradyunsg commented Jan 13, 2022

It definitely comes from cachecontrol, but the option to cache redirects is part of their public API.

Hmm... do you know how this is exposed?

Looking at https://github.com/ionrock/cachecontrol/blob/7815847b52aa370b3eb146a3db6bfb177d81be8d/cachecontrol/controller.py#L254, the only thing I see exposed is the ability to change which status codes are cached. I'm pretty sure we don't want to cache other status codes (eg: 303 here) and we likely also don't want to be implementing our own logic to make cachecontrol do the right thing (cache the 200 on the other side of a 303).

A good next step for solving this is to use a simple linear Python script with requests + cachecontrol + DictCache, and show how it can behave in a manner consistent with what is being requested here. If you're able to make cachecontrol cache the 200 but not the 303, then... well, then you have figured out more than I have; and can likely figure out the next steps from there. :)

@pradyunsg pradyunsg added help wanted For requesting inputs from other members of the community type: enhancement Improvements to functionality labels Jan 13, 2022
@pradyunsg pradyunsg changed the title pip install does not support 303 (See Other) as valid cacheable response Cache responses on the "other side" of redirects with status code 303 Jan 13, 2022
@pradyunsg
Member

pradyunsg commented Jan 13, 2022

Running with max verbosity:

❯ pip wheel pip -i https://pkgs.dev.azure.com/Python/cpython/_packaging/TestFeed%40Release/pypi/simple/ --log third.txt -vvv --progress-bar off
Created temporary directory: /private/var/folders/y1/j465wvf92vs938kmgqh63bj80000gn/T/pip-ephem-wheel-cache-s7cqgcb9
Created temporary directory: /private/var/folders/y1/j465wvf92vs938kmgqh63bj80000gn/T/pip-req-tracker-6tv6nuxd
Initialized build tracking at /private/var/folders/y1/j465wvf92vs938kmgqh63bj80000gn/T/pip-req-tracker-6tv6nuxd
Created build tracker: /private/var/folders/y1/j465wvf92vs938kmgqh63bj80000gn/T/pip-req-tracker-6tv6nuxd
Entered build tracker: /private/var/folders/y1/j465wvf92vs938kmgqh63bj80000gn/T/pip-req-tracker-6tv6nuxd
Created temporary directory: /private/var/folders/y1/j465wvf92vs938kmgqh63bj80000gn/T/pip-wheel-_xt6lhbu
Looking in indexes: https://pkgs.dev.azure.com/Python/cpython/_packaging/TestFeed%40Release/pypi/simple/
1 location(s) to search for versions of pip:
* https://pkgs.dev.azure.com/Python/cpython/_packaging/TestFeed%40Release/pypi/simple/pip/
Fetching project page and analyzing links: https://pkgs.dev.azure.com/Python/cpython/_packaging/TestFeed%40Release/pypi/simple/pip/
Getting page https://pkgs.dev.azure.com/Python/cpython/_packaging/TestFeed%40Release/pypi/simple/pip/
Found index url https://pkgs.dev.azure.com/Python/cpython/_packaging/TestFeed%40Release/pypi/simple/
Looking up "https://pkgs.dev.azure.com/Python/cpython/_packaging/TestFeed%40Release/pypi/simple/pip/" in the cache
Request header has "max_age" as 0, cache bypassed
Starting new HTTPS connection (1): pkgs.dev.azure.com:443
https://pkgs.dev.azure.com:443 "GET /Python/cpython/_packaging/TestFeed%40Release/pypi/simple/pip/ HTTP/1.1" 200 None
Updating cache with response from "https://pkgs.dev.azure.com/Python/cpython/_packaging/TestFeed%40Release/pypi/simple/pip/"
Caching b/c of expires header
  Found link https://pkgs.dev.azure.com/Python/8e426817-76c0-4b99-ba9e-a48a1e4bd5db/_packaging/ad28b313-ee18-4ca2-912f-58714a0d2a78@c804ff44-3b35-4e5d-b661-69d5809c7788/pypi/download/pip/21.3.1/pip-21.3.1-py3-none-any.whl#sha256=deaf32dcd9ab821e359cd8330786bcd077604b5c5730c0b096eda46f95c24a2d (from https://pkgs.dev.azure.com/Python/cpython/_packaging/TestFeed%40Release/pypi/simple/pip/) (requires-python:>=3.6), version: 21.3.1
Skipping link: not a file: https://pkgs.dev.azure.com/Python/cpython/_packaging/TestFeed%40Release/pypi/simple/pip/
Given no hashes to check 1 links for project 'pip': discarding no candidates
Collecting pip
  Created temporary directory: /private/var/folders/y1/j465wvf92vs938kmgqh63bj80000gn/T/pip-unpack-elmchy2y
  Looking up "https://pkgs.dev.azure.com/Python/8e426817-76c0-4b99-ba9e-a48a1e4bd5db/_packaging/ad28b313-ee18-4ca2-912f-58714a0d2a78@c804ff44-3b35-4e5d-b661-69d5809c7788/pypi/download/pip/21.3.1/pip-21.3.1-py3-none-any.whl" in the cache
  No cache entry available
  https://pkgs.dev.azure.com:443 "GET /Python/8e426817-76c0-4b99-ba9e-a48a1e4bd5db/_packaging/ad28b313-ee18-4ca2-912f-58714a0d2a78@c804ff44-3b35-4e5d-b661-69d5809c7788/pypi/download/pip/21.3.1/pip-21.3.1-py3-none-any.whl HTTP/1.1" 303 0
  Status code 303 not in (200, 203, 300, 301)
  Looking up "https://m6xvsblobprodcus342.blob.core.windows.net/b-c0fc90aaa9034cf78191b925daa75b5c/B6FA6804C3FEAE9F9C9CA911355EA45FDBC9C8835A6F77BB272289A7B6E44AA900.blob?sv=2019-07-07&sr=b&si=1&sig=7j8E67xnBoY6xnSLBcMY6M1Dj9hHkh8dF7tILf%2FyL8E%3D&spr=https&se=2022-01-14T05%3A13%3A01Z&rscl=x-e2eid-bee0eb4e-a9b3405a-b7943306-c6ea9f84-session-bee0eb4e-a9b3405a-b7943306-c6ea9f84&rscd=attachment%3B%20filename%3D%22pip-21.3.1-py3-none-any.whl%22" in the cache
  No cache entry available
  Starting new HTTPS connection (1): m6xvsblobprodcus342.blob.core.windows.net:443
  https://m6xvsblobprodcus342.blob.core.windows.net:443 "GET /b-c0fc90aaa9034cf78191b925daa75b5c/B6FA6804C3FEAE9F9C9CA911355EA45FDBC9C8835A6F77BB272289A7B6E44AA900.blob?sv=2019-07-07&sr=b&si=1&sig=7j8E67xnBoY6xnSLBcMY6M1Dj9hHkh8dF7tILf%2FyL8E%3D&spr=https&se=2022-01-14T05%3A13%3A01Z&rscl=x-e2eid-bee0eb4e-a9b3405a-b7943306-c6ea9f84-session-bee0eb4e-a9b3405a-b7943306-c6ea9f84&rscd=attachment%3B%20filename%3D%22pip-21.3.1-py3-none-any.whl%22 HTTP/1.1" 200 1723581
  Downloading https://pkgs.dev.azure.com/Python/8e426817-76c0-4b99-ba9e-a48a1e4bd5db/_packaging/ad28b313-ee18-4ca2-912f-58714a0d2a78@c804ff44-3b35-4e5d-b661-69d5809c7788/pypi/download/pip/21.3.1/pip-21.3.1-py3-none-any.whl (1.7 MB)
  Updating cache with response from "https://m6xvsblobprodcus342.blob.core.windows.net/b-c0fc90aaa9034cf78191b925daa75b5c/B6FA6804C3FEAE9F9C9CA911355EA45FDBC9C8835A6F77BB272289A7B6E44AA900.blob?sv=2019-07-07&sr=b&si=1&sig=7j8E67xnBoY6xnSLBcMY6M1Dj9hHkh8dF7tILf%2FyL8E%3D&spr=https&se=2022-01-14T05%3A13%3A01Z&rscl=x-e2eid-bee0eb4e-a9b3405a-b7943306-c6ea9f84-session-bee0eb4e-a9b3405a-b7943306-c6ea9f84&rscd=attachment%3B%20filename%3D%22pip-21.3.1-py3-none-any.whl%22"
  Caching due to etag

  Added pip from https://pkgs.dev.azure.com/Python/8e426817-76c0-4b99-ba9e-a48a1e4bd5db/_packaging/ad28b313-ee18-4ca2-912f-58714a0d2a78@c804ff44-3b35-4e5d-b661-69d5809c7788/pypi/download/pip/21.3.1/pip-21.3.1-py3-none-any.whl#sha256=deaf32dcd9ab821e359cd8330786bcd077604b5c5730c0b096eda46f95c24a2d to build tracker '/private/var/folders/y1/j465wvf92vs938kmgqh63bj80000gn/T/pip-req-tracker-6tv6nuxd'
  Removed pip from https://pkgs.dev.azure.com/Python/8e426817-76c0-4b99-ba9e-a48a1e4bd5db/_packaging/ad28b313-ee18-4ca2-912f-58714a0d2a78@c804ff44-3b35-4e5d-b661-69d5809c7788/pypi/download/pip/21.3.1/pip-21.3.1-py3-none-any.whl#sha256=deaf32dcd9ab821e359cd8330786bcd077604b5c5730c0b096eda46f95c24a2d from build tracker '/private/var/folders/y1/j465wvf92vs938kmgqh63bj80000gn/T/pip-req-tracker-6tv6nuxd'
Created temporary directory: /private/var/folders/y1/j465wvf92vs938kmgqh63bj80000gn/T/pip-unpack-dfvvmu77
Saved ./pip-21.3.1-py3-none-any.whl
Removed build tracker: '/private/var/folders/y1/j465wvf92vs938kmgqh63bj80000gn/T/pip-req-tracker-6tv6nuxd'

Notably:

  Updating cache with response from "https://m6xvsblobprodcus342.blob.core.windows.net/b-c0fc90aaa9034cf78191b925daa75b5c/B6FA6804C3FEAE9F9C9CA911355EA45FDBC9C8835A6F77BB272289A7B6E44AA900.blob?sv=2019-07-07&sr=b&si=1&sig=7j8E67xnBoY6xnSLBcMY6M1Dj9hHkh8dF7tILf%2FyL8E%3D&spr=https&se=2022-01-14T05%3A13%3A01Z&rscl=x-e2eid-bee0eb4e-a9b3405a-b7943306-c6ea9f84-session-bee0eb4e-a9b3405a-b7943306-c6ea9f84&rscd=attachment%3B%20filename%3D%22pip-21.3.1-py3-none-any.whl%22"
  Caching due to etag

Looking at the wheel cache:

❯ pip cache list pip                                                                                                           
Nothing cached.

Very curious what's happening here!

@pradyunsg
Member

Well... The responses are being stored in the http cache: 🤷🏽

❯ tree /Users/pradyunsg/Library/Caches/pip/http 
/Users/pradyunsg/Library/Caches/pip/http
├── 3
│   └── 9
│       └── 4
│           └── e
│               └── 2
│                   └── 394e2b8436899adb1ceffcd75b8b08a782385fc7af78ed99650f2dab
└── c
    └── 9
        └── 3
            └── e
                └── 7
                    └── c93e7becd134c70cf5a0c0ca6defb17ae8f30bd4bd26cf5aad59eddc

10 directories, 2 files

Looks like the wheel:

❯ head /Users/pradyunsg/Library/Caches/pip/http/3/9/4/e/2/394e2b8436899adb1ceffcd75b8b08a782385fc7af78ed99650f2dab
cc=4,��response��body�L�P�yVS�:4@�epip/__init__.py=P�J�@

Looks like the index page:

❯ head /Users/pradyunsg/Library/Caches/pip/http/c/9/3/e/7/c93e7becd134c70cf5a0c0ca6defb17ae8f30bd4bd26cf5aad59eddc
cc=4,��response��body����`I�%&/m�{J�J��t�`$ؐ@������iG#)�*��eVe]f@�
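The directory layout above is consistent with entries being named after a SHA-224 digest of the request URL, nested under the digest's first five hex characters. The sketch below reproduces that layout under that assumption. It skips whatever URL normalisation the real cache applies before hashing, so it will not necessarily reproduce the exact file names shown, and the `example.invalid` URLs are made up.

```python
import hashlib
import os


def cache_path(cache_dir: str, url: str) -> str:
    """Map a URL to a nested path: five one-character dirs, then the digest."""
    hashed = hashlib.sha224(url.encode()).hexdigest()
    parts = list(hashed[:5]) + [hashed]
    return os.path.join(cache_dir, *parts)


# Two made-up redirect targets that differ only in their signature parameter.
first = cache_path("http", "https://blobs.example.invalid/file.blob?sig=AAA")
second = cache_path("http", "https://blobs.example.invalid/file.blob?sig=BBB")
print(first)
print(second)
```

Because the whole URL, query string included, feeds the digest, a redirect target with a new signature lands in a different file even when the bytes behind it are the same.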

@zooba
Copy link
Contributor Author

zooba commented Jan 13, 2022

There's Cache-Control: no-cache on both. 🤷🏽

Yes, I said earlier to ignore that for now because we're figuring out what to change it to :) Even if it's "fixed", the wrong response is still cached.

Very curious what's happening here!

Pretty sure the wheel cache is only for wheels built by pip? And the HTTP cache is for wheels downloaded from the internet?

Adding downloaded wheels into the wheel cache might be the easier option - possibly less surprising as well. Though I see a few assumptions that only locally built wheels are in the wheel cache, so that's probably too hard to untangle.

A good next step for solving this is to use a simple linear Python script with requests + cachecontrol + DictCache, and show how it can behave in a manner consistent with what is being requested here. If you're able to make cachecontrol cache the 200 but not the 303, then... well, then you have figured out more than I have; and can likely figure out the next steps from there. :)

Yeah, I'm not sure I can figure this one out either :) It looks like as requests+cachecontrol step through the redirect they lose all prior context, or at least lose the request context necessary to recognise that the final URL was the result of a redirect (and even if we kept that context and tried to use a heuristic in the cache controller, it wouldn't be able to update the request to lie about where it came from).

Meanwhile, pip is fully aware of which URL it came from - the first "Downloading " message below comes after the redirect has been resolved, as does the final "Added pip from " message. So conceptually it seems that associating the wheel with the static URL is an okay thing to do, and we've just got to figure out the mechanics of actually storing it?

  Downloading https://pkgs.dev.azure.com/Python/8e426817-76c0-4b99-ba9e-a48a1e4bd5db/_packaging/ad28b313-ee18-4ca2-912f-58714a0d2a78@c804ff44-3b35-4e5d-b661-69d5809c7788/pypi/download/pip/21.3.1/pip-21.3.1-py3-none-any.whl (1.7 MB)
  Updating cache with response from "https://m6xvsblobprodcus342.blob.core.windows.net/b-c0fc90aaa9034cf78191b925daa75b5c/B6FA6804C3FEAE9F9C9CA911355EA45FDBC9C8835A6F77BB272289A7B6E44AA900.blob?sv=2019-07-07&sr=b&si=1&sig=7j8E67xnBoY6xnSLBcMY6M1Dj9hHkh8dF7tILf%2FyL8E%3D&spr=https&se=2022-01-14T05%3A13%3A01Z&rscl=x-e2eid-bee0eb4e-a9b3405a-b7943306-c6ea9f84-session-bee0eb4e-a9b3405a-b7943306-c6ea9f84&rscd=attachment%3B%20filename%3D%22pip-21.3.1-py3-none-any.whl%22"
  Caching due to etag

  Added pip from https://pkgs.dev.azure.com/Python/8e426817-76c0-4b99-ba9e-a48a1e4bd5db/_packaging/ad28b313-ee18-4ca2-912f-58714a0d2a78@c804ff44-3b35-4e5d-b661-69d5809c7788/pypi/download/pip/21.3.1/pip-21.3.1-py3-none-any.whl#sha256=deaf32dcd9ab821e359cd8330786bcd077604b5c5730c0b096eda46f95c24a2d to build tracker '/private/var/folders/y1/j465wvf92vs938kmgqh63bj80000gn/T/pip-req-tracker-6tv6nuxd'

@pradyunsg
Member

It is getting stored, but isn’t being found in the cache by cachecontrol. I’d imagine the redirect is implicated somehow.

Basically, the 200 response is being cached in the http cache (which should result in skipping the download). It doesn’t, however, get used on a subsequent lookup which… is likely the bug here. I think isolating this to determine whether cachecontrol has this behaviour or if something inside pip’s codebase is causing discrepancies would be the right thing to do.

@zooba
Contributor Author

zooba commented Jan 18, 2022

The 303 isn't being stored, because that's (approx.) correct, but the subsequent 200 is stored. Next time around, the fresh 303 provides a different final URL that is not in the cache.

So nobody is wrong right now, it's just that the current behaviour is operating at too low a level to capture the (implied) semantics of a package index. There's no reason why cachecontrol should transparently cache the target of a redirect as the result of the redirect, because only pip knows that the content at the end is meant to be the same each time.

I honestly don't see a fix for this other than replacing some/all of pip's use of cachecontrol with custom cache management. It's probably best for us to just promote users manually caching wheels from their indexes rather than relying on pip to do it. I can't really imagine the change needed to be a simple PR...

@pfmoore
Member

pfmoore commented Jan 18, 2022

I honestly don't see a fix for this other than replacing some/all of pip's use of cachecontrol with custom cache management. It's probably best for us to just promote users manually caching wheels from their indexes rather than relying on pip to do it. I can't really imagine the change needed to be a simple PR...

Agreed, this seems like far too specialised and complex behaviour to include in pip.

Normally, I'd recommend putting a proxying index (like simpleindex) in front of your actual index, to handle the caching, but I suspect you know enough to have considered and discarded that option before posting here. It's frustrating that I can't see a reasonable way of pip having a plugin architecture that would allow users to develop plugins for this sort of thing, without also committing us to maintaining a stable internal API 🙁

Hmm, requests has a plugin ecosystem - I wonder if we could expose that somehow from our vendored copy of requests, so that you could write a caching plugin for requests that did what you want and enable it within pip without needing to use any pip internal APIs? I'm not planning on looking any further into this, so it's purely speculation, but maybe it's a possibility? (TBH, though, after the keyring experience, I'm a bit wary of the support costs of any form of plugin system, no matter how well decoupled it is).

@notatallshaw
Member

Hmm, requests has a plugin ecosystem - I wonder if we could expose that somehow from our vendored copy of requests, so that you could write a caching plugin for requests that did what you want and enable it within pip without needing to use any pip internal APIs?

I would personally love this: if there were an official way for pip to hook into the requests plugin system, I could get pip using our enterprise's authentication mechanisms and drop "trusted-host" in most cases.

But, partly because I just read your post here from 2020 (#4475 (comment)) and partly because of the current state of requests, maybe it is not the best idea for pip to tie itself to requests forever?

Maybe a config option like:

pip.experimental.this.can.break.any.release.you.have.been.warned.requests.session.mount = MyHttpAdapter

😉

@pfmoore
Member

pfmoore commented Jan 18, 2022

Yeah good point, committing to always vendor requests is even worse than committing to a stable API...

@njsmith
Member

njsmith commented Dec 2, 2022

Another approach would be for pip to have a separate cache for artifacts that have associated content hashes. Like, instead of caching a wheel under its url with http cache semantics, cache it under its sha256 (which makes the cache semantics trivial).

Advantages:

  • you already need a special path for downloading these artifacts, so you can validate the hashes, so it shouldn't require major refactoring
  • you potentially get better cache hit rates when using multiple indices with overlapping content
  • you don't need to revalidate the hash every time, just once when putting the file into the cache
  • no need to invent your own proprietary variant of http cache semantics or talk to the ietf or anything
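A minimal sketch of such a content-addressed store, assuming a simple on-disk layout of my own choosing (a two-character prefix directory, then the full digest). `ContentAddressedCache` is a hypothetical name; nothing like it exists in pip as far as this thread shows.

```python
import hashlib
import os
import string
import tempfile
from typing import Optional


class ContentAddressedCache:
    """Store artifacts under their sha256, verifying once on insertion."""

    def __init__(self, root: str) -> None:
        self.root = root

    def _path(self, sha256_hex: str) -> str:
        # Refuse anything that is not a plain 64-character hex digest.
        if len(sha256_hex) != 64 or set(sha256_hex) - set(string.hexdigits):
            raise ValueError("not a sha256 hex digest")
        digest = sha256_hex.lower()
        return os.path.join(self.root, digest[:2], digest)

    def get(self, sha256_hex: str) -> Optional[bytes]:
        try:
            with open(self._path(sha256_hex), "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def put(self, expected_sha256: str, data: bytes) -> None:
        path = self._path(expected_sha256)
        actual = hashlib.sha256(data).hexdigest()
        if actual != expected_sha256.lower():
            raise ValueError("content does not match the expected hash")
        os.makedirs(os.path.dirname(path), exist_ok=True)
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
        os.replace(tmp, path)  # atomic rename so readers never see a partial file


payload = b"pretend this is a wheel"
digest = hashlib.sha256(payload).hexdigest()
cache = ContentAddressedCache(tempfile.mkdtemp())
miss = cache.get(digest)   # nothing stored yet
cache.put(digest, payload)
hit = cache.get(digest)    # same bytes, regardless of which URL served them
```

The hash is checked in `put` only, which matches the "validate once when putting the file into the cache" point above; `get` trusts what is already on disk.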

@pradyunsg
Member

That'd require trusting that the artifact has the hash that a remote says it'd have, no? In other words, trusting the remote-provided hash.

@dstufft
Member

dstufft commented Dec 2, 2022

Content addressable caching seems like a good path generally.

Trusting the remote provided hash is something we already do today, and in fact there's no way around trusting the remote for pip.

The bigger issue is just that none of the remote API specs require having a hash available, which is why we used URLs to start with, because it was guaranteed to exist and let us scope the cache key to be specific to the repo (and we just trusted that the repo wouldn't change the contents of the file we're caching).
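One way to live with hashes being optional is to pick the cache key per link: use the digest when the link carries one, and fall back to the URL otherwise. The helper below is a hypothetical sketch of that choice (`cache_key` is not a pip function), and it only looks at `sha256` fragments.

```python
from typing import Tuple
from urllib.parse import urldefrag


def cache_key(link: str) -> Tuple[str, str]:
    """Return ("sha256", digest) if the link has a hash, else ("url", url)."""
    url, fragment = urldefrag(link)
    name, _, value = fragment.partition("=")
    if name == "sha256" and len(value) == 64:
        return ("sha256", value.lower())
    # No usable hash: scope the entry to this repository's URL instead.
    return ("url", url)


base = "https://index.example.invalid/pkg-1.0-py3-none-any.whl"
with_hash = cache_key(base + "#sha256=" + "ab" * 32)
without_hash = cache_key(base)
print(with_hash)
print(without_hash)
```

Tagging the key with its kind keeps the two namespaces apart, so a URL-keyed entry can never be mistaken for a hash-keyed one.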

@njsmith
Member

njsmith commented Dec 3, 2022 via email

@zooba
Contributor Author

zooba commented Dec 5, 2022

Yes, a content-addressed cache would work fine.

For now I'm just recommending people use simpleindex as a workaround (which has the added benefit of being able to reference multiple indexes without dependency confusion).

@jmyersmsft

jmyersmsft commented Jan 6, 2023

We (Azure Artifacts) do currently always provide the sha256 hash for all files. Currently, that should include all files from upstreams. Although we only fetch the contents of files from upstreams when they're first requested for download, we should have hashes whether we've got the content locally or not since pypi.org and Azure Artifacts feeds provide hashes.

Hypothetically, if we added support for upstreaming to external indexes other than pypi.org (we currently have no such plans, but it's well within the realm of possibility), if the index doesn't provide hashes we wouldn't be able to provide hashes either until the file had been requested at least once (the process of ingesting the file into Azure Artifacts includes computing the hash). This behavior would be on a file-by-file basis, even within the same package version.

9 participants