Cache responses on the "other side" of redirects with status code 303 #10694
Okay, now this is effectively a "reopen" (feel free to transfer this comment to #8489 and just reopen it if you like): I have a private conversation going on with the Azure DevOps team about this issue, and they're using the 303 response to hide the fact that the target URL is not cacheable (because of query parameters). However, it is intentional that subsequent requests should come to the original URL and treat it as cached. The intent on their side is that the content should be cached, but not the redirect. That is, it'll always be the same file at the end; it'll just come from another source/mirror. There don't appear to be any 30x codes that specifically represent this, but maybe someone is aware of some other precedent? (FWIW, the Azure DevOps implementation here probably uses the same headers for their other package manager implementations, and may actually be the best precedent there is. But if we can at least agree on a code/header to indicate that pip can cache the resulting data against that URL, even though the final URL may change over time, we can make changes upstream to work with that.)
Also somewhat related is #10075, which it turns out only mattered because the 303 was being followed and the final URL used for caching instead of the original one. If I hack caching on 303s to be enabled (and bypass the currently not-okay headers) then that issue becomes redundant.
I think dstufft's comment that it would be useful to see the headers for both/all responses from the server is still pertinent.
This needs the following information: #8489 (comment)
I can get the headers, but they don't really matter. The response being checked is the redirected URL right now, which expires within minutes and is not going to be returned by subsequent 303s. Basically, pip ought to cache packages against the URL returned by the index, not by where that URL may eventually direct to (e.g. a local CDN endpoint). That way an index can consistently return the same URL when it knows the package is the same, rather than having that control taken away. If there's a particular header that could be used to signal that this is a canonical URL, we can add it. However, the existing Cache-Control options seem to sufficiently cover the case where an index wants to return a stable URL that should not be cached, because they want to return different files at different times from the same URL. There don't seem to be any good options for indicating that the redirecting URL is canonical for caching content (as opposed to caching the redirect location), so we're a little bit stuck.
IIRC the cache behaviour is inherited from cachecontrol, so it may make sense to loop in its maintainers. (First, someone needs to check and make sure that is indeed the case, of course.)
FWIW, it would also be useful for whoever maintains these bits of software to have a public reproducer for this.
It definitely comes from cachecontrol, but the option to cache redirects is part of their public API. And I'm proposing that pip use its additional knowledge about the semantics of the requests it is making to change its behaviour, not proposing that the behaviour should change generally for all 303 redirects (though I'm not opposed to a global status code that works this way).
You mean like a test index that 303-redirects file links? This one will do it (it only has a
The file URL here is perfectly static, but it returns 303 to point to a temporary, authenticated download link. The 303 response itself should not be cached, because it only contains a temporary URL, but the content at that URL is cacheable; it just won't be available from the redirected URL in the future (it remains accessible at the URL provided in the index data). There's no existing status code or cache control directive that adequately captures these semantics, but I also see no reason why pip can't cache the final response at the initial URL (as well as the final one too, if you want).
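These semantics can be reproduced with a stdlib-only sketch (all URLs, names, and content here are made up for illustration, not taken from the real index): a tiny server serves a stable `/pkg/demo.whl` URL that 303-redirects to a fresh temporary URL on every request, while the bytes behind it never change.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from itertools import count

CONTENT = b"fake wheel bytes"  # the file itself never changes
_token = count()               # but each download link is single-use

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/pkg/demo.whl":
            # Stable URL: always answers 303 pointing at a fresh temporary URL.
            self.send_response(303)
            self.send_header("Location", f"/tmp/{next(_token)}")
            self.send_header("Cache-Control", "no-store")  # the redirect is not cacheable
            self.end_headers()
        elif self.path.startswith("/tmp/"):
            # Temporary URL: serves the (always identical) content.
            self.send_response(200)
            self.send_header("Content-Length", str(len(CONTENT)))
            self.end_headers()
            self.wfile.write(CONTENT)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

# Two fetches of the same stable URL end at different final URLs...
r1 = urllib.request.urlopen(f"{base}/pkg/demo.whl")
r2 = urllib.request.urlopen(f"{base}/pkg/demo.whl")
body1, body2 = r1.read(), r2.read()
print(r1.url == r2.url)            # False: the temporary URLs differ
print(body1 == body2 == CONTENT)   # True: the content is identical
server.shutdown()
```

A URL-keyed HTTP cache sees only the ephemeral `/tmp/...` URLs here, which is the crux of the issue: the one URL worth caching against (`/pkg/demo.whl`) never carries a cacheable response.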
I tried that URL (with |
The URL that returns the 303 is The index URL doesn't return anything until you specify a package. Try |
When @pradyunsg said there should be a public reproducer for this, I think what we need is a script that can be run by anyone with a Python installation and that clearly demonstrates the issue without needing any external setup/knowledge. If I can't get the URLs you're providing to work even as a sanity check, they won't be much help to whoever tries to work out what needs to be done about this request. The URL you give that you describe as a test index doesn't appear to even conform to PEP 503 for me:
Run that twice (deleting the file in between) and you'll see it download the file twice. It should have been in the local cache. On the results from your tests:
Thanks, but I won't bother experimenting further. I'll leave that to someone who's inclined to actually look at the issue itself.
Getting the index page:
Getting the actual artifact:
There's |
The final artifact provided:
This seems to be what @zooba wants to have stored in the cache. |
Hmm... do you know how this is exposed? Looking at https://github.com/ionrock/cachecontrol/blob/7815847b52aa370b3eb146a3db6bfb177d81be8d/cachecontrol/controller.py#L254, the only thing I see exposed is the ability to change which status codes are cached. I'm pretty sure we don't want to cache other status codes (e.g. 303 here), and we likely also don't want to be implementing our own logic to make cachecontrol do the right thing (cache the 200 on the other side of a 303). A good next step for solving this is to use a simple linear Python script with requests + cachecontrol + DictCache, and show how it can behave in a manner consistent with what is being requested here. If you're able to make cachecontrol cache the 200 but not the 303, then... well, then you have figured out more than I have, and can likely figure out the next steps from there. :)
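Short of wiring up requests + cachecontrol, the failure mode can be modelled with a stdlib-only toy (not cachecontrol's real API; its controller's default cacheable set is roughly (200, 203, 300, 301), and the URLs here are hypothetical): a cache keyed by request URL that stores only whitelisted status codes ends up keying the 200 under the ephemeral final URL.

```python
# Toy model of a URL-keyed cache with a status-code whitelist,
# approximating cachecontrol's defaults.
CACHEABLE_STATUS = {200, 203, 300, 301}

class ToyCache:
    def __init__(self):
        self.store = {}

    def cache_response(self, url, status, body):
        # Only whitelisted statuses are stored, keyed by the URL requested.
        if status in CACHEABLE_STATUS:
            self.store[url] = body

    def cached_request(self, url):
        return self.store.get(url)

cache = ToyCache()

# First install: GET stable URL -> 303 -> GET /tmp/0 -> 200.
cache.cache_response("https://index/pkg/demo.whl", 303, b"")    # 303 not stored
cache.cache_response("https://cdn/tmp/0", 200, b"wheel bytes")  # 200 stored, wrong key

# Second install: the fresh 303 points at /tmp/1, so both lookups miss.
assert cache.cached_request("https://index/pkg/demo.whl") is None
assert cache.cached_request("https://cdn/tmp/1") is None
print("cache misses both times -> the wheel is downloaded again")
```

Simply adding 303 to the whitelist wouldn't help: it would cache the redirect (and its short-lived Location), which is exactly what the server's `no-store` is trying to prevent.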
Running with max verbosity:
Notably:
Looking at the wheel cache:
Very curious what's happening here!
Well... The responses are being stored in the http cache: 🤷🏽
Looks like the wheel:
Looks like the index page:
Yes, I said earlier to ignore that for now because we're figuring out what to change it to :) Even if it's "fixed", the wrong response is still cached.
Pretty sure the wheel cache is only for wheels built by pip? And the HTTP cache is for wheels downloaded from the internet? Adding downloaded wheels into the wheel cache might be the easier option - possibly less surprising as well. Though I see a few assumptions that only locally built wheels are in the wheel cache, so that's probably too hard to untangle.
Yeah, I'm not sure I can figure this one out either :) It looks like as requests+cachecontrol step through the redirect they lose all prior context, or at least lose the request context necessary to recognise that the final URL was the result of a redirect (and even if we kept that context and tried to use a

Meanwhile, pip is fully aware of which URL it came from - the first "Downloading " message below comes after the redirect has been resolved, as does the final "Added pip from " message. So conceptually it seems that associating the wheel with the static URL is an okay thing to do, and we've just got to figure out the mechanics of actually storing it?
It is getting stored, but isn’t being found in the cache by cachecontrol. I’d imagine the redirect is implicated somehow. Basically, the 200 response is being cached in the http cache (which should result in skipping the download). It doesn’t, however, get used on a subsequent lookup which… is likely the bug here. I think isolating this to determine whether cachecontrol has this behaviour or if something inside pip’s codebase is causing discrepancies would be the right thing to do.
The 303 isn't being stored, because that's (approximately) correct, but the subsequent 200 is stored. Next time around, the fresh 303 provides a different final URL that is not in the cache. So nobody is wrong right now; it's just that the current behaviour is operating at too low a level to capture the (implied) semantics of a package index. There's no reason why cachecontrol should transparently cache the target of a redirect as the result of the redirect, because only pip knows that the content at the end is meant to be the same each time. I honestly don't see a fix for this other than replacing some/all of pip's use of cachecontrol with custom cache management. It's probably best for us to just recommend that users manually cache wheels from their indexes rather than relying on pip to do it. I can't really imagine the change needed being a simple PR...
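One shape such custom cache management could take (a sketch with entirely hypothetical names, not pip's actual internals): let the fetch layer remember which URL pip originally asked for and key the cached body under it, regardless of where the redirect chain ends up.

```python
# Hypothetical redirect-aware cache: the 200 at the end of the redirect
# chain is stored under the original, stable URL rather than the
# ephemeral final one.
class RedirectAwareCache:
    def __init__(self):
        self.store = {}

    def fetch(self, url, transport):
        if url in self.store:
            return self.store[url]        # cache hit: no download at all
        final_url, body = transport(url)  # transport follows any 303s itself
        self.store[url] = body            # key = original URL, not final_url
        return body

# Fake transport standing in for requests: same stable URL in,
# a different temporary final URL out on every call.
calls = []
def transport(url):
    calls.append(url)
    return (f"https://cdn/tmp/{len(calls)}", b"wheel bytes")

cache = RedirectAwareCache()
a = cache.fetch("https://index/pkg/demo.whl", transport)
b = cache.fetch("https://index/pkg/demo.whl", transport)
assert a == b == b"wheel bytes"
assert len(calls) == 1  # the second fetch never touched the network
```

The awkward part, as noted above, is that this decision has to live where both the original URL and the final response are visible, which is exactly the context requests+cachecontrol discard while following the redirect.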
Agreed, this seems like far too specialised and complex behaviour to include in pip. Normally, I'd recommend putting a proxying index (like simpleindex) in front of your actual index, to handle the caching, but I suspect you know enough to have considered and discarded that option before posting here. It's frustrating that I can't see a reasonable way of pip having a plugin architecture that would allow users to develop plugins for this sort of thing, without also committing us to maintaining a stable internal API 🙁 Hmm, requests has a plugin ecosystem - I wonder if we could expose that somehow from our vendored copy of requests, so that you could write a caching plugin for requests that did what you want and enable it within pip without needing to use any pip internal APIs? I'm not planning on looking any further into this, so it's purely speculation, but maybe it's a possibility? (TBH, though, after the keyring experience, I'm a bit wary of the support costs of any form of plugin system, no matter how well decoupled it is.)
I would personally love this, as if there were an official way for pip to hook into requests' plugin system I could get pip using our enterprise's authentication mechanisms and drop "trusted-host" in most cases. But partly because I just read your post here from 2020: #4475 (comment) and partly because of the current state of requests, it is maybe not the best idea for pip to tie itself to requests forever? Maybe a config option like:
😉 |
Yeah good point, committing to always vendor requests is even worse than committing to a stable API... |
Another approach would be for pip to have a separate cache for artifacts that have associated content hashes. Like, instead of caching a wheel under its URL with http cache semantics, cache it under its sha256 (which makes the cache semantics trivial). Advantages:
That'd require trusting that the artifact has the hash that a remote says it'd have, no? In other words, trusting the remote-provided hash. |
Content addressable caching seems like a good path generally. Trusting the remote provided hash is something we already do today, and in fact there's no way around trusting the remote for pip. The bigger issue is just that none of the remote API specs require having a hash available, which is why we used URLs to start with, because it was guaranteed to exist and let us scope the cache key to be specific to the repo (and we just trusted that the repo wouldn't change the contents of the file we're caching). |
Yeah, you'd need some logic to do 'if there's a hash, fetch it with http caching disabled and stick it in the content-addressed cache, otherwise, fetch it with http caching enabled as usual'. Then Steve just needs to tell the Azure folks to always provide a hash, which I assume they would be doing anyway.
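That rule can be sketched as follows (hypothetical names throughout; `download` stands in for the real HTTP layer, and the hash-verification step also gives the integrity check pip would want before trusting a remote-provided hash):

```python
import hashlib

# Content-addressed store: the key *is* the sha256 of the body, so cache
# validity is trivial and URL churn behind redirects is irrelevant.
class ContentAddressedStore:
    def __init__(self):
        self.blobs = {}

    def get(self, sha256):
        return self.blobs.get(sha256)

    def put(self, body):
        digest = hashlib.sha256(body).hexdigest()
        self.blobs[digest] = body
        return digest

def fetch(url, expected_sha256, store, download):
    if expected_sha256 is None:
        return download(url)        # no hash from the index: normal http-cached path
    cached = store.get(expected_sha256)
    if cached is not None:
        return cached               # key == hash, so a hit needs no revalidation
    body = download(url)
    if hashlib.sha256(body).hexdigest() != expected_sha256:
        raise ValueError("hash mismatch")  # don't poison the cache
    store.put(body)
    return body

# Fake downloader so the sketch is self-contained.
downloads = []
def download(url):
    downloads.append(url)
    return b"wheel bytes"

store = ContentAddressedStore()
digest = hashlib.sha256(b"wheel bytes").hexdigest()
fetch("https://index/pkg/demo.whl", digest, store, download)
fetch("https://index/pkg/demo.whl", digest, store, download)
assert len(downloads) == 1  # second fetch served from the content-addressed store
```

Note the fallback branch is what keeps this compatible with indexes that don't publish hashes, which is the gap in the API specs mentioned above.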
Yes, a content-addressed cache would work fine. For now I'm just recommending people use simpleindex as a workaround (which has the added benefit of being able to reference multiple indexes without dependency confusion). |
We (Azure Artifacts) do currently always provide the sha256 hash for all files. Currently, that should include all files from upstreams. Although we only fetch the contents of files from upstreams when they're first requested for download, we should have hashes whether we've got the content locally or not, since pypi.org and Azure Artifacts feeds provide hashes. Hypothetically, if we added support for upstreaming to external indexes other than pypi.org (we currently have no such plans, but it's well within the realm of possibility) and the index doesn't provide hashes, we wouldn't be able to provide hashes either until the file had been requested at least once (the process of ingesting the file into Azure Artifacts includes computing the hash). This behavior would be on a file-by-file basis, even within the same package version.
Description
See #8489 (cannot comment there because the bot locked it)
Expected behavior
See #8489
pip version
Latest
Python version
any
OS
any
How to Reproduce
See #8489
Output
No response