pip 20.3.1 wheel command downloads multiple versions of the same package but keeps only one #9271

Closed
ianw opened this issue Dec 14, 2020 · 16 comments
Labels
C: download About fetching data from PyPI and other sources C: wheel The wheel format and 'pip wheel' command state: needs discussion This needs some more discussion

Comments

@ianw
Contributor

ianw commented Dec 14, 2020

$ ./bin/pip --version
pip 20.3.1 from /tmp/foo/lib64/python3.9/site-packages/pip (python 3.9)

This comes from OpenStack environments where we are using an upper-constraints.txt file.

The following command with upper-constraints.txt

/bin/pip --verbose  wheel --exists-action=i  -c ./upper-constraints.txt -w . gabbi===1.49.0

seems to download 8 and discard 7 different versions of pytest for some reason

  Downloading gabbi-1.49.0-py2.py3-none-any.whl (208 kB)
  ...
  Downloading pytest-6.2.0-py3-none-any.whl (279 kB)
  Downloading pytest-6.1.2-py3-none-any.whl (272 kB)
  Downloading pytest-6.1.1-py3-none-any.whl (272 kB)
  Downloading pytest-6.1.0-py3-none-any.whl (272 kB)
  Downloading pytest-6.0.2-py3-none-any.whl (270 kB)
  Downloading pytest-6.0.1-py3-none-any.whl (270 kB)
  Downloading pytest-6.0.0-py3-none-any.whl (270 kB)
  Downloading pytest-5.4.3-py3-none-any.whl (248 kB)
  ...
  Downloading zipp-0.6.0-py2.py3-none-any.whl (4.1 kB)

5.4.3 is the one it sticks with, and the only .whl file left behind. Watching with strace, the others appear to be downloaded but are ultimately unlinked.

If you try this with --use-deprecated=legacy-resolver it chooses pytest-6.2.0-py3-none-any.whl (wrong I guess, and the new resolver is getting it right), but only downloads it once. So I think the resolver is getting this right, but I don't think it's quite right that it downloads and discards the same package multiple times to get to that point.

Unfortunately I haven't yet been able to reduce the upper-constraints.txt to something smaller that replicates this. I'm sure it involves transitive dependencies I can't see on the large number of packages specified there.

For reference, we've found this because we build our own wheel caches with pip wheel. We are parsing the logs to see which wheels pip downloaded, and which we built locally (i.e. not available from pypi and are thus worth keeping in our cache). Our script was assuming that anything pip reports as "Downloading " was on-disk and could be deleted (see here). There are several other occurrences of similar behaviour with other packages visible in the logs linked below, I just pulled this one as an example.
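The log-parsing assumption described above can be checked directly. The following is a minimal sketch, not the actual OpenStack script: the function names and the regular expression are illustrative assumptions, based only on the "Downloading <file>.whl (<size>)" log lines quoted in this report. It separates the wheels pip reported as downloaded into those that are actually present in the output directory and those that are not.

```python
import re
from pathlib import Path

# Matches pip log lines such as
# "  Downloading pytest-6.2.0-py3-none-any.whl (279 kB)".
# The pattern is an assumption based on the log excerpts in this issue.
DOWNLOADING = re.compile(r"^\s*Downloading (\S+\.whl)\b")


def downloaded_wheels(log_text):
    """Return wheel filenames pip reported as downloaded, in log order."""
    names = []
    for line in log_text.splitlines():
        match = DOWNLOADING.match(line)
        if match:
            names.append(match.group(1))
    return names


def split_by_presence(log_text, wheel_dir):
    """Split reported downloads into (present, missing) relative to wheel_dir."""
    present, missing = [], []
    for name in downloaded_wheels(log_text):
        if (Path(wheel_dir) / name).is_file():
            present.append(name)
        else:
            missing.append(name)
    return present, missing
```

With a check like this, a cleanup script would only delete files in the "present" list instead of assuming every "Downloading" line corresponds to a file on disk.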

Files:

@uranusjr
Member

uranusjr commented Dec 14, 2020

FWIW this is actually intentionally “fixed” because the behaviour you expected (keep all wheels downloaded and built during resolution) was reported as a bug.

@pradyunsg
Member

pradyunsg commented Dec 14, 2020

#8827 was the report, I think.

@ianw
Contributor Author

ianw commented Dec 14, 2020

> FWIW this is actually intentionally “fixed” because the behaviour you expected (keep all wheels downloaded and built during resolution) was reported as a bug.

OK; I'm not particularly fussed about the download and save behaviour :) However I do think it's not great that this downloaded 7 versions of the whl to only keep the 8th. This is one case with a small wheel but probably does play out in CI in various ways many, many times.

I think it has to do with the large upper-constraints.txt? In theory, this should be a list of packages that all should work together and the resolver should essentially have nothing to do. As I mentioned I didn't have much luck breaking that down to replicate with something smaller. Is there some way to dump why pip thinks it needs to download all the intermediate versions?

@pradyunsg
Member

Folks hitting "pip's downloading everything" are likely hitting #9284.

@pradyunsg pradyunsg added C: download About fetching data from PyPI and other sources C: new resolver C: wheel The wheel format and 'pip wheel' command state: needs discussion This needs some more discussion labels Dec 15, 2020
@ianw
Contributor Author

ianw commented Dec 16, 2020

FWIW the original issue still replicates with 20.3.3

$ ./bin/pip --version
pip 20.3.3 from /tmp/req/lib64/python3.9/site-packages/pip (python 3.9)

$ ./bin/pip  wheel --exists-action=i  -c ../upper-constraints.txt -w . gabbi===1.49.0 2>&1 | tee out.log
... blah ...
Collecting pytest
  Downloading pytest-6.2.1-py3-none-any.whl (279 kB)
  Downloading pytest-6.2.0-py3-none-any.whl (279 kB)
  Downloading pytest-6.1.2-py3-none-any.whl (272 kB)
  Downloading pytest-6.1.1-py3-none-any.whl (272 kB)
  Downloading pytest-6.1.0-py3-none-any.whl (272 kB)
  Downloading pytest-6.0.2-py3-none-any.whl (270 kB)
  Downloading pytest-6.0.1-py3-none-any.whl (270 kB)
  Downloading pytest-6.0.0-py3-none-any.whl (270 kB)
  Downloading pytest-5.4.3-py3-none-any.whl (248 kB)
....

Still not sure what about this constraints file makes pytest special.

@binbjz

binbjz commented Dec 18, 2020

Same issue.

$ pip install -U ansible
Requirement already satisfied: ansible in /Users/binbjz/.pyenv/versions/3.9.1/lib/python3.9/site-packages (2.10.4)
Collecting ansible
  Using cached ansible-2.10.4-py3-none-any.whl
  Using cached ansible-2.10.3-py3-none-any.whl
  Using cached ansible-2.10.2.tar.gz (40.6 MB)

So for upgrading all packages, I use the second method below to avoid downloading multiple versions of the same package. I think this is an issue.

  1. Upgrade all packages (default resolver)
$ pip freeze | cut -d'=' -f1 | xargs -n1 pip install -U
  2. Upgrade all packages (legacy resolver)
$ pip freeze | cut -d'=' -f1 | xargs -n1 pip install -U --use-deprecated=legacy-resolver

@pabloa

pabloa commented Dec 18, 2020

Confirmed the issue with 20.3.3.
I cleaned the caches and set up another virtual machine without a previous cache (or Python), and it still happens.

@uranusjr
Member

I thought about this for a while, and decided that removing unmatched downloads is the correct behaviour. Additional version downloads happen when pip discovers incompatibilities in the dependency graph and performs backtracking; those downloads are, therefore, unusable in the environment (otherwise pip wouldn’t need to download other things). Both pip download and pip wheel are generally used by people building a local cache of installable packages that they can run pip install --no-index --find-links <directory> against. And when that’s invoked, pip needs to visit those additional unusable artifacts again to know they are not compatible. In other words, deleting them from the final downloads would actually make subsequent installations faster by eliminating those incompatible distributions, which is better than keeping them.
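To illustrate why backtracking leads to extra downloads, here is a toy sketch. It is not pip's resolver: the projects "tool" and "helper", their versions, and their dependency data are all hypothetical, and the search is a naive depth-first one. The only idea it borrows from the comment above is that a candidate's dependencies are unknown until its file has been fetched, so every rejected candidate still costs a download.

```python
# Hypothetical index: (project, version) -> {dependency: allowed versions}.
# In this toy model the dependency data is only visible after "downloading".
INDEX = {
    ("tool", 3): {"helper": {2}},
    ("tool", 2): {"helper": {2}},
    ("tool", 1): {"helper": {1}},
    ("helper", 2): {},
    ("helper", 1): {},
}


def resolve(todo, allowed, pins, index, downloads):
    """Naive backtracking: return a {project: version} dict, or None."""
    if not todo:
        return pins
    name, rest = todo[0], todo[1:]
    if name in pins:
        return resolve(rest, allowed, pins, index, downloads)
    candidates = sorted((v for n, v in index if n == name), reverse=True)
    for version in candidates:
        if name in allowed and version not in allowed[name]:
            continue  # ruled out by known requirements, no download needed
        downloads.append((name, version))  # fetch the file to read metadata
        deps = index[(name, version)]
        narrowed = dict(allowed)
        compatible = True
        for dep, versions in deps.items():
            narrowed[dep] = narrowed.get(dep, versions) & versions
            if not narrowed[dep] or (dep in pins and pins[dep] not in narrowed[dep]):
                compatible = False  # conflict: backtrack to an older version
                break
        if not compatible:
            continue
        result = resolve(rest + list(deps), narrowed,
                         {**pins, name: version}, index, downloads)
        if result is not None:
            return result
    return None


# A constraints file pinning helper to version 1 (hypothetical).
downloads = []
pins = resolve(["tool"], {"helper": {1}}, {}, INDEX, downloads)
```

In this run, all three versions of "tool" are fetched before the oldest one is accepted, while "helper" is fetched only once because the constraint on it is known up front. Only the two pinned versions would be usable in the resulting environment.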

@pabloa

pabloa commented Dec 20, 2020

What about CI server builds and similar setups? They do not keep a pip cache; they start from zero. In our case, we build everything from scratch (including installing the latest pip and starting with no cache) to discover installation issues. This is a valid use case. We do not want to roll new versions out to our clusters without testing the installation process.

The previous pip resolver did not need so much time (almost an extra hour) to build everything. Is it possible to postpone this resolver until the metadata can be accessed without downloading gigabytes of package files? Perhaps with the Simple API or some RESTful service?

Disclosure: This is a good reason against investing in Python. I do not consider Python mature enough for solving machine learning problems because of this periodic devops drama. Every new issue that makes things more difficult is a reason to consider alternatives such as Julia or C++. So keep up. :)

@uranusjr
Member

Sorry but I don’t get what you’re trying to say. This issue is about pip wheel (and pip download) not “exporting” artifacts pip discarded during resolution. The cache has nothing to do with this, and your comment is the first ever to mention it.

@ianw
Contributor Author

ianw commented Dec 20, 2020

> This issue is about pip wheel (and pip download) not “exporting” artifacts pip discarded during resolution.

I wouldn't dispute that discarding the unnecessary downloads is the right thing to do. We were caught by this because of our admittedly obscure wheel-building scripts, which were parsing "Downloading ..." lines and assuming those files existed on disk.

My main concern now is that it does download so much only to throw it away. If this is playing out in many other situations among our thousands of CI jobs that's not good, so anything we can do to understand and mitigate it would probably help.

It's not 100% clear to me this is the same issue as mentioned in #9271 (comment) but maybe it is? I'm not sure what about my original report triggered just the pytest package to exhibit this behaviour?

To refine it a little more: the wheel building is a bit of a red herring, I guess ... the extra downloads happen with just

python3 -m venv test-env
./test-env/bin/pip install --upgrade pip
wget https://github.com/pypa/pip/files/5686307/upper-constraints.txt
./test-env/bin/pip install -c ./upper-constraints.txt gabbi===1.49.0

@uranusjr
Member

Is the issue here “pip downloads too many unnecessary things” or “pip throws away downloads it deems unnecessary”? The topic in this thread is changing from message to message, and I no longer follow what you really think is the issue, and what you think pip should do to fix things for you.

pip downloading multiple versions of a package is an infrastructure issue, and has been heavily discussed in pypi/warehouse#8254, #9215, and many other places. pip has to download them, at least for now, due to how Python defines package metadata, to provide you a workable dependency environment.
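As a rough illustration of the metadata point: a wheel is a zip archive whose dependency declarations sit in the METADATA file inside its .dist-info directory, so a tool generally needs the file itself to read them. The sketch below uses only the standard library; the function name is an illustrative assumption, and this is not how pip itself reads metadata. Source distributions are harder still, since their metadata may only be available after a build step.

```python
import zipfile
from email.parser import Parser


def requires_dist(wheel_path):
    """Return the Requires-Dist entries declared in a wheel's METADATA file."""
    with zipfile.ZipFile(wheel_path) as wheel:
        # METADATA lives in "<name>-<version>.dist-info/METADATA".
        metadata_name = next(
            name for name in wheel.namelist()
            if name.endswith(".dist-info/METADATA")
        )
        text = wheel.read(metadata_name).decode("utf-8")
    # METADATA uses an email-header style format, so the stdlib parser can read it.
    message = Parser().parsestr(text)
    return message.get_all("Requires-Dist") or []
```

Calling requires_dist on a downloaded wheel, for example one of the pytest wheels listed earlier in this thread, returns its declared dependencies as a list of requirement strings.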

As for the second topic, “throws away downloads” is not an accurate description. The downloads are not thrown away (unless you tell pip to with --no-cache-dir), but kept in a place not visible to the user (pip’s cache). What pip wheel does is more like exporting the cache content into an actual wheel. And as I said above, pip exporting only usable artifacts is a better behaviour for people using the commands as they were designed to be used, so we are not going to change it.

@ianw
Contributor Author

ianw commented Dec 21, 2020

Thanks, I don't think at this point this issue will help.

The original issue was the output changing to show downloads that were not then kept in the wheel directory.

It has become clear that these extra downloads are not a unique bug and are related to the issues you mention. I agree these intermediate downloads should not be in the final output.

Unfortunately in a CI situation these downloads are effectively thrown away as we start with fresh environments.

@ianw ianw closed this as completed Dec 21, 2020
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 5, 2021