Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster no-ops for VCS dependencies #1415

Closed
wants to merge 4 commits into from
Closed

Faster no-ops for VCS dependencies #1415

wants to merge 4 commits into from

Conversation

FuegoFro
Copy link
Contributor

This allows pip-sync to no-op faster for VCS dependencies that haven't changed. Before this change when running pip-sync with a requirements file that has VCS packages it would always uninstall and reinstall those dependencies. This is because pip primarily operates on versions, and assumes that different code will have a different version. However this assumption breaks down with VCS packages, in which two different commits may very well have different code but the same version. Therefore to be safe we historically have had to reinstall these VCS dependencies.

This extends pip-sync to write some extra metadata to the installed package to be able to track what revision the package was installed from, and then skip the uninstall-and-reinstall if the revision that we want to install is the same as the revision that is already installed. This significantly speeds up no-op pip-sync invocations for requirements that have multiple VCS packages.

Tests

The tests added are written as best I could for unit tests. I think a better test would likely be to spin up an entirely new virtualenv, do a sync in it, and verify that a subsequent sync doesn't reinstall anything, but the exsting tests seem less like full integration tests so I didn't go down that route. As is, there's a somewhat tricky bit of logic that reaches into the workings of pip that's mocked out in the tests (namely the _reload_installed_distributions_by_key function).

Performance

Given a requirements.in of

-e git+https://github.com/python/[email protected]#egg=mypy
-e git+https://github.com/psf/[email protected]#egg=black
-e git+https://github.com/yaml/[email protected]#egg=pyyaml

and a corresponding requirements.txt of

-e git+https://github.com/psf/[email protected]#egg=black
-e git+https://github.com/python/[email protected]#egg=mypy
-e git+https://github.com/yaml/[email protected]#egg=pyyaml
appdirs==1.4.4
click==8.0.1
importlib-metadata==4.5.0
mypy-extensions==0.4.3
pathspec==0.8.1
regex==2021.4.4
toml==0.10.2
typed-ast==1.4.3
typing-extensions==3.10.0.0
zipp==3.4.1

I ran tests to determine how this affects the fresh install time as well as the no-op install time. I also did the tests but with non-editable VCS dependencies, both with the tag and with an exact commit hash (since pip has optimizations for the latter case):

Environment Fresh install time (approx seconds) No-op install time (approx seconds)
Editable, without optimization 20 12
Editable, with optimization 20 🟡 0.5 🟢
Non-editable, without optimization 20 20
Non-editable, with optimization 25 🔴 5 🟢
Non-editable with exact commit, without optimization 3 2
Non-editable with exact commit, with optimization 7 🔴 4 🔴

As you can see, most of the number remain similar (for fresh installs) or significantly better (for no-op installs). There are some regressions in a few places, namely due to the extra time it takes to calculate the relevant information, both when detecting no-ops before installation and when writing that into the environment after installation.

Changelog-friendly one-liner: Faster no-ops for VCS dependencies

Contributor checklist
  • Provided the tests for the changes.
  • Gave a clear one-line description in the PR (that the maintainers can add to CHANGELOG.md on release).
  • Assign the PR to an existing or new milestone for the target version (following Semantic Versioning).

@FuegoFro
Copy link
Contributor Author

Hey there! I wasn't sure if this is a change you'd be interested in. I think it can be pretty helpful in a number of situations (the 10x speedup for the editable case feels pretty compelling), but it does have the downside of mucking around with pip internals, changing packages after they're installed (understandably spooky), and comes with a bit of a performance hit in certain situations. In any case, I figured I'd clean it up and send it along in case it is something you all think would be useful to have.

@codecov

This comment has been minimized.

@FuegoFro
Copy link
Contributor Author

Sorry for the force pushes, they're to (hopefully 🤞) fix the tests. I should have run the full tox suite locally first, as the docs say to do 🤦

@ssbarnea
Copy link
Member

@FuegoFro No worries about you revision preference, github got better recently with force pushes. Still, please fix conflicts and assure everything reports green, maybe we can merge it for next release.

@ssbarnea ssbarnea added the enhancement Improvements to functionality label Jun 19, 2021
@ssbarnea ssbarnea requested review from atugushev and webknjaz June 19, 2021 11:44
@ssbarnea
Copy link
Member

ssbarnea commented Jul 1, 2021

Please address conflicts. Thanks

FuegoFro added 4 commits July 15, 2021 23:14
This allows pip-sync to no-op faster for VCS dependencies that haven't changed. Before this change when running `pip-sync` with a requirements file that has VCS packages it would always uninstall and reinstall those dependencies. This is because `pip` primarily operates on versions, and assumes that different code will have a different version. However this assumption breaks down with VCS packages, in which two different commits may very well have different code but the same version. Therefore to be safe we historically have had to reinstall these VCS dependencies.

This extends `pip-sync` to write some extra metadata to the installed package to be able to track what revision the package was installed from, and then skip the uninstall-and-reinstall if the revision that we want to install is the same as the revision that is already installed. This significantly speeds up no-op `pip-sync` invocations for requirements that have multiple VCS packages.

The tests added are written as best I could for unit tests. I think a better test would likely be to spin up an entirely new virtualenv, do a sync in it, and verify that a subsequent sync doesn't reinstall anything, but the exsting tests seem less like full integration tests so I didn't go down that route. As is, there's a somewhat tricky bit of logic that reaches into the workings of `pip` that's mocked out in the tests (namely the `_reload_installed_distributions_by_key` function).

Given a `requirements.in` of

```
-e git+https://github.com/python/[email protected]#egg=mypy
-e git+https://github.com/psf/[email protected]#egg=black
-e git+https://github.com/yaml/[email protected]#egg=pyyaml
```

and a corresponding `requirements.txt` of

```
-e git+https://github.com/psf/[email protected]#egg=black
-e git+https://github.com/python/[email protected]#egg=mypy
-e git+https://github.com/yaml/[email protected]#egg=pyyaml
appdirs==1.4.4
click==8.0.1
importlib-metadata==4.5.0
mypy-extensions==0.4.3
pathspec==0.8.1
regex==2021.4.4
toml==0.10.2
typed-ast==1.4.3
typing-extensions==3.10.0.0
zipp==3.4.1
```

I ran tests to determine how this affects the fresh install time as well as the no-op install time. I also did the tests but with non-editable VCS dependencies, both with the tag and with an exact commit hash (since pip has optimizations for the latter case):

| Environment                                          | Fresh install time (approx seconds) | No-op install time (approx seconds) |
| ---------------------------------------------------- | ----------------------------------- | ----------------------------------- |
| Editable, without optimization                       | 20                                  | 12                                  |
| Editable, with optimization                          | 20 🟡                               | 0.5 🟢                              |
| Non-editable, without optimization                   | 20                                  | 20                                  |
| Non-editable, with optimization                      | **25** 🔴                           | 5 🟢                                |
| Non-editable with exact commit, without optimization | 3                                   | 2                                   |
| Non-editable with exact commit, with optimization    | **7** 🔴                            | **4** 🔴                            |

As you can see, most of the number remain similar (for fresh installs) or significantly better (for no-op installs). There are some regressions in a few places, namely due to the extra time it takes to calculate the relevant information, both when detecting no-ops before installation and when writing that into the environment after installation.
@FuegoFro
Copy link
Contributor Author

Hey, sorry for the long delay getting back on this! I've rebased and resolved conflicts and have tests passing. I didn't have a Windows machine to test them on (and the failure seems to be Windows specific), so I apologize for the iterative approach here. The "fix" involved just... ignoring PermissionErrors when cleaning up dirs, so not sure how you feel about that haha.

Also I wasn't sure if the code coverage provided by my current tests will be sufficient, but I wanted to get a sense of whether this is even worth investing more time in before trying to add more tests (though, I'm also at a bit of a loss for what meaningful other tests could be added; would love advice there if we do want more coverage).

Thanks!

Copy link
Member

@atugushev atugushev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to get a sense of whether this is even worth investing more time in before trying to add more tests

We tend to move away from using pip private internals, see #1654. So I would vote against bringing more internals to the codebase, even if it's speed improvement which looks great.

@ssbarnea
Copy link
Member

I agree with @atugushev too much pip internals. If we want speed, we need changes to pip itself to address these issues. Any pip internal that we use does make our code much harder to maintain.

@ssbarnea ssbarnea closed this Nov 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improvements to functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants