Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metadata-only resolve with pip download --dry-run --report! #10748

Closed

Conversation

cosmicexplorer
Copy link
Contributor

@cosmicexplorer cosmicexplorer commented Dec 25, 2021

Hello! This is the beginning of an attempt to follow up on the gratuitous speedup promised in #7819. While I still need to spend more time on parallelizing downloads in another branch, I'd like to propose this change first as it's totally orthogonal to downloading files!

Problem

With this change, users can use pip's powerful new resolver to extract metadata from a package resolution, such as what dependencies each package requires. This is currently not possible in pip or any other tool I am aware of.

Getting more specific, this result can then be processed by external tools to:

  • generate a lockfile.
  • check whether transitive dependencies would introduce a conflict.
  • download packages directly from download URLs without having to traverse PyPI again.

#53 has a discussion in greater depth on this proposed feature.

Solution

Add a pip download --report flag which can produce a JSON output file to examine the relationships between dependencies during pip download. This PR also implements PEP 658 support, allowing pip download --report to test PEP 658 compliance.

@cosmicexplorer
Copy link
Contributor Author

I think this needs:

  • test cases
  • bikeshedding over the flag names / whether two separate flags are appropriate (I think so)

@uranusjr
Copy link
Member

This should use PEP 658 unless there are explicit reasons otherwise.

@pradyunsg
Copy link
Member

pradyunsg commented Dec 25, 2021

We can also utilise that in a follow up, which adds support for PEP 658 across the board (i.e. allowing regular installations to also benefit from that). I don't think that PEP's use should be a blocking concern for this PR -- but it's definitely a really good idea to also utilise that.

@cosmicexplorer
Copy link
Contributor Author

@uranusjr: absolutely will do! I assumed there was prior work on how to frame this change and am ready to conform to it. Looking that up right now. Thank you @pradyunsg for help on framing that.

@cosmicexplorer
Copy link
Contributor Author

Ok, remembering the content of that PEP and checking out https://www.python.org/dev/peps/pep-0658/#specification -- @uranusjr I'm thinking that it might be appropriate to:

  1. Determine whether the data-dist-info-metadata attribute is present.
  2. If so, append .metadata to the download URL to retrieve the metadata file and add that to the info printed out by --print-download-urls.
  3. If the hash of the metadata file is provided, add that to the info that is printed out by --print-download-urls.

Separately, I've realized I shouldn't be keying the output of --print-download-urls by req.name, but rather by the stringification of the req.

@pradyunsg does that sound like a reasonable approach? I plan to stick with this work until done, so I can promise to do that in a followup as well if appropriate.

@cosmicexplorer cosmicexplorer marked this pull request as draft December 26, 2021 10:35
@cosmicexplorer
Copy link
Contributor Author

@uranusjr: made a ton of progress on PEP 658 support and I think it's feasible to include in this PR! There are a few hacks I made in the most recent commit, so I just converted this to a draft PR.

I don't believe too many changes should be necessary to remove the hacks, and I think in fact that we will actually end up converting some of the existing Link attribute parsing to a more correct form in the process! Will ping again when I think this is ready for review.

@cosmicexplorer cosmicexplorer force-pushed the metadata-only-resolve branch 12 times, most recently from 72d8357 to 8c755e8 Compare December 28, 2021 13:01
@cosmicexplorer cosmicexplorer changed the title Metadata only resolve metadata-only resolve with --print-download-urls and --avoid-wheel-downloads Dec 28, 2021
@cosmicexplorer cosmicexplorer marked this pull request as ready for review December 28, 2021 14:09
@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Dec 28, 2021

Ok, I've marked this PR as ready for review again. I think the bits that:

  • manage mapping Requirements to Links, and
  • implement PEP 658 support

are pretty readable and don't modify the surrounding code too much.

You'll note I moved some functions from collector.py to link.py, since I was able to move a lot of the Link parsing into the Link.from_element() function, which is then called exactly once from collector.py. I also found that a method _clean_link() from collector.py was only used in testing, so I moved it into test_collector.py. Please let me know if that seems reasonable.

The two issues I'd like some advice on relate to testing:

  • test_from_link_vcs_with_source_dir_obtains_commit_id(), test_from_link_vcs_without_source_dir(), and test_should_cache_git_sha() are failing in CI on multiple python versions including 3.10, which I can't reproduce locally with nox -s test-3.10. I will try using an ubuntu docker image to repro the issue next--is there anything else I should try?
  • It's not clear to me how to add tests for the final user-facing change made by this PR, namely the actual flags --avoid-wheel-downloads and --print-download-urls, since there is currently no testing for the download command. Is there any command test that checks files output by pip that I can use as a reference, or should I try to write that part myself? I am totally ready to do so if necessary.

Thanks for any input! I will try to address the two issues I mentioned above presently.

@pfmoore
Copy link
Member

pfmoore commented Dec 28, 2021

Can the types of situation where this PR would be useful be summarised in a general way, without reference to specific examples? As someone who's never needed anything like this, I'm a bit unclear when (and how) someone would find this useful. What I'd like to be clear about is how we evaluate the benefit of a change like this, which has a visible cost in terms of (code and maintenance) complexity.

I'm also concerned about the user interface here. Having a pip download --but-dont-actually-download command seems conceptually confusing, and gives me the sense that the code is being added wherever it can be made to fit, rather than trying to achieve an intuitive and understandable user interface. (Note that a lot of pip has been developed this way, so I'm not suggesting this PR is uniquely bad in this regard, but I do think that at some point we need to stop and re-think how we integrate new features like this into our UI).

@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Dec 28, 2021

@pfmoore:

Can the types of situation where this PR would be useful be summarised in a general way, without reference to specific examples? As someone who's never needed anything like this, I'm a bit unclear when (and how) someone would find this useful. What I'd like to be clear about is how we evaluate the benefit of a change like this, which has a visible cost in terms of (code and maintenance) complexity.

With this change, users can use pip's powerful new resolver to extract metadata from a package resolution, such as what dependencies each package requires. This is currently not possible in pip or any other tool I am aware of.

Getting more specific, this result can then be processed by external tools to:

  • generate a lockfile.
  • check whether transitive dependencies would introduce a conflict.
  • download packages directly from download URLs without having to traverse PyPI again.

#53 has a discussion in greater depth on this proposed feature.

I'm also concerned about the user interface here. Having a pip download --but-dont-actually-download command seems conceptually confusing, and gives me the sense that the code is being added wherever it can be made to fit, rather than trying to achieve an intuitive and understandable user interface.

This totally makes sense. I think shoving this into download was a hack. In #53 (comment) I propose moving this functionality into a new command pip resolve, and after some work cleaning up the implementation here, it's actually pretty easy to see how I would implement it in a new command. I think a strong argument for this is that --avoid-wheel-downloads without --print-download-urls doesn't seem to make any sense.

So I think if I move this into a new pip resolve command, that would avoid introducing confusing new flags, and probably require no changes to any other files except download.py.

What are your thoughts on that approach?

@pradyunsg
Copy link
Member

pradyunsg commented Dec 28, 2021

That sounds like something to put into #53's resolve command; which would be what I'd prefer.

@pradyunsg
Copy link
Member

The new command would need:

  • code changes (register the command, implement it in a new resolve.py file, and test it to the extent that you want.
  • documentation pages added for it (html and man)
  • more?

@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Dec 28, 2021

@pradyunsg:

register the command, implement it in a new resolve.py file

This was just done in c78bb3d!

  • test it to the extent that you want.
  • documentation pages added for it (html and man)
  • more?

I will look into doing the above (as well as covering other goals of #53's resolve command that I may have missed) presently!

@pfmoore
Copy link
Member

pfmoore commented Dec 28, 2021

Thanks, @cosmicexplorer that was exactly the explanation I needed. I agree this would be good, and I like @pradyunsg's suggestion that this should go under pip resolve.

Personally, I wish that functionality like this could be separated out into more modular tools, because I don't see this as being a core piece of pip functionality, rather it's a piece of functionality that needs to be in pip just to re-use chunks of pip's machinery. But that's a whole other discussion, and not for now.

@sbidoul
Copy link
Member

sbidoul commented May 2, 2022

Thanks for your feedback @groodt ! I certainly hope this does not come across as any kind of competition.
My PR started merely as a way to illustrate my review comments - which I hope were not too heavy.

Now if I get to continue this work (which I'm happy to, as I sincerely believe this is going to be very valuable for the community), I'll definitely reuse part of the very valuable work done here (with proper credits of course).

There might also be use cases of @cosmicexplorer or others that the slightly different format I propose may not address, and I'm eager to get feedback about that too.

@cosmicexplorer cosmicexplorer force-pushed the metadata-only-resolve branch from 7cb145a to 18bf0f8 Compare May 10, 2022 07:03
@pypa-bot pypa-bot removed the needs rebase or merge PR has conflicts with current master label May 10, 2022
@cosmicexplorer cosmicexplorer force-pushed the metadata-only-resolve branch from 18bf0f8 to dc36f0e Compare May 10, 2022 07:04
- create LinkWithSource to retain attrs from InstallRequirement
- add tests for report output for top-level requirements
- add tests for more of the report JSON format
- add passing tests for JSON report output including PEP 658!
- add docstrings to several classes and functions, including tests!
- move the --report implementation into resolvelib
- use an abstract base class instead of a Union for InfoType
- use frozen dataclasses for InfoType subclasses
these two objects are intended to be fungible, but play different roles, since LinkHash actually
parses its input and ArchiveInfo does not. ArchiveInfo's JSON (de)?serialization does not employ
hash name or value validation, while LinkHash does not offer a JSON serialization.
@cosmicexplorer cosmicexplorer force-pushed the metadata-only-resolve branch from dc36f0e to c068eb5 Compare May 10, 2022 07:06
@cosmicexplorer
Copy link
Contributor Author

Thanks so much to @ofek for pinging me on twitter about this!! I absolutely have the time to get this mergeable in the next few weeks. I just fixed the merge conflicts and tests are passing again locally, so I will now review the thoughtful proposals from others in this thread.

@cosmicexplorer
Copy link
Contributor Author

@sbidoul:

At first sight, I'd say direct_url.py and ArchiveInfo in particular can be extended to do hash validation if needed.
...
As a general note, if the report composes existing standards as much as possible (such as PEP 610, PEP 566 json metadata), it will be easier to consume downstream.

I'll look to do this. I believe that some of the reason for my decision to add functionality to link.py was to avoid an import cycle, and I think the other part of it was just wanting to get a working implementation of PEP 658 before changing something else. However, since I'm thinking of basing this PR off of #10771 (see below), we may be able to avoid this conundrum entirely.

As I mentioned above, I think it will be easier to review if you could at some point split this PR in smaller ones each addressing a distinct concern (i.e. the report, the dry-run part, PEP 658) which are mostly orthogonal to each other. Thinking about them independently may also help achieving the best possible design.

This seems absolutely correct, and I was going to immediately agree to it, before realizing that there is another possible split we could do in this work which relies on #10771 (where we would split out dist-info-metadata extraction instead). See below.

@groodt:

After carefully reading the comments from @cosmicexplorer @sbidoul and @pfmoore I think moving this behaviour to install makes the most sense.

I really appreciate your thoughtful comment. I now agree with both of you, and have acknowledged this in #10771 (comment). The below section came from thinking about how to reformulate this PR to augment the much smaller and nicer implementation in #10771, instead of being in conflict.

Comparison of Approaches to --dry-run

In that regard, have you had a chance to look at #10771? This is just an experiment but I can't help thinking the implementation approach is simpler. But I also can't help thinking I'm missing something with that approach.

I would absolutely prefer the much smaller implementation in that PR (which wouldn't have caused (minor) merge conflicts as in this one, for example). When I looked it over to see if you were missing anything, I realized there just seems to be a difference between our conception of --dry-run, which might motivate using both of our solutions:

Proposal: first install --report, then make --dry-run faster

So my proposal here is:

  1. Switch our focus to use Installation/resolution report (aka pip install --dry-run --report) #10771 as the PR that initially implements the --report json output, keeping the --dry-run and --ignore-installed options. This means we will have a PR that is much easier to review for maintainers, and which will expose most of the same functionality for end users as this one.
  2. Make a new PR that moves dist-info-metadata extraction into the implementation of --use-feature=fast-deps. PEP 658 support would then be totally orthogonal to the resolution report.
    • We would use dist-info-metadata whenever available, and fall back to LazyZipOverHTTP otherwise.
  3. Make this PR branch off of Installation/resolution report (aka pip install --dry-run --report) #10771, and use the code in this PR to make pip install --dry-run avoid downloading any dists.

@uranusjr: is it appropriate to consider dist-info-metadata purely a way to speed up resolves? Or is there a reason we'd want to use it outside of --use-feature=fast-deps? It seems like the strategy of relying on PEP 658, but falling back to a heuristic like LazyZipOverHTTP or #8929 if not available, would mean --use-feature=fast-deps could probably be stabilized sooner.

@cosmicexplorer
Copy link
Contributor Author

So to be clear on next steps after that lengthy comment, I'm planning to:

  1. split out PEP 658 / dist-info-metadata support into a new PR, which uses it to implement --use-feature=fast-deps.
  2. rebase this PR off of Installation/resolution report (aka pip install --dry-run --report) #10771, so that pip install --dry-run avoids downloading any dists, and pip install --report also includes download URLs along with each distribution's metadata.

@sbidoul
Copy link
Member

sbidoul commented May 11, 2022

Thanks for your detailed feedback @cosmicexplorer. It looks like we have a path forward !

Regarding PEP 658 I totally agree it can and should be implemented independently. In my mind it is an optimization of the metadata preparation step when the index provides dist-info-metadata, and then delay the actual downloading of the sdist or wheel. That makes me think it will imply some refactoring of the download process (some kind of "download on demand" mechanism, not sure).

@cosmicexplorer
Copy link
Contributor Author

That makes me think it will imply some refactoring of the download process (some kind of "download on demand" mechanism, not sure).

It turns out that this refactoring was already done for the purpose of --use-feature=fast-deps by @McSinyx! I still want to implement parallel downloads on top of that in a separate change soon.

@pradyunsg
Copy link
Member

@cosmicexplorer What's the next piece here, with #11111 resolved? IIUC, we need to now make --dry-run faster?

@hrfuller
Copy link

hrfuller commented Oct 4, 2022

I was talking to @jjhelmus about a way to make --dry-run faster...apparently he discovered that pip install --dry-run --report downloads wheels after the pep658 resolve to construct the report. I think he may have a patch to generate the report from the resolve metadata only.

@cosmicexplorer
Copy link
Contributor Author

I think he may have a patch to generate the report from the resolve metadata only.

@jjhelmus @hrfuller I would love to see that patch! That's definitely what I wanted to accomplish here.

What's the next piece here, with #11111 resolved? IIUC, we need to now make --dry-run faster?

@pradyunsg Sorry for writing such a sprawling comment above! Yes, I think that would be the way to finish up this workstream.

I suspect we may find a speedup if we can implement some form of pipelining, where we pull down wheel metadata or execute setup.py for an sdist in parallel with the resolution. I can vaguely see a way to make this work nicely with coroutines, but that might end up being a big project.

I think a conclusive win would be to avoid downloading wheels as well with --dry-run, and we can think about bigger changes separately from that.

@jjhelmus
Copy link

As @hrfuller mentioned I have been exploring metadata only resolves. What I found was that with the changes from #11111 metadata only resolves are possible but at the end of the resolve the prepare_linked_requirements_more method triggers a download of wheels for the resolved distribution. This allows the existing logic to report information (names, version, metadata, and download URLs and hashes) about the resolved distributions.

With the changes in #11512, this information is provided by the .metadata files directly and no wheels need to be downloaded.

Thanks for your work on this @cosmicexplorer . I was really excited when I discovered that this was possible.

@sbidoul
Copy link
Member

sbidoul commented Jan 22, 2023

I this we can close this, as it is covered by pip install --dry-run --report? Let me know if not.

@sbidoul sbidoul closed this Jan 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 7, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
needs rebase or merge PR has conflicts with current master
Projects
None yet
Development

Successfully merging this pull request may close these issues.