Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"files" is wrong when installed without "wheel" #115

Closed
jaraco opened this issue Oct 22, 2020 · 10 comments · Fixed by #437
Closed

"files" is wrong when installed without "wheel" #115

jaraco opened this issue Oct 22, 2020 · 10 comments · Fixed by #437
Labels
enhancement New feature or request

Comments

@jaraco
Copy link
Member

jaraco commented Oct 22, 2020

In GitLab by @blueyed on Mar 17, 2020, 12:32

When "wheel" is not installed/used, installing a package will not contain "RECORD", and it falls back to reading "SOURCES.txt" (https://gitlab.com/python-devs/importlib_metadata/blob/3150ed4da9e1267d0787c6f4c1f8258a26a1dd93/importlib_metadata/__init__.py#L270), which then might result in paths not being locate()able, when a src-based setup is used:

setup(
    ...
    packages=find_packages('src'),
    package_dir={'': 'src'},
)

This is similar to https://gitlab.com/python-devs/importlib_metadata/-/issues/112, but in this case here it could use installed-files.txt from the egg-info.

@jaraco
Copy link
Member Author

jaraco commented Oct 22, 2020

In GitLab by @blueyed on Mar 17, 2020, 12:34

mentioned in commit blueyed/importlib_metadata@5d73789b0619d5ac53dfb2966902c13b24d31e2b

@jaraco
Copy link
Member Author

jaraco commented Oct 22, 2020

In GitLab by @blueyed on Mar 17, 2020, 12:34

mentioned in merge request !114

@jaraco
Copy link
Member Author

jaraco commented Oct 22, 2020

In GitLab by @jaraco on Mar 25, 2020, 15:55

Before adding support for installed-files.txt, I'd like to know what command generates that file. It looks like it's generated in the legacy install.

I'm leaning toward saying that importlib_metadata shouldn't support this use-case, that if someone wants to have metadata support, they should use the non-legacy features of the packaging ecosystem.

What is the impact of this issue? Are there real-world use-cases that experience issues as a result of this non-support?

@jaraco
Copy link
Member Author

jaraco commented Oct 22, 2020

In GitLab by @jaraco on Jun 5, 2020, 15:37

Bump.

@jaraco
Copy link
Member Author

jaraco commented Oct 22, 2020

In GitLab by @jaraco on Sep 23, 2020, 13:41

Happy to revisit when there's additional information available, but for now, I'm declaring "won't fix" without prejudice.

@jherland
Copy link
Contributor

jherland commented Mar 9, 2023

I've recently run into this issue with the Qiskit (meta)-package:

  • The package does not actually contain any (runtime) Python code itself, but serves as a mechanism to install its transitive dependencies (which populate the qiskit package namespace).
  • The package is distributed as a source archive.
  • When installed as an egg, it provides a SOURCES.txt which is incorrect from a runtime POV: it references 3 .py files, a setup.py and two files under test/, none of which are actually installed.
  • The packages includes a top_level.txt which is empty (contains a single newline). This is arguably correct given that it does not directly install any importable packages/modules.

IMHO, there are two issues with how importlib_metadata handles this package:

  1. Because SOURCES.txt does not reflect the installed files, importlib_metadata.files("qiskit") ends up returning paths that don't actually exist on disk. According to setuptools docs, SOURCES.txt should not be relied upon at runtime (although I understand that it is likely used as a last resort in this case).
  2. packages_distributions() returns nonsense for this package: _top_level_declared() does not differentiate between an empty top_level.txt and a missing top_level.txt, which triggers a fallback to _top_level_inferred(). This fallback ultimately ends up looking for *.py paths in SOURCES.txt, which finds the three non-existing paths mentioned above. The end result is that packages_distributions() associates the import names setup and test with this package, neither of which exist, or make any sense.

Some possible ways to fix this:

  • For (1), try installed-files.txt before falling back to SOURCES.txt. This file is written by pip on installation, and is a more accurate source than SOURCES.txt. (If installed-files.txt does not exist for any reason, then fall back to SOURCES.txt as last resort.)
  • For (1), if/when we need to use SOURCES.txt, then at least make sure that only the paths in SOURCES.txt that actually remain after installation are returned.
  • For (2), consider whether it's worthwhile to treat an empty top_level.txt different from a missing top_level.txt. In the above case the empty top_level.txt correctly communicates that this packages does not expose any importable names. Having said that, there might be good reasons for the current behavior.

@jaraco
Copy link
Member Author

jaraco commented Mar 9, 2023

When installed as an egg,

I believe the issue may be rooted in how qiskit is installed.

The preferred way to install Python packages is as a wheel using dist-info metadata. As eggs are deprecated and superseded by wheels, support for egg-info is incomplete and best-effort.

Consider this example where I've installed qiskit from the source archive as a wheel and am able to avoid the undesirable behavior:

 ~ $ pip-run --no-deps --no-binary qiskit --no-cache-dir --use-feature no-binary-enable-wheel-cache --use-pep517 qiskit
Python 3.11.1 (main, Dec 23 2022, 09:28:24) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import importlib.metadata
>>> 'test' in importlib.metadata.packages_distributions()
False
>>> importlib.metadata.files('qiskit')
[PackagePath('qiskit-0.41.1.dist-info/AUTHORS'), PackagePath('qiskit-0.41.1.dist-info/INSTALLER'), PackagePath('qiskit-0.41.1.dist-info/LICENSE.txt'), PackagePath('qiskit-0.41.1.dist-info/METADATA'), PackagePath('qiskit-0.41.1.dist-info/RECORD'), PackagePath('qiskit-0.41.1.dist-info/REQUESTED'), PackagePath('qiskit-0.41.1.dist-info/WHEEL'), PackagePath('qiskit-0.41.1.dist-info/top_level.txt')]

The key here was to pass --use-pep517, which caused the source archive to build as a wheel and produce dist-info metadata. Other parameters to pip install (via pip run) are there to force the source dist install and bypass other warnings.

This approach utilizes the preferred code path for metadata and avoids the undesirable behavior.

Honestly, if it weren't for the fact that editable installs still rely heavily on egg-info metadata, I'd deprecate that format outright.

When I wrote files() to resolve from SOURCES.txt, I did that knowing it was an imperfect approximation for editable installs. I was unaware that installed-files was also a thing :(.

I'm not confident that installed-files.txt is properly supported either. I see now there was some discussion some years ago in pypa/setuptools#371 about installed-files.txt. Ideally, I'd like to rally around supported standards and not add support for inferred interfaces.

Does the --use-pep517 method of installation provide a suitable approach for your use-case?

@jherland
Copy link
Contributor

jherland commented Mar 9, 2023

Yes, installing it with dist-info metadata (either using --use-pep517 or by installing with modern pip in a venv that already has wheel available) does solve the issue (*.dist-info/RECORD is obviously more accurate than *.egg-info/SOURCES.txt).

I 100% agree on using modern packaging practices. However, in this case, I'm not in the position to control the package installation process: I'm working on a tool to find undeclared and unused dependencies in the user's project, and as such I'm using importlib_metadata inside the user's existing venv, and I do not control which packages are installed, or how they were installed.

installed-files.txt seems to have been written by pip since version 0.3 (Jan 2009), and even if it has not been adopted by the rest of the packaging ecosystem, I think the current prevalence of pip makes installed-files.txt useful as a better source than SOURCES.txt when it exists. When it does not exist (e.g. it seems to not be created in the case of editable installs), falling back to SOURCES.txt is still available as a final - albeit imperfect - resort).

Do you foresee problems with any of the three fixes I outlined above? In any case, I'm happy to whip up a PR.

@jaraco
Copy link
Member Author

jaraco commented Mar 9, 2023

All good points. Yes, I think I agree with the recommendation, and I'm delighted to hear you'd like to work on an implementation. Crucial will be to have tests that capture the relevant factors and expectations (including which scenarios apply to which logic branches). I'm slightly tempted to deprecate the use of SOURCES.txt altogether, but that probably should happen separately. Feel free to take a stab at it and let me know if you have any questions. I can be reached in Gitter (except I haven't enabled this project) and Discord.

@FFY00
Copy link
Member

FFY00 commented Mar 10, 2023

I recently touched this code and also think looking at installed-files.txt when available is a good approach 👍

jherland added a commit to jherland/importlib_metadata that referenced this issue Mar 11, 2023
This corresponds to the qiskit[1] meta-package which:

  - does not contain any (runtime) Python code itself, but serves as a
    mechanism to install its transitive dependencies (which populate the
    qiskit package namespace).

  - is distributed as a source archive.

  - includes a top_level.txt which is empty (contains a single newline),
    arguably correct given that it does not directly install any
    importable packages/modules.

  - when installed as an egg, provides a SOURCES.txt which is incorrect
    from a runtime POV: it references 3 .py files, a setup.py and two
    files under test/, none of which are actually installed.

  - when installed (as an egg) by pip, provides an installed-files.txt
    file which is _more_ accurate than SOURCES.txt, since it reflects
    the files that are actually available after installation.

importlib_metadata reports incorrect .files for this package, because
we end up using SOURCES.txt. It is better to use installed-files.txt
when it is available.

Furthermore, as a result of this, packages_distributions() also
incorrectly reports that this packages provides imports names that do
not actually exist ("setup" and "test", in qiskit's case).

This commit adds EggInfoPkgPipInstalledNoModules, a test project that
mimics the egg installation of qiskit, and adds it to existing test
cases, as well as adding a new test cases specifically for verifying
packages_distributions() with egg-info packages. The following tests
fail in this commit, but will be fixed in the next commit:

  - PackagesDistributionsTest.test_packages_distributions_on_eggs
  - APITests.test_files_egg_info

See the python#115 issue for more details.

[1]: qiskit is found at https://pypi.org/project/qiskit/0.41.1/#files
jherland added a commit to jherland/importlib_metadata that referenced this issue Mar 11, 2023
When listing the files in a *.egg-info distribution, prefer using
*.egg-info/installed-files.txt instead of *.egg-info/SOURCES.txt.

installed-files.txt is written by pip[1] when installing a package,
whereas the SOURCES.txt is written by setuptools when creating a
source archive[2].

installed-files.txt is only present when the package has been installed
by pip, so we cannot depend on it always being available. However, when
it _is_ available, it is an accurate record of what files are installed.
SOURCES.txt, on the other hand, is always avaiable, but is not always
accurate: Since it is generated from the source archive, it will often
include files (like 'setup.py') that are no longer available after the
package has been installed.

Fixes python#115 for the cases where a installed-files.txt file is available.

[1]: https://pip.pypa.io/en/stable/news/#v0-3
[2]: https://setuptools.pypa.io/en/latest/deprecated/python_eggs.html#sources-txt-source-files-manifest
jherland added a commit to jherland/importlib_metadata that referenced this issue Mar 11, 2023
As established in previous commits, the SOURCES.txt file is not always
an accurate source of files that are present after a package has been
installed.

One situation where this inaccuracy is problematic is when top_level.txt
is also missing, and packages_distributions() is forced to infer the
provided import names based on Distribution.files. In this situation we
end up with incorrect mappings between import packages and distribution
packages, including import packages that clearly do not exist at all.

For example, a SOURCES.txt that lists setup.py (which is used _when_
installing, but is not available after installation), will see that
setup.py returned from .files, which then will cause
packages_distributions() to claim a mapping from the non-existent
'setup' import name to this distribution.

This commit adds EggInfoPkgSourcesFallback which demostrates such a
scenario, and adds this new class to a couple of relevant tests.
A couple of these tests are currently failing, to demonstrate the
issue at hand. These test failures will be fixed in the next commit.

See the python#115 issue for more details.
jherland added a commit to jherland/importlib_metadata that referenced this issue Mar 11, 2023
Add an extra filter on the paths returned from Distribution.files, to
prevent paths that don't exist on the filesystem from being returned.

This attempts to solve the issue of .files returning incorrect
information based on the inaccuracies of SOURCES.txt.

As the code currently is organized, it is more complicated to write
this such that it only applies to the information read from SOURCES.txt
specifically, hence we apply it to _all_ of .files instead.

This fixes python#115, also in the case where there is no installed-files.txt
file available.

[1]: https://pip.pypa.io/en/stable/news/#v0-3
[2]: https://setuptools.pypa.io/en/latest/deprecated/python_eggs.html#sources-txt-source-files-manifest
jherland added a commit to jherland/importlib_metadata that referenced this issue Mar 11, 2023
This corresponds to the qiskit[1] meta-package which:

  - does not contain any (runtime) Python code itself, but serves as a
    mechanism to install its transitive dependencies (which populate the
    qiskit package namespace).

  - is distributed as a source archive.

  - includes a top_level.txt which is empty (contains a single newline),
    arguably correct given that it does not directly install any
    importable packages/modules.

  - when installed as an egg, provides a SOURCES.txt which is incorrect
    from a runtime POV: it references 3 .py files, a setup.py and two
    files under test/, none of which are actually installed.

  - when installed (as an egg) by pip, provides an installed-files.txt
    file which is _more_ accurate than SOURCES.txt, since it reflects
    the files that are actually available after installation.

importlib_metadata reports incorrect .files for this package, because
we end up using SOURCES.txt. It is better to use installed-files.txt
when it is available.

Furthermore, as a result of this, packages_distributions() also
incorrectly reports that this packages provides imports names that do
not actually exist ("setup" and "test", in qiskit's case).

This commit adds EggInfoPkgPipInstalledNoModules, a test project that
mimics the egg installation of qiskit, and adds it to existing test
cases, as well as adding a new test cases specifically for verifying
packages_distributions() with egg-info packages. The following tests
fail in this commit, but will be fixed in the next commit:

  - PackagesDistributionsTest.test_packages_distributions_on_eggs
  - APITests.test_files_egg_info

See the python#115 issue for more details.

[1]: qiskit is found at https://pypi.org/project/qiskit/0.41.1/#files
jherland added a commit to jherland/importlib_metadata that referenced this issue Mar 11, 2023
When listing the files in a *.egg-info distribution, prefer using
*.egg-info/installed-files.txt instead of *.egg-info/SOURCES.txt.

installed-files.txt is written by pip[1] when installing a package,
whereas the SOURCES.txt is written by setuptools when creating a
source archive[2].

installed-files.txt is only present when the package has been installed
by pip, so we cannot depend on it always being available. However, when
it _is_ available, it is an accurate record of what files are installed.
SOURCES.txt, on the other hand, is always avaiable, but is not always
accurate: Since it is generated from the source archive, it will often
include files (like 'setup.py') that are no longer available after the
package has been installed.

Fixes python#115 for the cases where a installed-files.txt file is available.

[1]: https://pip.pypa.io/en/stable/news/#v0-3
[2]: https://setuptools.pypa.io/en/latest/deprecated/python_eggs.html#sources-txt-source-files-manifest
jherland added a commit to jherland/importlib_metadata that referenced this issue Mar 11, 2023
As established in previous commits, the SOURCES.txt file is not always
an accurate source of files that are present after a package has been
installed.

One situation where this inaccuracy is problematic is when top_level.txt
is also missing, and packages_distributions() is forced to infer the
provided import names based on Distribution.files. In this situation we
end up with incorrect mappings between import packages and distribution
packages, including import packages that clearly do not exist at all.

For example, a SOURCES.txt that lists setup.py (which is used _when_
installing, but is not available after installation), will see that
setup.py returned from .files, which then will cause
packages_distributions() to claim a mapping from the non-existent
'setup' import name to this distribution.

This commit adds EggInfoPkgSourcesFallback which demostrates such a
scenario, and adds this new class to a couple of relevant tests.
A couple of these tests are currently failing, to demonstrate the
issue at hand. These test failures will be fixed in the next commit.

See the python#115 issue for more details.
jherland added a commit to jherland/importlib_metadata that referenced this issue Mar 11, 2023
Add an extra filter on the paths returned from Distribution.files, to
prevent paths that don't exist on the filesystem from being returned.

This attempts to solve the issue of .files returning incorrect
information based on the inaccuracies of SOURCES.txt.

As the code currently is organized, it is more complicated to write
this such that it only applies to the information read from SOURCES.txt
specifically, hence we apply it to _all_ of .files instead.

This fixes python#115, also in the case where there is no installed-files.txt
file available.

[1]: https://pip.pypa.io/en/stable/news/#v0-3
[2]: https://setuptools.pypa.io/en/latest/deprecated/python_eggs.html#sources-txt-source-files-manifest
@jaraco jaraco added enhancement New feature or request and removed wont-fix labels Apr 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants