Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document that parallelizing pylint by giving it only a few files at once will create incorrect type inference leading to a worse experience #9341

Closed
apmorton opened this issue Jan 2, 2024 · 13 comments · Fixed by #9463
Assignees
Labels
Documentation 📗 Needs PR This issue is accepted, sufficiently specified and now needs an implementation Usability Issues related to the usability/UX of pylint

Comments

@apmorton
Copy link

apmorton commented Jan 2, 2024

Bug description

import pandas as pd


def function1():
    """
    some function declared anywhere else in the codebase.
    in another file and never called in my case.
    """
    df = pd.DataFrame()
    df.columns = [1, 2, 3]


def function2():
    """
    some function which sometimes operates on
    a default constructed DataFrame object
    """
    df = pd.DataFrame()
    # pylint incorrectly infers the type of `k` to be int
    df.rename(columns={k: k.lower() for k in df.columns})

EDIT: a simpler repro without pandas

class SomeType:
    def __init__(self):
        self.columns: list[str] = []

def function1():
    """
    some function declared anywhere else in the codebase.
    in another file and never called in my case.
    """
    df = SomeType()
    df.columns = [1, 2, 3]


def function2():
    df = SomeType()
    # pylint incorrectly infers the type of `k` to be int
    return {k: k.lower() for k in df.columns}

Configuration

No response

Command used

pylint repro.py

Pylint output

************* Module repro
repro.py:20:26: E1101: Instance of 'int' has no 'lower' member (no-member)

Expected behavior

No diagnostic

Pylint version

pylint 3.0.3
astroid 3.0.2
Python 3.12.1 | packaged by conda-forge | (main, Dec 23 2023, 08:03:24) [GCC 12.3.0]

OS / Environment

No response

Additional dependencies

pandas 2.1.4 or 1.5.3
@apmorton apmorton added the Needs triage 📥 Just created, needs acknowledgment, triage, and proper labelling label Jan 2, 2024
@Pierre-Sassoulas Pierre-Sassoulas added inference Needs decision 🔒 Needs a decision before implemention or rejection and removed Needs triage 📥 Just created, needs acknowledgment, triage, and proper labelling labels Jan 4, 2024
@Pierre-Sassoulas
Copy link
Member

Pierre-Sassoulas commented Jan 4, 2024

Thank you for opening the issue. I'm hesitating to label this a false positive, because it seems to me that pylint being able to detect that sometime columns can be list[int] (and looking everywhere it can to find that info) bring a lot of value (especially in half typed / untyped code base).

@apmorton
Copy link
Author

apmorton commented Jan 4, 2024

I'm hesitating to label this a false positive

Ack that the behavior may be useful in some contexts, but I would definitely say this is a false positive.
It is not possible for int to be observed in function2, and so the warning pylint emits is not helpful.

Additionally, this causes "spooky action at a distance" because it is not deterministic.

If function1 and function2 are in different files you will get different diagnostics when running:
pylint file1.py; pylint file2.py and pylint file1.py file2.py.

This comes up all the time in large projects which run pylint on subsets of the project in parallel (for example, using pre-commit).

An almost daily interaction I am having right now at work is with someone less knowledgeable about pylint/pre-commit/etc asking why seemingly unrelated changes cause this (or similar) error to appear and fail their build.

@jacobtylerwalls
Copy link
Member

jacobtylerwalls commented Feb 21, 2024

Thanks for the report. I'm hearing two requests. One is to distinguish:

def function2():
    df = SomeType()
    return {k: k.lower() for k in df.columns}

from this, which is an error:

def function2():
    df = function1()  # assume function1 returns `df`
    return {k: k.lower() for k in df.columns}

That's not really on pylint's roadmap, which I know might be frustrating, but pylint/astroid's inference capabilities advertise 'all the values your variables might take' -- in other words, we've inherited a system which collects all the possible values for SomeType.columns and doesn't know what functions you've called on them by the time we lint one of the instances you have in hand.

The other request I'm hearing is to make this more deterministic. That's totally reasonable, but I'm having trouble reproducing.

% cat a.py
class SomeType:
    def __init__(self):
        self.columns: list[str] = []

def function2():
    df = SomeType()
    return {k: k.lower() for k in df.columns}

% cat b.py
from .a import SomeType

def function1():
    de = SomeType()
    de.columns = [1, 2, 3]

% pylint a.py b.py
************* Module pylint.a
a.py:7:15: E1101: Instance of 'int' has no 'lower' member (no-member)

------------------------------------------------------------------
Your code has been rated at 5.00/10 (previous run: 5.00/10, +0.00)

@jacobtylerwalls jacobtylerwalls added Needs reproduction 🔍 Need a way to reproduce it locally on a maintainer's machine and removed Needs decision 🔒 Needs a decision before implemention or rejection labels Feb 21, 2024
@apmorton
Copy link
Author

$ pylint a.py
$ pylint b.py
$ pylint a.py b.py
************* Module a
a.py:7:15: E1101: Instance of 'int' has no 'lower' member (no-member)

The issue is that when linted in isolation these files produce no diagnostic, but when linted together they do.

pylint 3.0.3
astroid 3.0.3
Python 3.11.6 | packaged by conda-forge | (main, Oct  3 2023, 10:40:35) [GCC 12.3.0]

@jacobtylerwalls
Copy link
Member

But my example shows that I tried that, no? I ran pylint a.py b.py and got a no-member.

@jacobtylerwalls
Copy link
Member

Are you defining the class SomeType anew in each file? That could explain the difference.

@apmorton
Copy link
Author

Are you defining the class SomeType anew in each file? That could explain the difference.

The contents of a.py and b.py in my above post are identical to what you showed.

But my example shows that I tried that, no? I ran pylint a.py b.py and got a no-member.

Your example shows only that you ran pylint a.py b.py and got a diagnostic.

The problem I'm trying to convey is that you cannot reproduce the diagnostic when linting either file separately.

That is what makes this inconsistent and causes us problems in our environment.

If you have a large project that uses pre-commit you will get spooky action at a distance where simply adding or removing a completely unrelated file will change diagnostic output because different files are passed to a single invocation of pylint.

I guess the root question is whether or not you consider the diagnostic emitted by the following code to be a false positive or not:

class SomeType:
    def __init__(self):
        self.columns: list[str] = []

def function1():
    """
    some function declared anywhere else in the codebase.
    in another file and never called in my case.
    """
    df = SomeType()
    df.columns = [1, 2, 3]


def function2():
    df = SomeType()
    # pylint incorrectly infers the type of `k` to be int
    return {k: k.lower() for k in df.columns}

I do, because the code in function1 is not modifying anything about the SomeType type, it is changing an instance property - it can never be observed in function2.

I understand fixing this may not be on the roadmap, or may not be feasible given the design of astroid, but can we all agree that it is a false positive?

@jacobtylerwalls
Copy link
Member

jacobtylerwalls commented Feb 21, 2024

Got it, I think I ran one of the pylint a.py runs with the wrong contents and got mixed up. Thanks for bearing with 🙏

I understand fixing this may not be on the roadmap, or may not be feasible given the design of astroid, but can we all agree that it is a false positive?

It would be a false positive for a linter that's control-flow aware. What if SomeType.__init__() calls function1() (edited to return something) and assigns the result to .columns? Or does that conditionally? Then it's no longer a false positive and pylint got it right. I know I'm changing your hypothetical, so whether we call it a false positive or not, I'm just trying to illuminate the attitude behind the original design we inherited. I agree with you that in the example you supplied here, the message is not helpful.


Given that, I think you raise a really good usability point about pre-commit. I don't think we've documented anywhere that if you just run pre-commit on your git diff, you're going to get fewer pylint messages for those specific files than you'd get if you linted your whole project. Linting the whole project is impractical for pre-commit, so maybe we need some sort of "confidence" level that can allow users to silence diagnostics like the one you've illustrated here.

I'm open to that, and would be eager to see someone explore the feasibility. If you don't mind, I'd like to retitle the issue to focus this for a contributor who might want to take this up. Thanks again for the issue. 👍

@jacobtylerwalls jacobtylerwalls changed the title Incorrect type inference leading to false positive Create a pre-commit confidence level Feb 21, 2024
@jacobtylerwalls jacobtylerwalls added Documentation 📗 Usability Issues related to the usability/UX of pylint Needs design proposal 🔒 This is a huge feature, some discussion should happen before a PR is proposed and removed Needs reproduction 🔍 Need a way to reproduce it locally on a maintainer's machine inference labels Feb 21, 2024
@apmorton
Copy link
Author

FWIW, I wasn't even thinking about pre-commit for the git diff subset, but as a CI step run against all files.

pre-commit takes the full list of files and "deterministically" shuffles them before chunking them into command line length limit subsets and running pylint on each subset.

If you add or remove any file from the list, the deterministic shuffling changes, and the group of files passed to pylint changes.

That is what causes the most frustration for us.

@Pierre-Sassoulas
Copy link
Member

Pierre-Sassoulas commented Feb 22, 2024

I feel like not being able to specify that you don't want to parallelize is an issue with pre-commit and not pylint.

That being said, I think more and more that pylint should not be used in pre-commit:

  • pylint require an installed environment so it can't be used with pre-commit.ci (https://github.com/pylint-dev/pylint/blob/main/.pre-commit-config.yaml#L2), so you need a CI job for it
  • pylint use multiple files analysis so pre-commit parallelization make it worse and less deterministic
  • above all, pylint is slow and will always be (duplicate-code is intrinsically a costly thing to check) and committing should be fast

I'm considering adding a disclaimer in the doc to "use pylint in a continous integration job or periodically and do not use pylint during pre-commit or inside your IDE on save, pylint is not meant to do that and will only frustrate you". And adding caveat to the pre-commit doc for those that still want to use pylint in pre-commit.

@jacobtylerwalls
Copy link
Member

pylint use multiple files analysis so pre-commit parallelization make it worse and less deterministic

Unless we're ready to say that running pylint with a list of files is a second-class experience--which it sort of is, but should we bite that bullet?--I had in mind something relatively simple.

When I looked at the root cause yesterday, it was that the information from both files is collated together in ClassDef.instance_attrs. When building the ast's, we could collect slightly more information (what file created that instance attr?) and then in pylint, when going through all the filters to prevent false positives for no-member, we could, assuming the user is running with higher than pre-commit confidence, or with a flag --file-subset-mode or whatever, decide to be quiet when the heterogenous instance_attrs are coming from two different files. Something like that.

@Pierre-Sassoulas
Copy link
Member

I'm a little afraid about the maintenance implication of that pre-commit confidence. (I don't think I understand what it entices and fear that we'll have to think about what happens in pre-commit for all messages, which is scary).

Unless we're ready to say that running pylint with a list of files is a second-class experience--which it sort of is, but should we bite that bullet?--I had in mind something relatively simple.

But pylint analyse all the calls in the code, if you don't analyses the files with a function call in it, you won't have all the possible values that can enter a function. It's kinda expected. I don't think it's a wrong assumption to make or keep. Only we need to warn user that parallelizing pylint deterministically is not easy and they shouldn't try to do it themselves. Also, I searched a little and it seems that pre-commit has a require_serial option that could be be set to true.

@jacobtylerwalls
Copy link
Member

That's fair. Probably not worth putting on the roadmap.

We should probably document that linting project files is a poorer experience than linting a project.

@Pierre-Sassoulas Pierre-Sassoulas added Needs PR This issue is accepted, sufficiently specified and now needs an implementation and removed Needs design proposal 🔒 This is a huge feature, some discussion should happen before a PR is proposed labels Feb 23, 2024
@Pierre-Sassoulas Pierre-Sassoulas changed the title Create a pre-commit confidence level Document that parallelizing pylint by giving it only a few file at once will create incorrect type inference leading to false positive and a worse experience Feb 23, 2024
@Pierre-Sassoulas Pierre-Sassoulas changed the title Document that parallelizing pylint by giving it only a few file at once will create incorrect type inference leading to false positive and a worse experience Document that parallelizing pylint by giving it only a few files at once will create incorrect type inference leading to false positive and a worse experience Feb 23, 2024
@Pierre-Sassoulas Pierre-Sassoulas changed the title Document that parallelizing pylint by giving it only a few files at once will create incorrect type inference leading to false positive and a worse experience Document that parallelizing pylint by giving it only a few files at once will create incorrect type inference leading to a worse experience Feb 23, 2024
@Pierre-Sassoulas Pierre-Sassoulas self-assigned this Feb 23, 2024
Pierre-Sassoulas added a commit to Pierre-Sassoulas/pylint that referenced this issue Feb 25, 2024
Pierre-Sassoulas added a commit to Pierre-Sassoulas/pylint that referenced this issue Feb 25, 2024
Pierre-Sassoulas added a commit to Pierre-Sassoulas/pylint that referenced this issue May 18, 2024
Pierre-Sassoulas added a commit to Pierre-Sassoulas/pylint that referenced this issue May 19, 2024
Pierre-Sassoulas added a commit to Pierre-Sassoulas/pylint that referenced this issue May 19, 2024
…tiple processes

Refs microsoft/vscode-pylint#454
Also add known caveats for custom parallization.

Closes pylint-dev#9341

Co-authored-by: Jacob Walls <[email protected]>
Pierre-Sassoulas added a commit that referenced this issue May 19, 2024
…tiple processes

Refs microsoft/vscode-pylint#454
Also add known caveats for custom parallization.

Closes #9341

Co-authored-by: Jacob Walls <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Documentation 📗 Needs PR This issue is accepted, sufficiently specified and now needs an implementation Usability Issues related to the usability/UX of pylint
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants