Checker plugin parallelization breaks algorithm #7263
My bad. I think I see my mistake now. The checks should be done in the reduce step rather than in the close method. So that brings me to my next question: I have now moved the algorithm which does the checks into a separate method. Is this the correct way of doing that?
Thank you for opening the issue. As far as I know you're navigating uncharted territory here. Some pylint checkers should use this but are not right now or have issues (see the multiprocessing label)! So I welcome your insight on a more intuitive design. It seems that the close function should handle the reduce map generically, if I understand right? The base class could be extended if required. (Pinging @doublethefish as they designed and implemented this.)
The map/reduce mixin aside, the multi-jobs aspect of pylint itself isn't really fit for purpose, mainly because it doesn't make runs faster (!). IIRC this is because astroid computes a full ast-graph on each thread before doing the work, so the actual compute-work done by the checkers gets only a marginal gain in a multithreaded runtime (with caveats). Also, the whole parallelized code path is so different from the single-threaded code path that we get differences in the reported stats. As such, I strongly do not recommend anyone use the parallel mode.

One of the problems with the map/reduce functionality is that it also relies on the checker to collate its stats into the calling linter rather than relying on the normal APIs for doing so. The map/reduce functionality is designed to work with the current parallelized system.
hth
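To make the map/reduce hooks discussed above concrete, here is a minimal sketch of a checker plugin using them. This is an illustration, not pylint's documented pattern: the checker name, the message id and the cross-module "analysis" are invented, the exact hook discovery can differ between pylint versions, and the reduce hook is only exercised on parallel runs.

    # Hypothetical map/reduce checker sketch; names and the rule are made up.
    from pylint.checkers import BaseChecker

    class ModuleCollectorChecker(BaseChecker):
        name = "module-collector-example"
        msgs = {
            "W9901": (
                "Example cross-module message",
                "example-cross-module",
                "Illustrative only.",
            ),
        }

        def __init__(self, linter):
            super().__init__(linter)
            self._seen = []  # per-process accumulator

        def visit_module(self, node):
            # Runs for every module this process handles.
            self._seen.append(node.name)

        def get_map_data(self):
            # Called in each worker: return plain, picklable data only
            # (no astroid nodes).
            return self._seen

        def reduce_map_data(self, linter, data):
            # Called once in the main process with every worker's map data;
            # the cross-file analysis belongs here, not in close().
            all_modules = [name for chunk in data for name in chunk]
            # ... perform the real project-wide checks on all_modules ...

    def register(linter):
        linter.register_checker(ModuleCollectorChecker(linter))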
That's very insightful, thank you. Somehow I did not realize we had so much to do before this comment, even though I realized the file filtering was not done efficiently a few days ago in #7220. So thanks, and to sum up, before having "real" multiprocessing we need to:
I agree with @doublethefish that we should fix at least the first two points before trying anything else (the third point would happen in astroid, and it's not going to be a small task), and that multiprocessing is unusable, or at least not very useful, until then.
My view remains that we should parallelize the existing single-process code path rather than maintain a separate one. That is, the for loops in those functions should make adding a basic multiprocess divide-and-conquer pretty painless, as long as the stats and the data generated can be gathered at the end. The major benefit of this is that you have one way to do all the work, instead of two that drift, and therefore a more stable product.
I feel like this should be emphasized a lot more, given how long pylint has had the option. Speaking of documentation, the part where it mentions parallel execution could probably use an update as well.
I'm able to run pylint with the jobs option, though.
Maybe. Although I'm not sure the work has been done to demonstrate/understand the ins and outs here well enough to document it properly. For example, I am not sure when the performance of a parallel run is actually better or worse than a single-process one. That is why I've mentioned before that I don't recommend relying on it. Personally, I stopped using the jobs option altogether.
@doublethefish Your previous comments triggered some thinking about this issue for me. Can we split the ast creation and ast checking into separate functions and only serialise the second part in pylint? The biggest performance benefits would definitely be in parallelising ast creation, but that should be handled by astroid instead of the current (dysfunctional) pylint approach. This would greatly reduce the performance benefit the jobs option can currently provide, but it does allow fixing all issues related to it in the package that we actually control in this repository. We should then create an astroid issue for a "parallelised ast creation API".
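A rough sketch of the split being proposed here, under the assumption that astroid would eventually own any parallelism inside the parsing phase; the file list and the toy "check" are placeholders:

    # Sketch: build all asts first (the phase astroid could parallelise
    # internally), then run the checks over them serially in pylint.
    import astroid

    def build_asts(paths):
        modules = {}
        for path in paths:  # candidate loop for an astroid-side parallel API
            with open(path, encoding="utf-8") as f:
                modules[path] = astroid.parse(f.read(), module_name=path)
        return modules

    def run_checks(module):
        # Toy stand-in for the checkers: flag every function named "tmp".
        return [
            f"line {node.lineno}: suspicious function name 'tmp'"
            for node in module.nodes_of_class(astroid.nodes.FunctionDef)
            if node.name == "tmp"
        ]

    if __name__ == "__main__":
        modules = build_asts(["example.py"])  # hypothetical input file
        for path, module in modules.items():
            for message in run_checks(module):
                print(path, message)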
I imagine there would be several stages of attack, in order:
1 has the fastest route to success. 2-4 could be transparent to pylint itself. If we're not careful the gains here would be minimal, especially in CI/CD situations with a cold cache. 1 and 2 could be worked on asynchronously. Without wanting to sound too grandiose: with all four, given the widespread use of pylint, there would be a significant impact on the energy used by global Python CI/CD processes.
I should also add that, during my attempts to make multiprocessing work, any attempt to return an astroid node in get_map_data did not work (the nodes don't survive being handed from one process to another). So if you plan on sharing the ast with the various processes, it has to be in a form that makes sharing possible. This is what I had to do in my own experiments.
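As an illustration of what a "shareable form" can look like (the fields extracted here are arbitrary): return plain builtins derived from the nodes instead of the nodes themselves.

    import astroid

    module = astroid.parse("def foo():\n    pass\n", module_name="example")
    node = next(module.nodes_of_class(astroid.nodes.FunctionDef))

    # The node itself is a poor candidate for crossing a process boundary:
    # it carries references to its parents, scope, and so on.
    # A shareable form instead: plain data extracted from the node.
    map_data = {"module": module.name, "function": node.name, "line": node.lineno}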
Agree with that: we'd remove a whole class of errors until pylint is actually multi-process capable, and this is the current situation anyway (#2525). Maybe we should add a user warning to make it obvious.
This cannot be overstated: making pylint more efficient is going to yield huge power savings worldwide. Also, we have a lot more documented tooling around performance now than when you started working on this, Franck; see https://pylint.pycqa.org/en/latest/development_guide/contributor_guide/profiling.html
We have a dependency we're waiting on for this.
Regarding the dependency: I agree that the waiting has been quite long, but if we really need to, I can create a fork and make a release with the fix included. Due to the summer period I haven't really had the time to work on it.
I would prefer it if we could focus on 1. The other three options should be handled by astroid.
I have done some digging into how this is currently set up and compared it against how some other tools handle parallel runs. Since we are reusing the linter itself between runs, I thought about doing some refactoring to mimic this, but I don't think that is really feasible, even if we created a separate class for it.
Thanks for sharing, the differences between the two approaches are fascinating.
Yes, I noticed. I don't quite understand why that is yet, but I would love to.
Agreed. The map-reduce design I put in place is an optional mixin that gives us the "self-contained iterable of data to which you can apply a function". So that's a partial solution, covering part of the data. IIRC the mixin is at the Checker level, and the missing piece is handling the messages and the stats at the PyLinter level.
The remaining problems stem from this, I feel. In previous posts I've intimated that this might be an API problem in pylint, but that can also be viewed as a data-access problem (as you say); same thing, different perspective. There are several ways to approach this:
See below.
Yeah, the only issue is that the current design makes the Checker level not really sufficient for splitting the work into processes 😅
Yeah, I think this is ultimately an API problem.
I have no experience here, but I am indeed not sure how feasible this is.
This would probably be the best solution, but I don't think we can do it in a backwards compatible way.
At least the first five of those need to be accessed by the checkers as well.
Going back a few steps, this is why I asked if perhaps splitting the work into processes after the linter has been set up and configured (in check_files) might be better, instead of splitting the work before the linter has been constructed (that is how it works IIRC; please forgive my poor memory here, I don't have time to open the code right now). It all sounds quite complex when put in your list.

I remember that Pierre did a pretty good job of splitting the code into core responsibilities (because I had to resolve a nasty merge conflict in the map-reduce stuff 😭), and it seems some more of that would be helpful. It also seems like there are several major areas of responsibility at the linter level: start-up, run, report; but they're a bit intermingled at the moment.

To map it to something completely different, game engines: the Unreal Engine divides its work into several discrete areas (game-logic, render-logic, shader-logic, physics & animation), and it works so well because it divides and conquers those concerns, with neat channels of work hand-off. Unity is more of a component/task system, which sounds more akin to what we have in pylint today.
We already have two systems; making that explicit would help clarify the design and likely reduce bugs. It is likely that there is a nice and tidy migration path to the next major version.
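For what the linter-level divide-and-conquer could look like in the abstract, here is a generic sketch; it is not pylint's actual check_parallel implementation, and the per-file "stats" are a stand-in:

    # Generic divide-and-conquer sketch: fan files out to worker processes
    # after configuration is done, then merge the per-file stats at the end.
    from collections import Counter
    from multiprocessing import Pool

    def lint_one(path):
        # Stand-in for "run all checkers on one file" returning its stats.
        stats = Counter()
        with open(path, encoding="utf-8") as f:
            stats["lines"] = sum(1 for _ in f)
        return stats

    def lint_many(paths):
        with Pool() as pool:
            per_file = pool.map(lint_one, paths)  # divide
        total = Counter()
        for stats in per_file:                    # conquer / merge
            total.update(stats)
        return total

    if __name__ == "__main__":
        print(lint_many(["example.py"]))          # hypothetical input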
Yeah, that would be best. However, the way checkers are currently designed, we need access to the linter itself. Ideally, a checker would have a much smaller interface to work against.
I have actually been working on this as well by creating a lot more base classes for the checkers, but the coupling runs deep: checkers also emit messages and can read, and even change, linter state while a run is in progress.
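Purely to illustrate the "smaller interface" idea floated above, a hypothetical Protocol naming the handful of things a checker tends to need from the linter; none of this is an existing pylint API:

    from typing import Any, Optional, Protocol

    class CheckerContext(Protocol):
        # Hypothetical narrow surface a checker could depend on instead of
        # the whole PyLinter.
        config: Any          # resolved options, read-only from the checker's side
        current_name: str    # name of the module currently being checked

        def add_message(
            self, msgid: str, line: Optional[int] = None, args: Any = None
        ) -> None:
            ...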
I agree with @doublethefish that we need to make PyLinter more modular. I think we can keep the interface of PyLinter by simply having it call into a more modular class underneath. Inside pylint we would then use the underlying modular class, which is lighter and faster to construct. For example, a first actionable step would be to create a MainChecker-style class for PyLinter to delegate to.
I think the separation of PyLinter into smaller classes will be hard to do without breaking the public API.
But can't we use the modular classes in threads instead of the PyLinter, and then keep the public API like it is right now in PyLinter by using composition and calling the modular class directly? I.e.:

import warnings

class PyLinter:
    def __init__(self, main_checker: MainChecker):
        self._main_checker = main_checker

    def add_message(self, *args, **kwargs):
        warnings.warn("in pylint 3.0...")
        return self._main_checker.add_message(*args, **kwargs)
That wouldn't have any effect on the size of the object to be pickled; in fact it might even get worse, due to differences in the pickling of attributes versus parent classes. As long as there is a reference to some instantiated class on the object, that instance gets pickled along with it.
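A quick, generic illustration of that point (class names are made up): pickling an object that merely holds a reference to a heavyweight object serialises the heavyweight object too.

    import pickle

    class Heavy:
        def __init__(self):
            self.blob = "x" * 1_000_000  # stand-in for expensive linter state

    class Facade:
        def __init__(self, heavy):
            self._heavy = heavy          # kept reference travels with the pickle

    class SlimFacade:
        def __init__(self, summary):
            self.summary = summary       # only the data actually needed

    print(len(pickle.dumps(Facade(Heavy()))))      # roughly a megabyte
    print(len(pickle.dumps(SlimFacade("stats"))))  # a few dozen bytes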
But isn't the point of this also to make it possible for the modular classes to work in isolation for multithreading? (I.e. the size of the pickled data isn't the only concern.)
It's not so much about the size of the data as about the amount of shared memory, which I think will be an issue for any parallel approach.
The way I understand it, what needs to be done is:
What do you think?
I think that sounds about right. In terms of refactoring I'm not sure; I have tried numerous things today and none of them seemed to work. I'm certainly no expert on multiprocessing.
Question
I am building a checker plugin which is supposed to use data from all the files checked to compose a report. However, I am seeing that whenever I run pylint, my plugin is run multiple times, and each time it is using incomplete data to do its tests.
My plugin is already making use of get_map_data and reduce_map_data to merge data between processes. Now I want to know how to avoid doing the actual checks (which are done in the close method) until the last minute. How do I save all data between plugin runs until the last one, so that only the last thread/last run of the checker actually does the work required?
Is this even possible? Is there a way to tell pylint that a checker should not be run in parallel? If not, what do you advise?
Documentation for future user
https://pylint.pycqa.org/en/latest/development_guide/how_tos/plugins.html
Additional context
No response