
Concurrency turning out useless on codebase & machine #2525

Open
xmo-odoo opened this issue Sep 27, 2018 · 13 comments
Labels
multiprocessing, Needs PR, performance

Comments

@xmo-odoo
Contributor

xmo-odoo commented Sep 27, 2018

This is on a codebase with 260 kLOC across ~3600 files (Python only, according to tokei), on a 2010 MBP (2 cores, 2 HT) running OS X 10.11, under Python 3.6.6 from MacPorts.

Using -j with a number of jobs different from 1 significantly increases CPU consumption (~90%/core), but yields no improvement in wallclock time:

> pylint -j1 *
pylint -v -j1 * 1144.10s user 44.51s system 96% cpu 20:36.81 total
> pylint -j2 *
pylint -j2 * 2386.66s user 117.09s system 184% cpu 22:37.15 total
> pylint -j4 *
pylint -j4 * 3897.49s user 161.62s system 340% cpu 19:50.96 total
> pylint -j0 *
pylint -j * 3850.79s user 155.45s system 341% cpu 19:31.81 total

Not sure what other information to provide.

@PCManticore
Contributor

Wow, that's incredible, thanks for reporting this issue. I wonder if the overhead of pickling the results back from the workers is too big at this point; we'll have to switch our approach for the parallel runner if that's the case.
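For anyone wanting a rough feel for that cost, here's a purely hypothetical micro-benchmark that just times pickling a payload shaped like a large batch of lint messages. The tuple layout and counts are made up for illustration, not pylint internals:

```python
import pickle
import time

# Fake "messages": 200k tuples vaguely shaped like lint results.
# The structure and volume here are assumptions, not what pylint
# actually ships between processes.
messages = [
    ("C0103", f"pkg/module_{i % 400}.py", i, "invalid-name",
     "Variable name 'x' doesn't conform to snake_case naming style")
    for i in range(200_000)
]

start = time.perf_counter()
blob = pickle.dumps(messages)
print(f"pickled {len(blob) / 1e6:.1f} MB in {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
pickle.loads(blob)
print(f"unpickled in {time.perf_counter() - start:.3f}s")
```

If numbers like these come out small compared to the wallclock differences above, the bottleneck is probably elsewhere.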

@belm0
Contributor

belm0 commented Nov 29, 2018

I'm observing this too (OS X, i7 processor). --jobs merely multiplies the CPU time, with negligible effect on wall time.

$ pylint --version
pylint 2.1.1

$ time pylint my_package/
real	1m27.865s
user	1m25.645s
sys	0m1.996s

$ time pylint --jobs 4 my_package/
real	1m17.986s
user	4m14.076s
sys	0m12.917s

@Tenzer

Tenzer commented Feb 17, 2020

I found Pylint got a lot faster by removing concurrency (jobs=1) compared to making it as concurrent as possible (jobs=0). Execution time sped up by 2.5-3x across a number of projects of different code sizes.

As a concrete example: one project has 134k LOC across 1662 Python files, and a Pylint run over all the files dropped from 3m 33s to 1m 30s on average on a dual-core MBP (with HT). CPU utilisation also dropped to less than half, according to CPU time.

I wonder if there are any cases where running Pylint concurrently helps, or if it would be better to disable the feature for now?

@owillebo

Some results on Windows 10 2004, Intel 9850H (6 cores / 12 threads), 32-bit Python, linting matplotlib.
Interestingly, the fastest result comes with roughly half the available threads.
Results are wall-clock durations in seconds.

pylint --version

pylint 2.6.0
astroid 2.4.2
Python 3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:01:55) [MSC v.1900 32 bit (Intel)]

cloc matplotlib

 360 text files.
 352 unique files.
 154 files ignored.

github.com/AlDanial/cloc v 1.86  T=1.09 s (226.7 files/s, 144433.3 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                         221          25052          39919          85792

Running pylint with the default configuration:

pylint -j12 matplotlib  1>NUL
71.7007919
pylint -j11 matplotlib  1>NUL
71.2478186
pylint -j10 matplotlib  1>NUL
69.9589435
pylint -j9 matplotlib  1>NUL
69.8973282
pylint -j8 matplotlib  1>NUL
66.9836301
pylint -j7 matplotlib  1>NUL
67.7956229
pylint -j6 matplotlib  1>NUL
65.0402625
pylint -j5 matplotlib  1>NUL
67.2663403
pylint -j4 matplotlib  1>NUL
73.0464569
pylint -j3 matplotlib  1>NUL
88.9432869
pylint -j2 matplotlib  1>NUL
120.4550162
pylint -j1 matplotlib  1>NUL
238.5004808

@Pierre-Sassoulas
Member

@owillebo thanks for the data. I think intuitively it makes sense that the optimum is 6 threads on a 6-core machine. Apparently this bug is not affecting you.

@owillebo

owillebo commented Sep 27, 2020 via email

@xmo-odoo
Contributor Author

> @owillebo thanks for the data. I think intuitively it makes sense that the optimum is 6 threads on a 6-core machine. Apparently this bug is not affecting you.

Indeed, hyperthreading can lead to better use of the underlying hardware, but if there are no significant stalls (or both hyperthreads are stalled in similar ways) and all the threads are competing for the same underlying units, the hyperthreads are just going to use the same resources sequentially.

And that "can" is so conditional that, given the security issues of their implementation, Intel is actually moving away from HT: the 9th gen only uses HT at the very high (i9) and very low (Celeron) ends; none of the 9th-gen i3, i5 or i7 parts support hyperthreading.

@owillebo

owillebo commented Sep 28, 2020 via email

Pierre-Sassoulas added the Needs PR label on Jul 2, 2022
@DanielNoord
Collaborator

#6978 (comment) is quite an interesting result.

@xmo-odoo
Contributor Author

> #6978 (comment) is quite an interesting result.

That is true but I think it's a different issue: in the original post the CPU% does grow pretty linearly with the number of workers, which indicates that the core issue isn't a startup stall (#6978 shows clear CPU usage dips).

@xmo-odoo
Contributor Author

Also FWIW I've re-run pylint on the original project, though only on a subset (as I think the old run was on 1.x, and pylint has slowed a fair bit in the meantime, plus the project has grown).

This is on a 4-core / 8-thread Linux machine (not macOS this time), Python 3.8.12, pylint 2.14.5.

The subsection I linted is 71 kLOC in 400 files. The results are as follows:

-j0 pylint -j$i * > /dev/null  206.82s user 1.05s system 99% cpu 3:27.90 total
-j1 pylint -j$i * > /dev/null  205.74s user 1.08s system 99% cpu 3:26.85 total
-j2 pylint -j$i * > /dev/null  163.57s user 1.59s system 199% cpu 1:22.77 total
-j3 pylint -j$i * > /dev/null  198.93s user 2.15s system 298% cpu 1:07.29 total
-j4 pylint -j$i * > /dev/null  238.08s user 2.52s system 384% cpu 1:02.55 total
-j5 pylint -j$i * > /dev/null  304.31s user 3.00s system 450% cpu 1:08.26 total
-j6 pylint -j$i * > /dev/null  374.35s user 3.96s system 551% cpu 1:08.61 total
-j7 pylint -j$i * > /dev/null  462.39s user 4.68s system 639% cpu 1:13.04 total
-j8 pylint -j$i * > /dev/null  487.39s user 5.20s system 688% cpu 1:11.56 total
  • pylint does seem to scale to -j2, and there's even a minor gain at -j3 (though far from 50%); beyond that it again spins its wheels and burns CPU with no improvement (the opposite, really)
  • I thought j0 would be equivalent to j8 but apparently it's j1?
  • I'm not entirely sure why j1 costs so much more than j2 (almost 3x the wallclock time, and 20% higher USER), but it is reliably reproducible: I ran each 5 times in a row and they exhibited those behaviours and wallclocks (roughly) every time. In fact, 200-ish USER is on the lower end for j1 (it goes as high as 300), while 160-ish USER is about par for j2.

@olivierlefloch

On M1 Macs, on large codebases, -j0 is equivalent to -j10 and (unsurprisingly, given the split between performance and efficiency cores) seems to perform worse than -j6. This makes it difficult to specify a single value in the shared pylintrc config file for a repository shared between developers using a broad variety of machines, and likely makes -j0 undesirable on recent Apple machines.
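For now, a (hypothetical) per-developer wrapper script can at least keep the value out of the shared pylintrc. A minimal sketch, where the `cpu_count() // 2` heuristic is just an assumption about performance vs efficiency cores, not something the stdlib can actually report:

```python
#!/usr/bin/env python3
# Hypothetical wrapper: pick a machine-appropriate --jobs value at run time
# instead of hard-coding one in the shared pylintrc.
import multiprocessing
import subprocess
import sys

# Assumption: roughly half the logical CPUs are "real"/performance cores.
jobs = max(1, multiprocessing.cpu_count() // 2)
sys.exit(subprocess.call(["pylint", f"--jobs={jobs}", *sys.argv[1:]]))
```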

@xmo-odoo
Contributor Author

@olivierlefloch I don't think that can be solved by pylint (or any other program that auto-detects worker counts): a while back, prompted by the comment preceding yours, I tried to see whether the stdlib had a way to know the number of real cores (as opposed to vcores / hyperthreads) and didn't find one. I don't remember seeing anything for efficiency vs performance cores either.

I think the best solution would be to run under a bespoke hardware layout (make it so pylint can only see an enumerated set of cores of your choice), but I don't know if macOS supports this locally (I don't remember anything similar to Linux's taskset). There is a program called CPUSetter which allows disabling cores globally, but...

Also, it doesn't seem like e.g. multiprocessing.cpu_count() is aware of CPU affinity; however, pylint already uses os.sched_getaffinity, so it should work properly on Linux.
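To illustrate the difference (a minimal sketch, not pylint's actual code): an affinity-aware default only counts the cores the process is allowed to run on, and has to fall back to a plain logical-CPU count on platforms without sched_getaffinity:

```python
import multiprocessing
import os

def default_job_count() -> int:
    try:
        # Linux: respects e.g. `taskset -c 0-3 pylint ...`.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # macOS / Windows: no affinity API in the stdlib, so this still
        # counts hyperthreads (and efficiency cores) as full workers.
        return multiprocessing.cpu_count()
```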

fruch added a commit to fruch/scylla-cluster-tests that referenced this issue Feb 9, 2023
Seems that since we moved to the new builder setup (weaker machines, 4 executors),
CI jobs were stuck on the pre-commit pylint stage, taking 100% CPU
and timing out the 15 min we gave for the pre-commit stage.

It seems like using `-j 1` removes most of the load on the CPU
and it even finishes faster.

Ref: pylint-dev/pylint#2525