Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"All-threads allocating garbage" Multithreading Benchmark shows significant slowdown as nthreads increases #33033

Closed
NHDaly opened this issue Aug 22, 2019 · 26 comments
Labels
multithreading Base.Threads and related functionality performance Must go faster

Comments

@NHDaly
Copy link
Member

NHDaly commented Aug 22, 2019

The https://github.com/RelationalAI-oss/MultithreadingBenchmarks.jl package contains benchmarking experiments we've written to measure performance scaling for the new experimental cooperative multithreading introduced in Julia v1.3.

I'm opening this issue to discuss surprising results from our "all-threads allocating garbage" experiment, detailed here:
RelationalAI-oss/MultithreadingBenchmarks.jl#3

In that benchmark, we @spawn 1000 "queries", each of which perform work(...), which performs 1e6 multiplications in a type-unstable way, causing O(1e6) allocations.

The experiment then runs this benchmark with increasing values for JULIA_NUM_THREADS= to measure the performance scaling on the benchmark as julia has more threads. The experiment was run on a 48-core (96-vCPU) machine, though the results are similar on my local 6-core laptop.

The experiment shows that, surprisingly, total time increases as the number of threads increases, despite the total work remaining constant (number of queries handled, number of allocations, and total memory allocated):
all_tasks_allocating-abs_time_plot-96vCPUs

This seems to be explained by two factors, as outlined in the linked results:

Notice that there seem to be two causes of the increase in time for each addition of threads:

  1. The gc_time is increasing in each run, until we hit nthreads > number of physical cores. After that, the gc_time drops to 0, presumably because the GC never gets run at all. This is one surprise: we would expect the GC time to remain constant, since the number of allocations and size of allocated memory remain constant.
  2. The latency for every individual query is increasing in each run -- even though the gc_time drops to 0 between 48 threads and 72 threads. So this increase is (at least in part) independent of the increase in gc_time.

So the GC time is actually taking longer as you add threads, even though there's the same amount of work to do, and also something else is also taking longer as you add threads (maybe the allocation), even though there's the same amount of work to do.

@staticfloat and I have noticed via profiling (detailed in: RelationalAI-oss/MultithreadingBenchmarks.jl#3 (comment)) that the profiles spend increasing amounts of time in jl_mutex_wait as the number of threads increase.
It's not clear to me whether the profiles reflect GC time or not.


To summarize:

This benchmark is getting slower when adding threads, which is surprising. Even if garbage collection / allocation acquired a single global lock, forcing everything to run serially, I would still expect near constant time as you add threads. Instead, the time increases linearly, with a slope greater than 1. So this seems like maybe a bug somewhere?

@NHDaly NHDaly changed the title "All-threads allocating garbageMultithreading Benchmark "All-threads allocating garbage" Multithreading Benchmark shows significant slowdown as nthreads increases Aug 22, 2019
@staticfloat
Copy link
Member

To be a little more specific on the jl_mutex_wait(), it appears that a very large chunk of time is spent right here:

JL_LOCK(&typecache_lock); // Might GC

I don't know this code well enough to know why we need to lock the typecache when constructing datatypes; it seems to me that the fast path should be that the type is already in the typecache, so we need read-only access, but I don't actually know how this works.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Aug 22, 2019

@vtjnash, @JeffBezanson: could the lock be put only on the case where the typecache actually needs to be modified?

@JeffBezanson
Copy link
Member

Could you provide instructions for running this?

Yes, it looks like the typecache lock scope could be narrowed somehow.

@NHDaly
Copy link
Member Author

NHDaly commented Aug 22, 2019

Could you provide instructions for running this?

Oops, sorry! Yep i've just added them to the readme: https://github.com/RelationalAI-oss/MultithreadingBenchmarks.jl/blob/master/README.md

But basically you can just run Pkg.test() on the MultithreadingBenchmarks package to run the benchmarks in the current configuration. Annoyingly, right now you have to edit the configuration in the test/runbench.jl file if you want to make changes, or you can just call the functions in that file yourself at the REPL.

@NHDaly
Copy link
Member Author

NHDaly commented Aug 22, 2019

Also, I should have included this earlier, but if you want to explore them i've just added the complete pprof profile files to the experiment results page:

The complete profiles themselves are here, which you can view interactly in PProf.jl via PProf.refresh(file="..."):
profiles-all-threads-1-2-6-threads.zip

@robsmith11
Copy link
Contributor

I've experienced this effect (poor scaling with Threads relative to Distributed) with some of my GC-heavy projects as well, but obviously not as dramatic as in the extreme case tested here.

@benhamad
Copy link

benhamad commented Jul 1, 2021

We're still facing this when scaling the number of threads. We're producing a constant amount of garbage for the whole program independently from the number of threads.

Screen Shot 2021-06-30 at 17 33 54

As you can see, the time we spend doing gc scale with the number of threads.

Screen Shot 2021-06-30 at 17 34 05

@jpsamaroo
Copy link
Member

Just taking a guess here from limited understanding of the GC, but maybe it's possible that the amount of time spent waiting to stop all threads is simply increasing because we're having to reach more GC safepoints per GC cycle as the number of threads increases?

@benhamad
Copy link

benhamad commented Jul 1, 2021

@jpsamaroo Indeed that was our understanding as well for the second chart that I posted above! But for the first one we're measuring wall-time. Which shouldn't (at least my assumption) increase because we're still doing the same amount of work (collecting garbage).

@jpsamaroo
Copy link
Member

Hitting all GC safepoints shouldn't have to do with how much garbage you're generating, but instead what code you're running. If you're in a ccall, you can't hit a GC safepoint, because they only exist in Julia code. So if a thread is stuck on a 5 second ccall when another thread decides it needs to do a GC cycle, we have to wait at least 5 seconds for the ccall-ing thread to return before it could hit a safepoint and thus be ready to "contribute" to the GC cycle (where contribute currently just means go to sleep).

@NHDaly
Copy link
Member Author

NHDaly commented Jul 1, 2021

@jpsamaroo Hrmm, seems plausible, but like @Chakerbh was saying, i don't really see how that alone could make it slower than running with only a single thread. If all-but-one of your threads are blocked waiting for GC, at worst it should slow down to approximately the single-threaded performance, right? Since there's always one thread making progress, which is exactly the thing that's starving the GC in your proposed explanation?

@elextr
Copy link

elextr commented Jul 3, 2021

Is the amount of GC work stable with number of threads, or does more threads mean more stack(s) to scan for roots?

@NHDaly
Copy link
Member Author

NHDaly commented Jul 3, 2021 via email

@elextr
Copy link

elextr commented Jul 4, 2021

Without having the time to study the details of your code or of Julia's GC my experience in other systems suggests a possible reason for performance being less than single threaded is that each thread doing the same work is having much the same stack, so the scanning part of the GC work is proportional to number of threads.

Also depending on the details of the GC and your code, if there are shared structures in your code or in the Julia memory management, are they scanned once or scanned for each thread that holds a root? Again that can add to the GC work per thread.

And are allocations per thread or shared (and so locked and causing per thread context switches on allocate), thats more per thread overhead.

It is entirely possible for multi-threaded performance to be less than single threaded when memory management overhead is significant even if the user work is the same, depending on GC and user code details.

@NHDaly
Copy link
Member Author

NHDaly commented Jul 5, 2021

Fascinating! Thanks for the details. Very interesting points to consider. I don't know the answers to any of those questions.

Sorry that my email reply was formatted so poorly - one question that I wrote didn't show up in the GitHub comment:

I don't know how Julia's GC works - does it only scan the roots on the Tasks that are currently actively scheduled on threads, or does it scan all Tasks, even those not currently scheduled?

If it scans all Tasks, I would imagine that increasing the number of Threads shouldn't increase the amount of work that GC needs to do. But if it only scans the actively scheduled Tasks, I can see what you mean! Very interesting.

I would be interested to know the answers to those questions.

@NHDaly
Copy link
Member Author

NHDaly commented Jul 5, 2021

@brenhinkeller
Copy link
Contributor

Could be interesting to re-run these benchmarks on main, and there may never perfect solution, but I think we might be able to effectively close this now that parallel GC is well on the way!

relevant progress:
#48600
#48969

@chriselrod
Copy link
Contributor

A (proprietary) GC heavy benchmark I've been using:
Julia 1.9:

  1.474915 seconds (45.52 M allocations: 9.335 GiB, 51.95% gc time)

Julia master:

  3.927770 seconds (122.22 M allocations: 24.824 GiB, 52.43% gc time)

The regression is likely caused by something else, given we have many more allocations.
The main thing of interest here is, this is what I get when replacing a few of the allocations with malloc and free, Julia 1.9:

  0.817221 seconds (24.13 M allocations: 4.687 GiB, 36.16% gc time)

Julia master:

  0.836478 seconds (20.58 M allocations: 3.961 GiB, 42.74% gc time)

This was only replacing a few of the allocations (note we're still allocating a ton of memory), but that was enough for a pretty substantial speed improvement.

See also the trivial example from here.
Note that the multithreaded version would be expected to take about 2x as long, as the multithreaded version runs the single threaded version on every thread, and I'm using two threads per physical core.

Rerunning on Julia 1.9:

julia> @time foo(GarbageCollector(), X, f, g, h, 30_000_000)
 19.511832 seconds (30.00 M allocations: 71.526 GiB, 11.61% gc time)
1.3551714877812471e10

julia> @time foo(LibcMalloc(), X, f, g, h, 30_000_000)
  3.438116 seconds (1 allocation: 16 bytes)
1.3551714877812471e10

julia> @time foo(MiMalloc(), X, f, g, h, 30_000_000)
  2.191588 seconds (1 allocation: 16 bytes)
1.3551714877812471e10

julia> @time foo(JeMalloc(), X, f, g, h, 30_000_000)
  2.062791 seconds (1 allocation: 16 bytes)
1.3551714877812471e10

julia> @show Threads.nthreads();
Threads.nthreads() = 36

julia> @time foo_threaded(GarbageCollector(), X, f, g, h, 30_000_000)
221.578470 seconds (1.08 G allocations: 2.515 TiB, 59.08% gc time)
4.878617356012494e11

julia> @time foo_threaded(LibcMalloc(), X, f, g, h, 30_000_000)
  7.894183 seconds (224 allocations: 21.062 KiB)
4.878617356012494e11

julia> @time foo_threaded(MiMalloc(), X, f, g, h, 30_000_000)
  4.110426 seconds (221 allocations: 20.969 KiB)
4.878617356012494e11

julia> @time foo_threaded(JeMalloc(), X, f, g, h, 30_000_000)
  4.310686 seconds (222 allocations: 21.000 KiB)
4.878617356012494e11

julia> (221.578470, 7.894183, 4.110426, 4.310686) ./       (19.511832, 3.438116, 2.191588, 2.062791)
(11.35610792466848, 2.296078142796811, 1.8755468637353374, 2.089734733184312)

julia> versioninfo()
Julia Version 1.9.3-DEV.0
Commit 6fc1be04ee (2023-07-06 14:55 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
  Threads: 36 on 36 virtual cores

Compared to the 2x slow down expected, we saw slow downs of (11.35610792466848, 2.296078142796811, 1.8755468637353374, 2.089734733184312) for GC, Libc.malloc, mimalloc, and jemalloc.
Of course, the GC's single threaded baseline was already an order of magnitude slower than the better mallocs.

Julia master:

julia> @time foo(GarbageCollector(), X, f, g, h, 30_000_000)
 25.309314 seconds (30.00 M allocations: 71.526 GiB, 14.69% gc time)
1.3974206146766459e10

julia> @time foo(LibcMalloc(), X, f, g, h, 30_000_000)
  3.502882 seconds (1 allocation: 16 bytes)
1.3974206146766459e10

julia> @time foo(MiMalloc(), X, f, g, h, 30_000_000)
  2.428720 seconds (1 allocation: 16 bytes)
1.3974206146766459e10

julia> @time foo(JeMalloc(), X, f, g, h, 30_000_000)
  2.196513 seconds (1 allocation: 16 bytes)
1.3974206146766459e10

julia> @show Threads.nthreads();
Threads.nthreads() = 36

julia> @time foo_threaded(GarbageCollector(), X, f, g, h, 30_000_000)
233.422309 seconds (1.08 G allocations: 2.515 TiB, 59.53% gc time)
5.030714212835928e11

julia> @time foo_threaded(LibcMalloc(), X, f, g, h, 30_000_000)
  8.116547 seconds (224 allocations: 20.766 KiB)
5.030714212835928e11

julia> @time foo_threaded(MiMalloc(), X, f, g, h, 30_000_000)
  4.311725 seconds (219 allocations: 20.609 KiB)
5.030714212835928e11

julia> @time foo_threaded(JeMalloc(), X, f, g, h, 30_000_000)
  4.439841 seconds (219 allocations: 20.609 KiB)
5.030714212835928e11

julia> (233.422309, 8.116547, 4.311725, 4.439841) ./
       (25.309314, 3.502882, 2.428720, 2.196513)
(9.222782924894764, 2.317105457734517, 1.7753075694192824, 2.0213133270779644)

julia> versioninfo()
Julia Version 1.11.0-DEV.238
Commit 8b8da91ad7 (2023-08-08 01:11 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
  Threads: 53 on 36 virtual cores

So, now we have a 9x slowdown instead of 11x by multithreading, so we at least see an improvement on master!
This was achieved by regressing the single threaded performance, awesome!

However, we'll have to regression the single threaded performance by 9x (or maybe improve the multithreaded performance), before we can claim that "all threads generating garbage causes terrible performance" is a solved problem.
Or, maybe, because we had terrible performance to begin with (order of magnitude slower than using malloc), we can claim terrible performance is only exacerbated by multithreading, not caused by it ;).

I am unhappy with all these multithreaded GC developments; it feels more like marketing nonsense to convince people things are getting better without really doing anything to help. And people eat it up.

Note the % GC time didn't even improve on master.
GC does bad on this benchmark, because GC basically guarantees that all newly allocated memory is not in cache, and thus results in extreme latencies on first access. GC's make the rest of the code run slower/do not align with modern CPU architectures.

There are things that can be done to help common patterns such as this. We are still a long ways away from being able to close this issue.

@chriselrod chriselrod reopened this Aug 9, 2023
@gbaraldi
Copy link
Member

gbaraldi commented Aug 9, 2023

@chriselrod could you maybe open a separate issue with basically the same thing as you have here? Because your case isn't a multithreading makes the GC bad. It's just a GC bad 😄

@gbaraldi
Copy link
Member

gbaraldi commented Aug 9, 2023

But I'm not sure we can close this anyway because nobody posted updated results here where we do a good job

@chriselrod
Copy link
Contributor

chriselrod commented Aug 9, 2023

@chriselrod could you maybe open a separate issue with basically the same thing as you have here? Because your case isn't a multithreading makes the GC bad. It's just a GC bad 😄

I can, but in this case GC was 5x worse or so when multithreaded than when single threaded, while malloc/free scaled fine.

@oscardssmith
Copy link
Member

GC was 5x worse or so when multithreaded than when single threaded

yes and no. The performance degrades more but the cause of the slowness is exactly the same. The fix to the single threaded version (escape analysis) would also fix the multithreaded version,

@chriselrod
Copy link
Contributor

yes and no. The performance degrades more but the cause of the slowness is exactly the same.

% time taken by the GC also increased.
But I do agree that the only good way to make a GC faster is to not use the GC.

@NHDaly
Copy link
Member Author

NHDaly commented Aug 10, 2023

@kpamnany is actually coincidentally going to take up investigating this benchmark this month, and can hopefully report back with the latest results and an explanation for what's going on! 🙏 ❤️

@NHDaly
Copy link
Member Author

NHDaly commented Oct 18, 2023

Hey, so Kiran pointed out when he looked into this last month that actually this is fixed! 😮 And indeed, running the benchmark in RelationalAI-oss/MultithreadingBenchmarks.jl#3 now shows positive scaling as you add threads!:

Results (omitted printing latencies column):
5×6 DataFrame
 Row │ nthreads  time_secs  allocs    memory      gctime_secs  thread_counts
     │ Int64     Float64    Int64     Int64       Float64      Vector{Int64}
─────┼───────────────────────────────────────────────────────────────────────────────────────────
   1 │        1   1.23007   40005511  1920583120    0.0433216  [1000]
   2 │        2   0.664655  40005018  1920575216    0.0483965  [511, 489]
   3 │        4   0.422814  40005018  1920575248    0.0921505  [253, 249, 249, 249]
   4 │        6   0.331134  40005018  1920575280    0.108474   [166, 167, 167, 167, 167, 166]
   5 │        8   0.286356  40005018  1920575312    0.110583   [126, 117, 123, 128, 129, 123, 1…
Processed:
5×6 DataFrame
 Row │ nthreads  time_secs  speedup_factor  marginal_speedup  utilization  latency_quantiles_ms
     │ Int64     Float64    Float64         Float64           Float64      Tuple…
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │        1   1.23007          1.0              1.0          1.0       (0.1=>1.16592, 0.5=>1.17721, 0.9…
   2 │        2   0.664655         1.85069          0.850695     0.925347  (0.1=>1.18833, 0.5=>1.2064, 0.9=…
   3 │        4   0.422814         2.90925          0.529279     0.727313  (0.1=>1.24266, 0.5=>1.3384, 0.9=…
   4 │        6   0.331134         3.71473          0.402738     0.619121  (0.1=>1.23654, 0.5=>1.3784, 0.9=…
   5 │        8   0.286356         4.29561          0.290441     0.536951  (0.1=>1.25115, 0.5=>1.41629, 0.9…

You can see that the speedup isn't linear, but it's pretty close to it!
Screenshot 2023-10-18 at 4 39 11 PM
Screenshot 2023-10-18 at 4 40 57 PM

So, whatever was going on here, I think it's finally resolved, and we can consider this issue closed! 🎉

There's still improvements that could be made to the scaling, but I'll call this a big win. :)

@NHDaly NHDaly closed this as completed Oct 18, 2023
@NHDaly
Copy link
Member Author

NHDaly commented Oct 18, 2023

Oh, and the above test was on RAI's internal build of julia 1.9 + some backported patches, including the parallel GC work.

However, on stock 1.9, we see very similar looking results! So it wasn't the parallel GC that fixed this after all! I'm not sure what did, but i'm very curious. :)

Results (omitted printing latencies column):
5×6 DataFrame
 Row │ nthreads  time_secs  allocs    memory      gctime_secs  thread_counts
     │ Int64     Float64    Int64     Int64       Float64      Vector{Int64}
─────┼───────────────────────────────────────────────────────────────────────────────────────────
   1 │        1   1.29346   40005511  1920583120    0.0507414  [1000]
   2 │        2   0.701653  40005019  1920575248    0.0570501  [499, 501]
   3 │        4   0.41991   40005021  1920575344    0.0727596  [252, 251, 248, 249]
   4 │        6   0.345172  40005023  1920575440    0.105949   [170, 167, 165, 166, 165, 167]
   5 │        8   0.287717  40005025  1920575536    0.0993664  [129, 125, 124, 124, 125, 124, 1…
Processed:
5×6 DataFrame
 Row │ nthreads  time_secs  speedup_factor  marginal_speedup  utilization  latency_quantiles_ms
     │ Int64     Float64    Float64         Float64           Float64      Tuple…
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │        1   1.29346          1.0              1.0          1.0       (0.1=>1.20937, 0.5=>1.22906, 0.9…
   2 │        2   0.701653         1.84344          0.843444     0.921722  (0.1=>1.2287, 0.5=>1.27608, 0.9=…
   3 │        4   0.41991          3.08032          0.618439     0.77008   (0.1=>1.29208, 0.5=>1.41265, 0.9…
   4 │        6   0.345172         3.74729          0.333483     0.624548  (0.1=>1.2702, 0.5=>1.45344, 0.9=…
   5 │        8   0.287717         4.49558          0.374148     0.561948  (0.1=>1.29458, 0.5=>1.50181, 0.9…
Screenshot 2023-10-18 at 4 46 43 PM Screenshot 2023-10-18 at 4 46 51 PM

And on master.........:

run benchmark
NUM_QUERIES: 1000
WORK_SIZE: 1000000
1 threads:
1.21075025 secs
2 threads:
0.671581125 secs
4 threads:
0.380064083 secs
6 threads:
0.302459833 secs
8 threads:
GC error (probable corruption)
Allocations: 937122657 (Pool: 937119415; Big: 3242); GC: 1953

!!! ERROR in jl_ -- ABORTING !!!

It was looking roughly the same until it hit a GC error.. 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multithreading Base.Threads and related functionality performance Must go faster
Projects
None yet
Development

No branches or pull requests