"All-threads allocating garbage" Multithreading Benchmark shows significant slowdown as nthreads increases #33033
Comments
To be a little more specific on the Line 1096 in 050160c
I don't know this code well enough to know why we need to lock the typecache when constructing datatypes; it seems to me that the fast path should be that the type is already in the typecache, so we need read-only access, but I don't actually know how this works. |
@vtjnash, @JeffBezanson: could the lock be put only on the case where the typecache actually needs to be modified? |
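To make the question concrete, here is a minimal sketch of one way to keep locking off the fast path: readers load an immutable snapshot atomically, and only a miss takes the lock. This is an illustration in Julia (1.7 or later, for `@atomic` fields), not a description of how the typecache in the C runtime works; `CowCache`, `writer_lock`, and `get_or_create!` are hypothetical names.

```julia
# Hypothetical copy-on-write cache: published snapshots are never mutated,
# so lookups that hit need no lock at all.
mutable struct CowCache{K,V}
    @atomic snapshot::Dict{K,V}
    writer_lock::ReentrantLock
end

CowCache{K,V}() where {K,V} = CowCache{K,V}(Dict{K,V}(), ReentrantLock())

function get_or_create!(make, c::CowCache, key)
    # Fast path: read-only lookup in the current snapshot.
    snap = @atomic c.snapshot
    haskey(snap, key) && return snap[key]
    # Slow path: take the lock only when the cache has to change.
    lock(c.writer_lock) do
        current = @atomic c.snapshot
        # Another task may have inserted the key while we waited for the lock.
        haskey(current, key) && return current[key]
        updated = copy(current)
        updated[key] = make(key)
        @atomic c.snapshot = updated   # publish the new snapshot
        return updated[key]
    end
end
```

The trade-off in this sketch is that every insertion copies the whole table, which only pays off when hits vastly outnumber misses.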
Could you provide instructions for running this? Yes, it looks like the typecache lock scope could be narrowed somehow. |
Oops, sorry! Yep i've just added them to the readme: https://github.com/RelationalAI-oss/MultithreadingBenchmarks.jl/blob/master/README.md But basically you can just run |
Also, I should have included this earlier, but if you want to explore them i've just added the complete pprof profile files to the experiment results page:
|
I've experienced this effect (poor scaling with Threads relative to Distributed) with some of my GC-heavy projects as well, but obviously not as dramatic as in the extreme case tested here. |
Just taking a guess here from limited understanding of the GC, but maybe it's possible that the amount of time spent waiting to stop all threads is simply increasing because we're having to reach more GC safepoints per GC cycle as the number of threads increases? |
@jpsamaroo Indeed that was our understanding as well for the second chart that I posted above! But for the first one we're measuring wall time, which shouldn't increase (at least by my assumption) because we're still doing the same amount of work (collecting garbage). |
Hitting all GC safepoints shouldn't have to do with how much garbage you're generating, but instead what code you're running. If you're in a ccall, you can't hit a GC safepoint, because they only exist in Julia code. So if a thread is stuck on a 5 second ccall when another thread decides it needs to do a GC cycle, we have to wait at least 5 seconds for the ccall-ing thread to return before it could hit a safepoint and thus be ready to "contribute" to the GC cycle (where contribute currently just means go to sleep). |
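As a small illustration of the point about safepoints depending on the code being run: a loop that never allocates gives the runtime few natural places to pause the thread, and `GC.safepoint()` (available since Julia 1.4) adds an explicit one. The function name `checksum` and the interval of 4096 iterations are arbitrary choices for this sketch, and it does not address the `ccall` case described above.

```julia
# A non-allocating loop with an explicit safepoint every 4096 iterations,
# so a GC requested by another thread is not held up by this one.
function checksum(n)
    acc = 0
    for i in 1:n
        acc = xor(acc, i * i)
        if i % 4096 == 0
            GC.safepoint()
        end
    end
    return acc
end
```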
@jpsamaroo Hrmm, seems plausible, but like @Chakerbh was saying, i don't really see how that alone could make it slower than running with only a single thread. If all-but-one of your threads are blocked waiting for GC, at worst it should slow down to approximately the single-threaded performance, right? Since there's always one thread making progress, which is exactly the thing that's starving the GC in your proposed explanation? |
Is the amount of GC work stable with number of threads, or does more threads mean more stack(s) to scan for roots? |
In my original example in the OP at the top of the thread, the amount of work was constant. I was `@spawn`ing a fixed amount of work, and the only thing I varied in the experiment was the number of threads in JULIA_NUM_THREADS.
I'm not sure how @Chakerbh had this experiment set up, but I assume something similar?
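For readers without the repository at hand, a rough approximation of that setup might look like the following. This is not the benchmark's actual code: `work` and `run_queries` are stand-ins written for this sketch, and the `Any[]` container is just one way to force type-unstable, allocating multiplications.

```julia
using Base.Threads: @spawn

# Stand-in for the benchmark's `work(...)`: the `Any[]` container hides the
# element type, so each multiplication boxes its result on the heap.
function work(n)
    acc = Any[1.0]
    for _ in 1:n
        acc[1] = acc[1] * 1.0000001
    end
    return acc[1]::Float64
end

# Spawn a fixed number of "queries"; the total work does not depend on
# how many threads Julia was started with.
function run_queries(nqueries, n)
    tasks = map(1:nqueries) do _
        @spawn work(n)
    end
    return sum(fetch, tasks)
end
```

Running `@time run_queries(1000, 10^6)` under different `JULIA_NUM_THREADS` values would mirror the shape of the experiment (1000 queries, 1e6 multiplications each), though the absolute numbers will differ from the original benchmark.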
Is the amount of GC work stable with number of threads, or does more threads mean more stack(s) to scan for roots?
I don't know how GC works - does it only scan the active threads, or does it scan all Tasks, even those not currently scheduled?
|
Without having the time to study the details of your code or of Julia's GC, my experience in other systems suggests a possible reason for performance being worse than single-threaded: each thread doing the same work has much the same stack, so the scanning part of the GC work is proportional to the number of threads. Also, depending on the details of the GC and your code, if there are shared structures in your code or in the Julia memory management, are they scanned once, or scanned for each thread that holds a root? Again, that can add to the GC work per thread. And are allocations per thread or shared (and so locked, causing per-thread context switches on allocate)? That's more per-thread overhead. It is entirely possible for multi-threaded performance to be worse than single-threaded when memory management overhead is significant, even if the user work is the same, depending on GC and user code details. |
Fascinating! Thanks for the details. Very interesting points to consider. I don't know the answers to any of those questions. Sorry that my email reply was formatted so poorly - one question that I wrote didn't show up in the GitHub comment: I don't know how Julia's GC works - does it only scan the roots on the Tasks that are currently actively scheduled on threads, or does it scan all Tasks, even those not currently scheduled? If it scans all Tasks, I would imagine that increasing the number of Threads shouldn't increase the amount of work that GC needs to do. But if it only scans the actively scheduled Tasks, I can see what you mean! Very interesting. I would be interested to know the answers to those questions. |
Also, the code for the original post for this issue is available here: |
A (proprietary) GC heavy benchmark I've been using:
1.474915 seconds (45.52 M allocations: 9.335 GiB, 51.95% gc time)
Julia master:
3.927770 seconds (122.22 M allocations: 24.824 GiB, 52.43% gc time)
The regression is likely caused by something else, given we have many more allocations.
0.817221 seconds (24.13 M allocations: 4.687 GiB, 36.16% gc time)
Julia master:
0.836478 seconds (20.58 M allocations: 3.961 GiB, 42.74% gc time)
This was only replacing a few of the allocations (note we're still allocating a ton of memory), but that was enough for a pretty substantial speed improvement.
See also the trivial example from here. Rerunning on Julia 1.9:
julia> @time foo(GarbageCollector(), X, f, g, h, 30_000_000)
19.511832 seconds (30.00 M allocations: 71.526 GiB, 11.61% gc time)
1.3551714877812471e10
julia> @time foo(LibcMalloc(), X, f, g, h, 30_000_000)
3.438116 seconds (1 allocation: 16 bytes)
1.3551714877812471e10
julia> @time foo(MiMalloc(), X, f, g, h, 30_000_000)
2.191588 seconds (1 allocation: 16 bytes)
1.3551714877812471e10
julia> @time foo(JeMalloc(), X, f, g, h, 30_000_000)
2.062791 seconds (1 allocation: 16 bytes)
1.3551714877812471e10
julia> @show Threads.nthreads();
Threads.nthreads() = 36
julia> @time foo_threaded(GarbageCollector(), X, f, g, h, 30_000_000)
221.578470 seconds (1.08 G allocations: 2.515 TiB, 59.08% gc time)
4.878617356012494e11
julia> @time foo_threaded(LibcMalloc(), X, f, g, h, 30_000_000)
7.894183 seconds (224 allocations: 21.062 KiB)
4.878617356012494e11
julia> @time foo_threaded(MiMalloc(), X, f, g, h, 30_000_000)
4.110426 seconds (221 allocations: 20.969 KiB)
4.878617356012494e11
julia> @time foo_threaded(JeMalloc(), X, f, g, h, 30_000_000)
4.310686 seconds (222 allocations: 21.000 KiB)
4.878617356012494e11
julia> (221.578470, 7.894183, 4.110426, 4.310686) ./ (19.511832, 3.438116, 2.191588, 2.062791)
(11.35610792466848, 2.296078142796811, 1.8755468637353374, 2.089734733184312)
julia> versioninfo()
Julia Version 1.9.3-DEV.0
Commit 6fc1be04ee (2023-07-06 14:55 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
Threads: 36 on 36 virtual cores
Compared to the 2x slow down expected, we saw slow downs of
Julia master:
julia> @time foo(GarbageCollector(), X, f, g, h, 30_000_000)
25.309314 seconds (30.00 M allocations: 71.526 GiB, 14.69% gc time)
1.3974206146766459e10
julia> @time foo(LibcMalloc(), X, f, g, h, 30_000_000)
3.502882 seconds (1 allocation: 16 bytes)
1.3974206146766459e10
julia> @time foo(MiMalloc(), X, f, g, h, 30_000_000)
2.428720 seconds (1 allocation: 16 bytes)
1.3974206146766459e10
julia> @time foo(JeMalloc(), X, f, g, h, 30_000_000)
2.196513 seconds (1 allocation: 16 bytes)
1.3974206146766459e10
julia> @show Threads.nthreads();
Threads.nthreads() = 36
julia> @time foo_threaded(GarbageCollector(), X, f, g, h, 30_000_000)
233.422309 seconds (1.08 G allocations: 2.515 TiB, 59.53% gc time)
5.030714212835928e11
julia> @time foo_threaded(LibcMalloc(), X, f, g, h, 30_000_000)
8.116547 seconds (224 allocations: 20.766 KiB)
5.030714212835928e11
julia> @time foo_threaded(MiMalloc(), X, f, g, h, 30_000_000)
4.311725 seconds (219 allocations: 20.609 KiB)
5.030714212835928e11
julia> @time foo_threaded(JeMalloc(), X, f, g, h, 30_000_000)
4.439841 seconds (219 allocations: 20.609 KiB)
5.030714212835928e11
julia> (233.422309, 8.116547, 4.311725, 4.439841) ./ (25.309314, 3.502882, 2.428720, 2.196513)
(9.222782924894764, 2.317105457734517, 1.7753075694192824, 2.0213133270779644)
julia> versioninfo()
Julia Version 1.11.0-DEV.238
Commit 8b8da91ad7 (2023-08-08 01:11 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
Threads: 53 on 36 virtual cores
So, now we have a 9x slowdown instead of 11x from multithreading, so we at least see an improvement on master! However, we'll have to regress the single threaded performance by 9x (or maybe improve the multithreaded performance) before we can claim that "all threads generating garbage causes terrible performance" is a solved problem. I am unhappy with all these multithreaded GC developments; it feels more like marketing nonsense to convince people things are getting better without really doing anything to help. And people eat it up. Note the % GC time didn't even improve on master. There are things that can be done to help common patterns such as this. We are still a long way from being able to close this issue. |
@chriselrod could you maybe open a separate issue with basically the same thing as you have here? Because your case isn't a "multithreading makes the GC bad" case. It's just a "GC bad" case 😄 |
But I'm not sure we can close this anyway because nobody posted updated results here where we do a good job |
I can, but in this case GC was 5x worse or so when multithreaded than when single threaded, while malloc/free scaled fine. |
Yes and no. The performance degrades more, but the cause of the slowness is exactly the same. The fix to the single threaded version (escape analysis) would also fix the multithreaded version. |
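To illustrate what removing such allocations buys, here is a hand-written analogue of the effect escape analysis would aim for: the temporary never leaves the function, so it does not need to live on the GC heap. The functions `sum_vec` and `sum_tup` are made up for this sketch and are unrelated to the `foo` benchmark above; this is a manual rewrite, not the compiler optimization itself.

```julia
# Same computation twice; only the kind of temporary differs.
sum_vec(x) = sum([x, 2x, 3x])   # temporary Vector: heap-allocated, later garbage
sum_tup(x) = sum((x, 2x, 3x))   # temporary Tuple: typically needs no heap allocation
```

After one warm-up call of each, comparing `@allocated sum_vec(1.0)` with `@allocated sum_tup(1.0)` should show the difference; the tuple version creates no garbage for the collector to deal with, on one thread or many.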
% time taken by the GC also increased. |
@kpamnany is actually coincidentally going to take up investigating this benchmark this month, and can hopefully report back with the latest results and an explanation for what's going on! 🙏 ❤️ |
Hey, so Kiran pointed out when he looked into this last month that actually this is fixed! 😮 And indeed, running the benchmark in RelationalAI-oss/MultithreadingBenchmarks.jl#3 now shows positive scaling as you add threads!:
You can see that the speedup isn't linear, but it's pretty close to it! So, whatever was going on here, I think it's finally resolved, and we can consider this issue closed! 🎉 There are still improvements that could be made to the scaling, but I'll call this a big win. :) |
The https://github.com/RelationalAI-oss/MultithreadingBenchmarks.jl package contains benchmarking experiments we've written to measure performance scaling for the new experimental cooperative multithreading introduced in Julia v1.3.
I'm opening this issue to discuss surprising results from our "all-threads allocating garbage" experiment, detailed here:
RelationalAI-oss/MultithreadingBenchmarks.jl#3
In that benchmark, we `@spawn` 1000 "queries", each of which performs `work(...)`, which performs 1e6 multiplications in a type-unstable way, causing O(1e6) allocations.
The experiment then runs this benchmark with increasing values for `JULIA_NUM_THREADS=` to measure the performance scaling on the benchmark as Julia has more threads. The experiment was run on a 48-core (96-vCPU) machine, though the results are similar on my local 6-core laptop.
The experiment shows that, surprisingly, total time increases as the number of threads increases, despite the total work remaining constant (number of queries handled, number of allocations, and total memory allocated):
This seems to be explained by two factors, as outlined in the linked results:
So the GC time is actually taking longer as you add threads, even though there's the same amount of work to do, and something else (maybe the allocation) is also taking longer as you add threads, even though there's the same amount of work to do.
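One way to separate those two components in a quick experiment is `@timed`, which reports both elapsed time and GC time. The sketch below is only an illustration; `alloc_garbage` is a made-up workload, not the benchmark's code, and the measurement shown is for a single call rather than a multithreaded run.

```julia
# A small type-unstable workload: each iteration allocates a short-lived array
# and boxes the running total.
function alloc_garbage(n)
    total = 0.0
    for i in 1:n
        v = Any[1.5, i]
        total += v[1] * v[2]
    end
    return total
end

stats = @timed alloc_garbage(100_000)
gc_seconds = stats.gctime                   # time spent in garbage collection
other_seconds = stats.time - stats.gctime   # everything else, including allocation
```

Repeating this measurement under different `JULIA_NUM_THREADS` values would show whether the growth sits in `gc_seconds`, in `other_seconds`, or in both.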
@staticfloat and I have noticed via profiling (detailed in: RelationalAI-oss/MultithreadingBenchmarks.jl#3 (comment)) that the profiles spend increasing amounts of time in `jl_mutex_wait` as the number of threads increases.
It's not clear to me whether the profiles reflect GC time or not.
To summarize:
This benchmark is getting slower when adding threads, which is surprising. Even if garbage collection / allocation acquired a single global lock, forcing everything to run serially, I would still expect near constant time as you add threads. Instead, the time increases linearly, with a slope greater than 1. So this seems like maybe a bug somewhere?
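The "single global lock" baseline in that argument can be modeled directly. In the sketch below every query runs entirely under one lock, so at most one task makes progress at a time; under that model one would expect wall time to stay roughly flat as threads are added, rather than grow. The names `BIG_LOCK`, `multiply_chain`, `locked_query`, and `run_serialized` are hypothetical and exist only for this illustration.

```julia
using Base.Threads: @spawn

# One lock shared by every query: a deliberately pessimistic model.
const BIG_LOCK = ReentrantLock()

function multiply_chain(n)
    acc = 1.0
    for _ in 1:n
        acc *= 1.0000001
    end
    return acc
end

# Each query holds the global lock for its whole duration.
locked_query(n) = lock(() -> multiply_chain(n), BIG_LOCK)

function run_serialized(nqueries, n)
    tasks = map(1:nqueries) do _
        @spawn locked_query(n)
    end
    return sum(fetch, tasks)
end
```

Timing `run_serialized` under different thread counts gives a reference curve to compare the allocating benchmark against: anything that rises faster than this is doing worse than full serialization.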