Communication costs are quadratic in the number of worker threads #394
They should eventually... but it depends. Right now, our sleep mechanism is not too smart, in that it wakes up all threads whenever work comes in. It's basically still oriented around a binary "on/off" setup, for the most part, although we have more capability to have only "some" threads active than before. The original intention was that they would all wake up whenever work arrives, but some would go to sleep -- the thresholds for that latter step may, however, be too high. Moreover, they currently take turns going to sleep, mostly because it helped to simplify the sleeping logic (originally I was trying to have multiple threads go to sleep independently), so this may exacerbate the problem, particularly in simulation.
Perhaps we also ought to try stealing more than one job from a deque - a common strategy is to steal half of the jobs from another deque. That could reduce the cost of distributing jobs among worker threads quite a bit. I'll try extending deques to support stealing multiple jobs at once and see how it works out in Rayon. Any thoughts about that?
@stjepang That sounds like it would help in at least some cases. For the cases where But worth a try at least from my naive point of view. @nikomatsakis, what do you think? Even better would be to have an analytical cost model of the Rayon core scheduler so
I'm certainly not opposed to giving it a try. As I mentioned to @julian-seward1 on IRC, though, it doesn't strike me as always a win. Parallel iterators, for example, purposefully lay out their tasks so that the first one to be stolen is roughly 50% of the work (and the next one represents 50% of the remaining work, and so forth). This was part of the overall Cilk design -- when possible, it's generally better, I think, than creating a ton of fine-grained tasks up-front, since it allows you to parallelize the effort of doing that parallelization.

For example, with parallel iterators, as we pop tasks off the deque, if we find they are not being stolen, we switch to a purely sequential execution for bigger and bigger groups of items; since it is often more efficient to process a batch of items in a seq loop, that helps to amortize the total overhead (I guess you could imagine trying to pop off more than one task at a time too, to achieve the same effect?). In any case, having the ability to steal (or pop) more than one item at a time would be a great tool in the toolbox, no question.

The other obvious thing to do is to batch up the threads into groups, or a hierarchy, and have them steal from within their group first, before attempting to steal globally. This might eliminate some of the O(n^2) execution.

Something that's unclear to me is how much cost is incurred during the "going to sleep" phase. Ideally, if all tasks are generating parallel opportunities, stealing should be relatively rare. But when there isn't enough work to go around, you will certainly see a lot of cycles spent trying to steal (which doesn't necessarily hurt performance, though it doesn't help; and it certainly doesn't help with power). It seems worth attempting to tune those parameters and see what happens.

Naturally I'm all for analytical modeling, though I'm a bit dubious on its true predictive power. There are so many factors at play.
It would be nice if we could move you away from the model of spawn. Maybe we could apply the idea of stealing a batch just for that injector queue? I haven't looked at coco's memory allocation, but I'm imagining if it could just hand us a whole batch at once.
@cuviper I don't really follow your proposal, can you elaborate on it? In general it seems pretty hard to move away from spawn. We don't know all the work up front, and instead discover it as we explore the DOM top-down. Moreover, the processing order matters for the style sharing cache, so we want to be careful in that regard to maintain breadth-first ordering. The more fundamental issue here is that we seem to have a Thundering Herd problem. The behavior in the Speedometer testcase is not (IIUC) that we saturate 64 threads and then run out of work later. On the contrary, most of the threads presumably spend their entire lifetime trying to steal and never find any work, because we wake up all the threads at once and then burn cycles while they inspect each others' empty queues. It seems like we should wake threads up one at a time, and only wake up thread N when thread N-1 has successfully found a work item to steal. |
OK, the way I was thinking was how to distribute lots of individual work more efficiently. If there's just not enough work to go around, then yes, I agree it's a problem of managing the idle "herd". Right now we use notify_all.
Could we perhaps measure the size of the queue we just stole from? If it holds fewer jobs than the number of threads we'd wake up with notify_all, we do notify_one instead.
I think waking up fewer threads is good, but I'm not sure that's enough on its own. I suspect that what we really want in any case is a kind of "quick ramp-up" -- i.e., start a few threads at first, but escalate to the full herd once we've seen some evidence that there will be a lot of items to come. It'd be worth studying more closely what other threadpools are doing in this respect. From discussions I've had with other implementors, there are a lot of trade-offs involved, and everybody has to ultimately make some choice somewhere.
This seems to me like we're discussing policy and implementation together. It seems to me that there are four things we need to characterise, in terms of
In the case where there is little work to do, it is also important to avoid scattering small
Can you summarise them? |
Indeed. It's a good idea to separate them.
Those are good questions, and I don't know the answers. It seems to me that the best thing would be to try and read into other code-bases and see if we can get a kind of survey of the "menu" that others have attempted. Some code-bases that seem worth investigating (all of which I believe are open source, for some definition thereof):
We can also try getting in touch with the authors and see if they have any tips to share with us. =)
Not in much depth -- these were relatively casual conversations. The basic tradeoff is pretty clear: the slower you ramp up, the more you increase latency, and you have very few signals as to what the program is ultimately going to do. |
cc @Amanieu -- btw, based on our conversation at ECOOP, I thought you might be interested in tracking this issue and/or have useful thoughts. =) |
In Async++ I keep a list of threads that are sleeping. Whenever a new task is added to the pool, it will check if there is a sleeping thread and wake one up. This ensures that there is always a thread available to perform work. There are two places where a thread must do work while waiting: at the root of a worker task, and while
This model is pretty simplistic, but it allows the number of threads to quickly ramp up when new tasks are added. |
In one workload I'm benchmarking, there's only enough work to keep a few threads occupied, but Rayon will completely consume all 16 threads on my processor if given the chance. To be specific, I'm looking at the "zapper_par" benchmark from my upcoming (soon to be announced) library. Is there a plan to work on this issue? I think Rayon is a great library, I just wanted to report an experience I was having!
Recently, I was trying to unstable_sort a 50 GiB dataset in memory on a 48 core machine, and I might have run into this issue. Instead of remaining at 70-100% CPU usage, the machine seemed to bounce between 20-80%. The next time it comes up I will try to record more precise data. |
I'm wondering if this might relate to the performance limitations of parallel rustc compiles on large (e.g. 72-way) systems. Profiles suggest that on such systems rustc spends a huge amount of time doing synchronization/stealing. |
As a side effect of profiling Stylo for https://bugzilla.mozilla.org/show_bug.cgi?id=1371496, I picked up some numbers for Rayon too.
I ran the Stylo benchmark in the abovementioned bug with 1, 4, 16 and 64 worker threads, on Valgrind/Callgrind configured for fine-grain thread interleaving -- basically all threads move forwards together.
Looking at the costs (in instructions executed, and inclusive of callees) for rayon_core::registry::WorkerThread::wait_until, I see a cost increase which looks extreme as the number of threads increases (MI = million insns):

```
p1       137.1 MI
p4       238.4 MI   (1.7 x p1)
p16     3568.7 MI   (15.0 x p4)
p64  177089.5 MI   (49.6 x p16)
```
In the worst case (p64), wait_until and callees use 177 G insns, of which about 129 G are in calls to <coco::deque::Stealer>::steal.
Looking at the code for rayon_core::registry::WorkerThread::steal, it appears that each thread inspects the work queues of all other threads. So Rayon induces communication costs in the underlying machine that are at least quadratic in the number of threads.
Given that this is measured on a simulator that slows down progress hugely, I wouldn't pay too much attention to the exact numbers. But it is clear that Rayon might have scaling problems, especially in cases like this where (I think) there is actually very little work to do, and so the threads spend a lot of time searching for work. (But then why don't they end up going to sleep? Maybe they do. I don't know.)