Scheduler work stealing. #3095
Comments
I don't have any answers right now, but I'm going to drop this link here: http://supertech.csail.mit.edu/papers.html. These are all the Cilk papers, many about work stealing.
One thing: not having to solve these kinds of annoying---but solved---implementation decisions is the reason I thought it made sense to consider basing our implementation on a well-tested scheduler, like TBB. I still basically think we ought to try prototyping that. That said, there are existing implementations that do solve these kinds of problems, and to the extent allowed by law we can also peek at what they do. I did a few things when I was implementing my own work-stealing scheduler a while back, but I don't remember what worked well. I think I was never fully satisfied with this particular aspect of my design, as there was a potential race in which a thread might go to sleep just as new work was being posted and hence sit idle. Of course that thread would wake if yet more work were posted, or if a certain timeout elapsed, so in practice this wasn't a big deal. But I imagine there are more elegant solutions out there.
Hmm. So on IRC I said I'd like you to clarify the bug in terms of sched vs thread, and I'll repeat that request here in more detail. Here is my (possibly wrong) understanding of the current system: by default there is one sched. It's multi-threaded and the threads all share a pool of tasks. That may involve too much locking, but it already means that most tasks auto-balance across idle cores. If a user makes multiple scheds, it's generally because they want to pin a task to a thread, or to a smaller set of threads. That's the whole reason the API for multiple scheds exists. So in these cases, stealing tasks between scheds is incorrect, by definition. If I'm right about those assumptions, then I don't really see what this bug means, except maybe taking fewer locks in the multi-threaded sched case. Am I misinformed about how the current system works?
I had assumed that this referred to stealing within the main scheduler? It seems like we ought to have each scheduler map to a pool of threads (which possibly has size 1). Tasks may move around within this pool however we like but do not cross pools. Also, there is lots of interesting work on scheduling for better locality. In particular for Servo, Leo Meyerovich's work at Berkeley on parallelizing layout uses a variation on work stealing: they first do classic work stealing, but they record which processor ended up processing which nodes in the tree, and in later phases they just replay this. Or something like that. He claimed it was essential for good performance, as classical work stealing had too much overhead. So I'm sure we'll be tweaking this as we go to give users more control!
@brson ooh, useful. I haven't looked at the Cilk papers yet, but the "worst page-replacement policy" paper (well, the abstract) is amazing. @nikomatsakis I am still not sure what the nature of the problem is. If the answer is "Well yeah, you just suck it up and bounce on a global lock for every lifecycle operation", that wouldn't take much effort at all to implement with what we've got. If, on the other hand, there are clever heuristics that achieve good balancing while avoiding the global lock, that wheel would be a lot worse to reinvent. @graydon No, threads within a sched do not share a pool of tasks; each thread has its own pool. The benefit we'd lose by changing to what you described is that the task block/wakeup paths only need a per-CPU lock (sched_loop->lock), rather than the global lock needed for spawn and exit (scheduler->lock). Basically, what you described is what we want; the question is how to implement it right. @nikomatsakis Hmm, nodes in trees is getting a lot closer to data-aware scheduling, which is beyond the scope of the runtime scheduler. It might be easy to make the runtime expose a "cpu affinity" option and have a data-aware library make use of it.
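(Editorial aside: a rough, purely illustrative Rust rendering of the locking split just described; the actual runtime types, rust_sched_loop and rust_scheduler, are C++, and the names below are made up.)

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

struct Task;

struct SchedLoop {
    // Per-CPU run queue: task block/wakeup only needs this lock.
    queue: Mutex<VecDeque<Task>>,
}

struct Scheduler {
    // Scheduler-wide state: spawn and exit currently go through this lock.
    loops: Mutex<Vec<Arc<SchedLoop>>>,
}
```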
@bblum I don't know what the best solutions are either. Clearly busy-looping trying to steal isn't it. I think what I had before was pretty straightforward: basically, a shared boolean flag that indicated whether there were any idle threads, combined with a linked list of those idle threads. After they failed to steal, threads would acquire a lock, atomically add themselves to the list, and set the flag. When producing work, you would check the flag and, if it was true, wake up an idle thread and deliver the work item directly to it rather than pushing it onto your queue. However, I don't claim this was particularly smart; as I said, I was never fully satisfied with it. I don't know what sort of protocols are used in other systems; it feels like you can do better, though!
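(Editorial aside: a minimal sketch of that kind of flag-plus-idle-list protocol, written against today's std primitives rather than the runtime internals of the time; `IdleList`, `Task`, and the method names are hypothetical.)

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Condvar, Mutex};

// Hypothetical task type standing in for whatever the scheduler runs.
struct Task;

struct IdleList {
    // Set whenever at least one worker is parked, so producers can skip
    // the lock entirely in the common case where everyone is busy.
    any_idle: AtomicBool,
    // (number of parked workers, tasks handed directly to them)
    inner: Mutex<(usize, VecDeque<Task>)>,
    cond: Condvar,
}

impl IdleList {
    // Called by a worker after it has failed to steal anything.
    // The race described above still exists: a producer may have read
    // `any_idle` just before we set it, so real code would re-check the
    // run queues (or use a timeout) before actually sleeping.
    fn park_until_work(&self) -> Task {
        let mut guard = self.inner.lock().unwrap();
        guard.0 += 1;
        self.any_idle.store(true, Ordering::SeqCst);
        loop {
            if let Some(task) = guard.1.pop_front() {
                guard.0 -= 1;
                if guard.0 == 0 {
                    self.any_idle.store(false, Ordering::SeqCst);
                }
                return task;
            }
            guard = self.cond.wait(guard).unwrap();
        }
    }

    // Called by a producer. Returns the task back if nobody was idle,
    // so the caller can push it onto its own local queue instead.
    fn try_hand_off(&self, task: Task) -> Option<Task> {
        if !self.any_idle.load(Ordering::SeqCst) {
            return Some(task); // fast path: no lock touched
        }
        let mut guard = self.inner.lock().unwrap();
        if guard.0 == 0 {
            return Some(task); // raced with the last sleeper waking up
        }
        guard.1.push_back(task);
        self.cond.notify_one();
        None
    }
}
```

The point of the flag is exactly what the comment describes: the producer's fast path stays lock-free while the system is busy, and the lock is only touched when someone is actually idle.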
@nikomatsakis "when you had before" - what system was that for? it sounds like that's analogous to the "global lock" approach i was mentioning - producing work is task-wakeup (always), and becoming idle is task-block-if-it-was-your-last-task. it sounds like becoming idle was the complicated path in your system, which is in keeping with the point of work-stealing that busy cpus not be burdened with it. i think there is probably not a way to avoid global locking in the wakeup case. i suspect tasks blocking can bypass the global lock if it wasn't the last task on that cpu. i will also note that the cpu affinity i mentioned above is a good reason to keep one-runqueue-per-cpu instead of a global pool of tasks. |
@bblum This was a scheduler I did as part of my PhD work. Ultimately the system relied on a global lock; the flag was used to avoid touching the lock in the common case where lots of tasks are being produced and the system is generally operating at a good level of throughput (in which case, threads are not idle).
Oh, @bblum, regarding the data-specific stuff, I am envisioning that there will be custom schedulers tailored to specific strategies, perhaps even schedulers implemented just for one application.
I won't have time to do this justice before my internship is over.
That's OK! The work you did on tasks was awesome.
One important implementation-detail note: in rust_sched_loop, the current task currently stays on the RQ. If it gets taken off (for example, in order to steal it), that may introduce a problem where it won't get killed by kill_all if the runtime is failing. This has to do with the fix for #2671 - sched_loops now have a field called
A student has expressed interest in this project: Mozilla-Student-Projects/Projects-Tracker#31
Nominating for the backwards-compatible milestone.
I'm going to outline the straw-man design I have in mind for work stealing to go along with the scheduler redesign in #4419. As the OP pointed out, the trickiest part is dealing with waking up schedulers to steal, and the literature doesn't give any guidance on this. There are lots of ways to go about it, but the one I'm going to present is simple and fits well with the language and scheduler design. The primary consideration is that as much synchronization work as possible is offloaded from the active schedulers to those that are seeking work. The original work-stealing paper seems to imply this is done by making the schedulers that are out of work 'busy-steal', continually trying to steal from others until work is available. This imposes no overhead on active schedulers but isn't practical - a power-efficient solution needs to be event driven. Under this design the local, working scheduler pays a minimum of two atomic operations every time it defers work to the work queue: one to enqueue the task, and one more to check whether there is a sleeping scheduler to wake up. There is a shared data structure,

Access to

When a stealing scheduler can't find work it makes a

Scheduler creation

Work stealing

When a scheduler looks into its local work queue and there is nothing to do, it starts the following stealing process:

Sleeping

Waking

When a scheduler pushes a task onto its queue, it may need to send a signal to jumpstart the stealing process.

Issues
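(Editorial aside: a minimal sketch of the wake-on-push fast path in the outline above, under the assumption that the shared structure boils down to a sleeper count plus a parking primitive. `Sleepers`, `LocalQueue`, and `Task` are made-up names, and the local queue here is a plain locked deque standing in for the lock-free one the design calls for.)

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Condvar, Mutex};

struct Task;

// Stand-in for a per-scheduler work queue; a real implementation would
// use a lock-free deque so the push itself is a single atomic operation.
struct LocalQueue(Mutex<VecDeque<Task>>);

impl LocalQueue {
    fn push(&self, task: Task) {
        self.0.lock().unwrap().push_back(task);
    }
}

struct Sleepers {
    count: AtomicUsize, // how many schedulers are currently asleep
    lock: Mutex<()>,
    cond: Condvar,
}

impl Sleepers {
    // The hot path a working scheduler pays when it defers a task:
    // the enqueue, plus a single atomic load of the sleeper count.
    fn push_and_signal(&self, queue: &LocalQueue, task: Task) {
        queue.push(task);
        if self.count.load(Ordering::SeqCst) > 0 {
            // Slow path, taken only when someone is actually asleep.
            let _g = self.lock.lock().unwrap();
            self.cond.notify_one();
        }
    }

    // A scheduler that has repeatedly failed to steal parks itself here.
    // Real code would re-scan the other queues after bumping `count`,
    // otherwise a push that saw `count == 0` could be missed.
    fn go_to_sleep(&self) {
        let guard = self.lock.lock().unwrap();
        self.count.fetch_add(1, Ordering::SeqCst);
        let _woken = self.cond.wait(guard).unwrap();
        self.count.fetch_sub(1, Ordering::SeqCst);
    }
}
```

The design choice being illustrated is the asymmetry the comment asks for: busy schedulers pay one extra atomic load in the common case, and all of the heavier synchronization lands on schedulers that have nothing better to do anyway.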
I've been working on the work-stealing deque. I have an initial implementation here: https://github.com/Aatch/rust/work-steal. It seems to work (at least as a regular queue), and I'm going to see what I can do about implementing the rest of the work-stealing machinery.
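(Editorial aside: for readers unfamiliar with the owner/thief split in a Chase-Lev deque, here is a small usage sketch written against the crossbeam-deque crate, which is an assumption on my part; the branch above predates it and rolls its own deque.)

```rust
use crossbeam_deque::{Steal, Worker};
use std::thread;

fn main() {
    let local: Worker<u32> = Worker::new_lifo(); // owner pushes/pops one end
    let stealer = local.stealer();               // thieves take the other end

    for n in 0..100 {
        local.push(n);
    }

    let thief = thread::spawn(move || {
        let mut stolen = 0;
        loop {
            match stealer.steal() {
                Steal::Success(_) => stolen += 1,
                Steal::Empty => break,
                Steal::Retry => continue, // lost a race with the owner; try again
            }
        }
        stolen
    });

    // The owner drains its own end without blocking on the thief.
    let mut popped = 0;
    while local.pop().is_some() {
        popped += 1;
    }

    println!("owner popped {}, thief stole {}", popped, thief.join().unwrap());
}
```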
More papers:
Some critiques of work-stealing as applied to the Java fork/join library: http://www.coopsoft.com/ar/CalamityArticle.html#steal
Even more papers that I haven't read yet!
"BWS: Balanced Work Stealing for Time-Sharing Multicores" The focus is on increasing performance/fairness in the concurrent workstealing "jobs" case, and they use a few kernel extensions to get a solid speedup. This isn't very applicable to Rust's use-case, but the comments they make on fairness+concurrency issues could be informative. |
"Three Layer Cake for Shared-Memory Programming" This paper discusses the composition of message-passing, fork-join, and SIMD parallelism. Extremely relevant to the Rust runtime design. |
Visiting for triage; I don't know enough about the new scheduler to be able to make a sensible comment on this at all.
"Non-blocking steal-half work queues" This paper presents StealHalf, a new generalization of the Arora et al. algorithm that allows workers, instead of stealing one, to steal up to half of the items in a given work queue at a time. Workers don't need to steal as often and they are more likely to find a non-empty queue when they randomly pick one because items are distributed among more queues. |
I'm going to close this for now. There's some awesome material in this thread, but for now work stealing is implemented in the M:N schedulers using the Chase-Lev deques. I think we've been pretty happy with this, and we don't appear to be rushing to completely redesign it any time soon. There may still be bugs to shake out in work stealing, or just fine-tuning in general, but for now I think that the major action items on this issue are taken care of. I'm fine reopening if others disagree, though!
Currently: Tasks are scheduled on sched_loops in round-robin fashion, and are tied to their sched_loop for their entire lifetime. This can lead to surprising performance with irregular task setups.
It does not seem too hard to implement a 'steal' mechanism by which idle sched_loops can reach over to overloaded ones to balance some of the load. My impression is that the classic presentation involves idle processors busy-looping around the steal mechanism until they pick up something to do; I've even seen a talk on a purportedly very clever mechanism that was totally lockless. But... we just can't eat up all CPUs when not fully loaded.
One thing I haven't quite grasped is how to do this without thrashing some global synchronization whenever we decide to steal. If I'm not mistaken, the whole point of work stealing is that each processor's runqueue not be in a cacheline that multiple processors want; but I think that if we want to suspend idle CPUs, we need to have a global semaphore-like setup that gets poked every time a task (a) spawns, (b) blocks, (c) wakes, or (d) exits. And, well, this is basically when scheduler runqueues get accessed anyway.
Currently we take a global (per-scheduler) lock on (a) and (d) already, but not (b) or (c). Is there an option for work stealing that puts idle CPUs to sleep without hitting global state during the extra lifecycle events?