Scheduler work stealing. #3095
Comments
I don't have any answers right now, but I'm going to drop this link here: http://supertech.csail.mit.edu/papers.html. These are all the Cilk papers, many about work stealing.
One thing: not having to solve these kinds of annoying---but solved---implementation decisions is the reason I thought it made sense to consider basing our implementation on a well-tested scheduler, like TBB. I still basically think we ought to try prototyping that. That said, there are existing implementations that do solve these kinds of problems, and to the extent allowed by law we can also peek at what they do. I did a few things when I was implementing my own work-stealing scheduler a while back, but I don't remember what worked well. I think I was never fully satisfied with this particular aspect of my design, as there was a potential race in which a thread might go to sleep just as new work was being posted and hence sit idle. Of course that thread would wake if yet more work were posted, or if a certain timeout elapsed, so in practice this wasn't a big deal. But I imagine there are more elegant solutions out there.
Hmm. So on IRC I said I'd like you to clarify the bug in terms of sched vs thread, and I'll repeat that request here in more detail. Here is my (possibly wrong) understanding of the current system: by default there is one sched. It's multi-threaded and the threads all share a pool of tasks. That may involve too much locking, but it already means that most tasks auto-balance across idle cores. If a user makes multiple scheds, it's generally because they want to pin a task to a thread, or to a smaller set of threads. That's the whole reason the API for multiple scheds exists. So in these cases, stealing tasks between scheds is incorrect, by definition. If I'm right about those assumptions, then I don't really see what this bug means, except maybe taking fewer locks in the multi-threaded sched case. Am I misinformed about how the current system works?
I had assumed that this referred to stealing within the main scheduler? It seems like we ought to have each scheduler map to a pool of threads (which possibly has size 1). Tasks may move around within this pool however we like but do not cross pools. Also, there is lots of interesting work on scheduling for better locality. In particular for Servo, Leo Meyerovich's work at Berkeley on parallelizing layout uses a variation on work stealing: they first do classic work stealing, but they record which processor ended up processing which nodes in the tree, and in later phases they just replay this. Or something like that. He claimed it was essential for good performance, as classical work stealing had too much overhead. So I'm sure we'll be tweaking this as we go to give users more control!
@brson ooh, useful. I haven't looked at the Cilk papers yet, but the "worst page-replacement policy" paper (well, the abstract) is amazing. @nikomatsakis I am still not sure what the nature of the problem is. If the answer is "Well yeah, you just suck it up and bounce on a global lock for every lifecycle operation", that wouldn't take much effort at all to implement with what we've got. If, on the other hand, there are clever heuristics that achieve good balancing while avoiding the global lock, that wheel would be a lot worse to reinvent. @graydon No, threads within a sched do not share a pool of tasks; each thread has its own pool. The benefit we'd lose by changing to what you described is that the task block/wakeup paths only need a per-CPU lock (sched_loop->lock), rather than the global lock needed for spawn and exit (scheduler->lock). Basically, what you described is what we want; the question is how to implement it right. @nikomatsakis Hmm, nodes in trees is getting a lot closer to data-aware scheduling, which is beyond the scope of the runtime scheduler. It might be easy to make the runtime expose a "cpu affinity" option and have a data-aware library make use of it.
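(Editorial aside: a rough, purely illustrative Rust rendering of the locking split just described; the actual runtime types, rust_sched_loop and rust_scheduler, are C++, and the names below are made up.)

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

struct Task;

struct SchedLoop {
    // Per-CPU run queue: task block/wakeup only needs this lock.
    queue: Mutex<VecDeque<Task>>,
}

struct Scheduler {
    // Scheduler-wide state: spawn and exit currently go through this lock.
    loops: Mutex<Vec<Arc<SchedLoop>>>,
}
```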
@bblum I don't know what the best solutions are either. Clearly busy-looping trying to steal isn't it. I think what I had before was pretty straightforward: basically, a shared boolean flag that indicated whether there were any idle threads, combined with a linked list of those idle threads. After they failed to steal, threads would acquire a lock, atomically add themselves to the list, and set the flag. When producing work, you would check the flag and, if it was true, wake up an idle thread and deliver the work item directly to it rather than pushing it onto your queue. However, I don't claim this was particularly smart; as I said, I was never fully satisfied with it. I don't know what sort of protocols are used in other systems; it feels like you can do better, though!
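(Editorial aside: a minimal sketch of that kind of flag-plus-idle-list protocol, written against today's std primitives rather than the runtime internals of the time; `IdleList`, `Task`, and the method names are hypothetical.)

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Condvar, Mutex};

// Hypothetical task type standing in for whatever the scheduler runs.
struct Task;

struct IdleList {
    // Set whenever at least one worker is parked, so producers can skip
    // the lock entirely in the common case where everyone is busy.
    any_idle: AtomicBool,
    // (number of parked workers, tasks handed directly to them)
    inner: Mutex<(usize, VecDeque<Task>)>,
    cond: Condvar,
}

impl IdleList {
    // Called by a worker after it has failed to steal anything.
    // The race described above still exists: a producer may have read
    // `any_idle` just before we set it, so real code would re-check the
    // run queues (or use a timeout) before actually sleeping.
    fn park_until_work(&self) -> Task {
        let mut guard = self.inner.lock().unwrap();
        guard.0 += 1;
        self.any_idle.store(true, Ordering::SeqCst);
        loop {
            if let Some(task) = guard.1.pop_front() {
                guard.0 -= 1;
                if guard.0 == 0 {
                    self.any_idle.store(false, Ordering::SeqCst);
                }
                return task;
            }
            guard = self.cond.wait(guard).unwrap();
        }
    }

    // Called by a producer. Returns the task back if nobody was idle,
    // so the caller can push it onto its own local queue instead.
    fn try_hand_off(&self, task: Task) -> Option<Task> {
        if !self.any_idle.load(Ordering::SeqCst) {
            return Some(task); // fast path: no lock touched
        }
        let mut guard = self.inner.lock().unwrap();
        if guard.0 == 0 {
            return Some(task); // raced with the last sleeper waking up
        }
        guard.1.push_back(task);
        self.cond.notify_one();
        None
    }
}
```

The point of the flag is exactly what the comment describes: the producer's fast path stays lock-free while the system is busy, and the lock is only touched when someone is actually idle.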
@nikomatsakis "when you had before" - what system was that for? it sounds like that's analogous to the "global lock" approach i was mentioning - producing work is task-wakeup (always), and becoming idle is task-block-if-it-was-your-last-task. it sounds like becoming idle was the complicated path in your system, which is in keeping with the point of work-stealing that busy cpus not be burdened with it. i think there is probably not a way to avoid global locking in the wakeup case. i suspect tasks blocking can bypass the global lock if it wasn't the last task on that cpu. i will also note that the cpu affinity i mentioned above is a good reason to keep one-runqueue-per-cpu instead of a global pool of tasks. |
@bblum This was a scheduler I did as part of my PhD work. Ultimately the system relied on a global lock; the flag was used to avoid touching the lock in the common case where lots of tasks are being produced and the system is generally operating at a good level of throughput (in which case, threads are not idle).
Oh, @bblum, regarding the data-specific stuff, I am envisioning that there will be custom schedulers tailored to specific strategies, perhaps even schedulers implemented just for one application.
I won't have time to do this justice before my internship is over.
That's OK! The work you did on tasks was awesome.
One important implementation-detail note: in rust_sched_loop, the current task currently stays on the RQ. If it gets taken off (for example, in order to steal it), that may introduce a problem where it won't get killed by kill_all if the runtime is failing. This has to do with the fix for #2671 - sched_loops now have a field called
A student has expressed interest in this project: Mozilla-Student-Projects/Projects-Tracker#31
Nominating for the backwards-compatible milestone.
I'm going to outline the straw-man design I have in mind for work stealing to go along with the scheduler redesign in #4419. As the OP pointed out, the trickiest part is dealing with waking up schedulers to steal, and the literature doesn't give any guidance on this. There are lots of ways to go about it, but the one I'm going to present is simple and fits well with the language and scheduler design. The primary consideration is that as much synchronization work as possible is offloaded from the active schedulers to those that are seeking work. The original work-stealing paper seems to imply this is done by making the schedulers that are out of work 'busy-steal', continually trying to steal from others until work is available. This imposes no overhead on active schedulers but isn't practical - a power-efficient solution needs to be event driven. Under this design the local, working scheduler pays a minimum of two atomic operations every time it defers work to the work queue: one to enqueue the task, and one more to check whether there is a sleeping scheduler to wake up. There is a shared data structure,

Access to

When a stealing scheduler can't find work it makes a

Scheduler creation

Work stealing

When a scheduler looks into its local work queue and there is nothing to do, it starts the following stealing process:

Sleeping

Waking

When a scheduler pushes a task onto its queue, it may need to send a signal to jumpstart the stealing process.

Issues
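(Editorial aside: a minimal sketch of the wake-on-push fast path in the outline above, under the assumption that the shared structure boils down to a sleeper count plus a parking primitive. `Sleepers`, `LocalQueue`, and `Task` are made-up names, and the local queue here is a plain locked deque standing in for the lock-free one the design calls for.)

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Condvar, Mutex};

struct Task;

// Stand-in for a per-scheduler work queue; a real implementation would
// use a lock-free deque so the push itself is a single atomic operation.
struct LocalQueue(Mutex<VecDeque<Task>>);

impl LocalQueue {
    fn push(&self, task: Task) {
        self.0.lock().unwrap().push_back(task);
    }
}

struct Sleepers {
    count: AtomicUsize, // how many schedulers are currently asleep
    lock: Mutex<()>,
    cond: Condvar,
}

impl Sleepers {
    // The hot path a working scheduler pays when it defers a task:
    // the enqueue, plus a single atomic load of the sleeper count.
    fn push_and_signal(&self, queue: &LocalQueue, task: Task) {
        queue.push(task);
        if self.count.load(Ordering::SeqCst) > 0 {
            // Slow path, taken only when someone is actually asleep.
            let _g = self.lock.lock().unwrap();
            self.cond.notify_one();
        }
    }

    // A scheduler that has repeatedly failed to steal parks itself here.
    // Real code would re-scan the other queues after bumping `count`,
    // otherwise a push that saw `count == 0` could be missed.
    fn go_to_sleep(&self) {
        let guard = self.lock.lock().unwrap();
        self.count.fetch_add(1, Ordering::SeqCst);
        let _woken = self.cond.wait(guard).unwrap();
        self.count.fetch_sub(1, Ordering::SeqCst);
    }
}
```

The design choice being illustrated is the asymmetry the comment asks for: busy schedulers pay one extra atomic load in the common case, and all of the heavier synchronization lands on schedulers that have nothing better to do anyway.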
I've been working on the work-stealing deque. I have an initial implementation here: https://github.com/Aatch/rust/work-steal. It seems to work (at least as a regular queue), and I'm going to see what I can do about implementing the rest of the work-stealing machinery.
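(Editorial aside: for readers unfamiliar with the owner/thief split in a Chase-Lev deque, here is a small usage sketch written against the crossbeam-deque crate, which is an assumption on my part; the branch above predates it and rolls its own deque.)

```rust
use crossbeam_deque::{Steal, Worker};
use std::thread;

fn main() {
    let local: Worker<u32> = Worker::new_lifo(); // owner pushes/pops one end
    let stealer = local.stealer();               // thieves take the other end

    for n in 0..100 {
        local.push(n);
    }

    let thief = thread::spawn(move || {
        let mut stolen = 0;
        loop {
            match stealer.steal() {
                Steal::Success(_) => stolen += 1,
                Steal::Empty => break,
                Steal::Retry => continue, // lost a race with the owner; try again
            }
        }
        stolen
    });

    // The owner drains its own end without blocking on the thief.
    let mut popped = 0;
    while local.pop().is_some() {
        popped += 1;
    }

    println!("owner popped {}, thief stole {}", popped, thief.join().unwrap());
}
```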
More papers:
Some critiques of work-stealing as applied to the Java fork/join library: http://www.coopsoft.com/ar/CalamityArticle.html#steal
Even more papers that I haven't read yet!
"BWS: Balanced Work Stealing for Time-Sharing Multicores" The focus is on increasing performance/fairness in the concurrent workstealing "jobs" case, and they use a few kernel extensions to get a solid speedup. This isn't very applicable to Rust's use-case, but the comments they make on fairness+concurrency issues could be informative. |
"Three Layer Cake for Shared-Memory Programming" This paper discusses the composition of message-passing, fork-join, and SIMD parallelism. Extremely relevant to the Rust runtime design. |
Visiting for triage; I don't know enough about the new scheduler to be able to make a sensible comment on this at all.
"Non-blocking steal-half work queues" This paper presents StealHalf, a new generalization of the Arora et al. algorithm that allows workers, instead of stealing one, to steal up to half of the items in a given work queue at a time. Workers don't need to steal as often and they are more likely to find a non-empty queue when they randomly pick one because items are distributed among more queues. |
I'm going to close this for now. There's some awesome material in this thread, but for now work stealing is implemented in the M:N schedulers using the Chase-Lev deques. I think we've been pretty happy with this, and we don't appear to be rushing to completely redesign it any time soon. There may still be bugs to shake out in work stealing, or just fine-tuning in general, but for now I think that the major action items on this issue are taken care of. I'm fine reopening if others disagree, though!
Currently: Tasks are scheduled on sched_loops in round-robin fashion, and are tied to their sched_loop for their entire lifetime. This can lead to surprising performance with irregular task setups.
It does not seem too hard to implement a 'steal' mechanism by which idle sched_loops can reach over to overloaded ones to balance some of the load. My impression is that the classic presentation involves idle processors busy-looping around the steal mechanism until they pick up something to do; I've even seen a talk on a purportedly very clever mechanism that was totally lockless. But... we just can't eat up all CPUs when not fully loaded.
One thing I haven't quite grasped is how to do this without thrashing some global synchronization whenever we decide to steal. If I'm not mistaken, the whole point of work stealing is that each processor's runqueue not be in a cacheline that multiple processors want; but I think that if we want to suspend idle CPUs, we need to have a global semaphore-like setup that gets poked every time a task (a) spawns, (b) blocks, (c) wakes, or (d) exits. And, well, this is basically when scheduler runqueues get accessed anyway.
Currently we take a global (per-scheduler) lock on (a) and (d) already, but not (b) or (c). Is there an option for work stealing that puts idle CPUs to sleep without hitting global state during the extra lifecycle events?