
Scheduler work stealing. #3095

Closed
bblum opened this issue Aug 2, 2012 · 24 comments
Labels
A-concurrency (Area: Concurrency)
A-runtime (Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows)
C-enhancement (Category: An issue proposing an enhancement or a PR with one)
E-hard (Call for participation: Hard difficulty. Experience needed to fix: A lot.)
P-medium (Medium priority)

Comments

@bblum
Contributor

bblum commented Aug 2, 2012

Currently: Tasks are scheduled on sched_loops in round-robin fashion, and are tied to their sched_loop for their entire lifetime. This can lead to surprising performance with irregular task setups.

It does not seem too hard to implement a 'steal' mechanism by which idle sched_loops can reach over to overloaded ones to balance some of the load. My impression is that the classic presentation involves idle processors busy-looping around the steal mechanism until they pick up something to do; I've even seen a talk on a purportedly very clever mechanism that was totally lockless. But... we just can't eat up all CPUs when not fully loaded.

One thing I haven't quite grasped is how to do this without thrashing some global synchronisation whenever we decide to steal. If I'm not mistaken, the whole point of work stealing is that each processor's runqueue not be in a cacheline that multiple processors want... but I think that if we want to suspend idle CPUs, we need to have a global semaphore-alike setup that gets poked every time a task (a) spawns, (b) blocks, (c) wakes, (d) exits. And, well, this is basically when scheduler runqueues get accessed anyway.

Currently we take a global (per-scheduler) lock on (a) and (d) already, but not (b) or (c). Is there an option for work stealing that puts idle CPUs to sleep without hitting global state during the extra lifecycle events?
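To make the concern concrete, here is a minimal sketch (not the actual runtime code; all names are illustrative) of the kind of global, semaphore-alike coordination described above: a shared runnable count plus a condvar that idle sched_loops park on, which every spawn/block/wake/exit would have to poke.

```rust
use std::sync::{Condvar, Mutex};

/// Hypothetical global coordination state; every task lifecycle event touches it.
struct IdleCoord {
    runnable: Mutex<usize>, // tasks currently runnable anywhere in the scheduler
    wakeup: Condvar,        // idle sched_loops park here
}

impl IdleCoord {
    // Called on task spawn or wakeup: bump the count and poke one sleeper.
    fn task_became_runnable(&self) {
        let mut n = self.runnable.lock().unwrap();
        *n += 1;
        self.wakeup.notify_one();
    }

    // Called on task block or exit: drop the count.
    fn task_stopped_running(&self) {
        *self.runnable.lock().unwrap() -= 1;
    }

    // Called by an idle sched_loop instead of busy-stealing.
    fn park_until_work(&self) {
        let mut n = self.runnable.lock().unwrap();
        while *n == 0 {
            n = self.wakeup.wait(n).unwrap();
        }
    }
}
```

Every call above hits the same mutex and cacheline, which is exactly the contention being asked about.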

@ghost ghost assigned bblum Aug 2, 2012
@brson
Contributor

brson commented Aug 3, 2012

I don't have any answers right now, but I'm going to drop this link here: http://supertech.csail.mit.edu/papers.html. These are all the Cilk papers, many of them about work stealing.

@nikomatsakis
Contributor

One thing: not having to solve these kinds of annoying---but already solved---implementation problems is the reason why I thought it made sense to consider basing our implementation on a well-tested scheduler, like TBB. I still basically think we ought to try prototyping that.

That said, there are existing implementations that do solve these kinds of problems, and to the extent allowed by law we can also peek at what they do. I remember doing a few things when I was implementing my own work-stealing scheduler a while back, but I don't remember what worked well; I think I was never fully satisfied with this particular aspect of my design, as there was a potential race in which a thread might go to sleep just as new work was being posted and hence sit idle. Of course that thread would wake if yet more work were posted, or if a certain timeout elapsed, so in practice this wasn't a big deal. But I imagine there are more elegant solutions out there.
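As a rough sketch of that timeout fallback, assuming a condvar-based sleeper (the Mutex<bool> flag and names are illustrative, not from the scheduler being described): a sleeper that loses the race with a producer still wakes once the timeout fires and rechecks the queues.

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

// Hypothetical sleeper: a worker that failed to steal parks here.
fn park_with_timeout(flag: &Mutex<bool>, cv: &Condvar, max_idle: Duration) {
    let mut work_posted = flag.lock().unwrap();
    while !*work_posted {
        // If a producer posted work just before we started waiting, the
        // timeout bounds how long we stay (wrongly) asleep.
        let (guard, timeout) = cv.wait_timeout(work_posted, max_idle).unwrap();
        work_posted = guard;
        if timeout.timed_out() {
            break; // go recheck the queues even without a notification
        }
    }
}
```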

@graydon
Contributor

graydon commented Aug 3, 2012

Hmm. So on irc I said I'd like you to clarify the bug in terms of sched vs thread, and I'll repeat that request here in more detail.

Here is my (possibly wrong) understanding of the current system:

By default there is one sched. It's multi-threaded and the threads all share a pool of tasks. So that may involve too much locking, but it already means that most tasks auto-balance across idle cores.

If a user makes multiple scheds, it's generally because they want to pin a task to a thread. Or a smaller set of threads. That's the whole reason the api for multiple scheds exists. So in these cases, stealing tasks between scheds is incorrect, by definition.

If I'm right about those assumptions, then I don't really see what this bug means, except maybe taking fewer locks in the multi-threaded sched case. Am I misinformed about how the current system works?

@nikomatsakis
Contributor

I had assumed that this referred to stealing within the main scheduler? It seems like we ought to have each scheduler map to a pool of threads (which possibly has size 1). Tasks may move around within this pool however we like but do not cross pools.

Also—there is lots of interesting work on scheduling for better locality. In particular for Servo, Leo Meyerovich's work at Berkeley on parallelizing layout uses a variation on work stealing in which they first do classic work stealing, but record which processor ended up processing which nodes in the tree. In later phases they just replay this. Or something like that. He claimed it was essential for good performance, as classical work stealing imposed too much overhead. So I'm sure we'll be tweaking this as we go to give users more control!

@bblum
Contributor Author

bblum commented Aug 3, 2012

@brson ooh, useful. I haven't looked at the Cilk papers yet, but the "worst page-replacement policy" paper (well, the abstract) is amazing.

@nikomatsakis I am still not sure what the nature of the problem is. If the answer is "well yeah, you just suck it up and bounce on a global lock for every lifecycle operation", that wouldn't take much effort at all to implement with what we've got. If, on the other hand, there are clever heuristics that achieve good balancing while avoiding the global lock, that wheel would be a lot worse to reinvent.

@graydon No, threads within a sched do not share a pool of tasks; each thread has its own pool. The benefit we'd lose by changing to what you described is that the task block/wakeup paths only need a per-CPU lock (sched_loop->lock), rather than the global lock needed for spawn and exit (scheduler->lock).

Basically, what you described is what we want; the question is how to implement it right.

@nikomatsakis Hmm, nodes in trees is getting a lot closer to data-aware scheduling, which is beyond the scope of the runtime scheduler. It might be easy to have the runtime expose a "cpu affinity" option, and have a data-aware library make use of that.

@nikomatsakis
Contributor

@bblum I don't know what the best solutions are either. Clearly busy-looping trying to steal isn't it. I think what I had before was pretty straightforward: basically a shared boolean flag that indicated whether there were any idle threads, combined with a linked list of those idle threads. After failing to steal, a thread would acquire a lock, atomically add itself to the list, and set the flag. When producing work, you would check the flag and, if it was set, wake up an idle thread and deliver the work item directly to it rather than pushing it onto your queue. However, I don't claim this was particularly smart; as I said, I was never fully satisfied with it. I don't know what sort of protocols are used in other systems, but it feels like you can do better!
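A rough sketch of that flag-plus-idle-list protocol (all types and names below are illustrative, not the original code; a VecDeque under a lock stands in for the linked list):

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Condvar, Mutex};

// Hypothetical per-worker mailbox: a parked worker waits for a task here.
struct Idler<T> {
    slot: Mutex<Option<T>>,
    cv: Condvar,
}

struct IdleSet<T> {
    any_idle: AtomicBool,                   // fast-path check for producers
    idlers: Mutex<VecDeque<Arc<Idler<T>>>>, // "linked list" of parked workers
}

impl<T> IdleSet<T> {
    // Producer side: if someone is parked, hand the work to them directly.
    // Returns the work back to the caller when nobody is idle.
    fn publish(&self, work: T) -> Option<T> {
        if self.any_idle.load(Ordering::Acquire) {
            let mut idlers = self.idlers.lock().unwrap();
            if let Some(idler) = idlers.pop_front() {
                if idlers.is_empty() {
                    self.any_idle.store(false, Ordering::Release);
                }
                *idler.slot.lock().unwrap() = Some(work);
                idler.cv.notify_one();
                return None; // delivered directly, skip our own queue
            }
        }
        Some(work) // no idlers: caller pushes onto its local queue as usual
    }

    // Worker side: after a failed steal, register and park until handed work.
    fn park(&self, me: Arc<Idler<T>>) -> T {
        {
            let mut idlers = self.idlers.lock().unwrap();
            idlers.push_back(me.clone());
            self.any_idle.store(true, Ordering::Release);
        }
        let mut slot = me.slot.lock().unwrap();
        loop {
            if let Some(work) = slot.take() {
                return work;
            }
            slot = me.cv.wait(slot).unwrap();
        }
    }
}
```

The point of the boolean is that a busy producer only pays an atomic load on the common path; the lock is touched only when someone is actually idle.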

@bblum
Contributor Author

bblum commented Aug 3, 2012

@nikomatsakis "when you had before" - what system was that for?

It sounds like that's analogous to the "global lock" approach I was mentioning - producing work is task-wakeup (always), and becoming idle is task-block-if-it-was-your-last-task. It sounds like becoming idle was the complicated path in your system, which is in keeping with the point of work stealing that busy CPUs not be burdened with it.

I think there is probably no way to avoid global locking in the wakeup case. I suspect a blocking task can bypass the global lock if it wasn't the last task on that CPU.

I will also note that the CPU affinity I mentioned above is a good reason to keep one runqueue per CPU instead of a global pool of tasks.

@nikomatsakis
Contributor

@bblum this was a scheduler I did as part of my PhD work. Ultimately the system relied on a global lock; the flag was used to avoid touching the lock in the common case where lots of tasks are being produced and the system is generally operating at a good level of throughput (in which case, threads are not idle).

@nikomatsakis
Contributor

Oh, @bblum, regarding the data-specific stuff, I am envisioning that there will be custom schedulers tailored to specific strategies, perhaps even schedulers implemented just for one application.

@bblum
Contributor Author

bblum commented Aug 14, 2012

I won't have time to do this justice before my internship is over.

@brson
Contributor

brson commented Aug 14, 2012

That's ok! The work you did on tasks was awesome.

@bblum
Contributor Author

bblum commented Aug 15, 2012

One important implementation-detail note about rust_sched_loop: currently, the current task stays on the runqueue. If it gets taken off, e.g. in order to steal it, that may introduce a problem where it won't get killed by kill_all if the runtime is failing.

This has to do with the fix for #2671 - sched_loops now have a field called killed, which is set when the runtime fails. If stealing races with this (i.e., the stolen task misses the kill signal), the sched_loop would need to check its killed flag before starting to run the stolen task.
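Something like the following thief-side guard would close that window (a minimal sketch; the Task stub and field names are stand-ins, not the runtime's actual types):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-ins for the runtime's real task type; illustrative only.
struct Task;
impl Task {
    fn run(self) { /* execute the task */ }
    fn kill(self) { /* unwind/destroy the task, as kill_all would have */ }
}

// Hypothetical thief-side guard: check the sched_loop's `killed` flag
// before starting a freshly stolen task, so a task stolen during runtime
// failure doesn't escape kill_all.
fn run_stolen(killed: &AtomicBool, task: Task) {
    if killed.load(Ordering::Acquire) {
        task.kill();
    } else {
        task.run();
    }
}
```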

@bblum bblum mentioned this issue Sep 11, 2012
@brson
Contributor

brson commented Dec 13, 2012

A student has expressed interest in this project: Mozilla-Student-Projects/Projects-Tracker#31

@graydon
Contributor

graydon commented Apr 25, 2013

nominating for backwards-compatible milestone

@brson
Contributor

brson commented May 11, 2013

I'm going to outline the straw-man design I have in mind for work stealing to go along with the scheduler redesign in #4419. As the OP pointed out, the trickiest part is dealing with waking up schedulers to steal, and the literature doesn't give any guidance on this. There are lots of ways to go about it, but the one I'm going to present is simple and fits well with the language and scheduler design.

The primary consideration is that as much synchronization work as possible is offloaded from the active schedulers to those that are seeking work. The original work-stealing paper seems to imply this is done by making the schedulers that are out of work 'busy-steal', continually trying to steal from others until there is work available. This imposes no overhead on active schedulers but isn't practical - a power-efficient solution needs to be event-driven. Under this design the local, working scheduler pays a minimum of two atomic operations every time it defers work to the work queue: one to enqueue the task, and one more to check whether there is a sleeping scheduler to wake up.

There is a shared data structure, WorkList:

struct WorkList {
    // Handle to a quiescent, sleeping scheduler (if any), swapped atomically.
    sleeper: AtomicOption<SchedHandle>,
    // One work-stealing queue per potential scheduler, plus its claim word.
    work_queues: [(WorkQueue, WorkQueueClaim), .. MAX_SCHEDULERS]
}

AtomicOption is a pointer that can be cmpxchg'd. SchedHandle is the type through which schedulers send messages to each other. WorkQueue is a work-stealing queue. sleeper is a handle to a quiescent, sleeping scheduler that previously decided there was no work available. work_queues is a fixed-size vector of WorkQueues; as there are always MAX_SCHEDULERS WorkQueues, these should have very little or no memory overhead when unused. WorkQueueClaim is a pointer-sized integer used for assigning WorkQueues to schedulers.

WorkList is shared between schedulers as an UnsafeAtomicRcBox<WorkList>.

Access to work_queues is always random, to distribute contention between all the queues in the system.

When a stealing scheduler can't find work, it makes a SchedHandle available so it can be woken up later. Sleeping schedulers form a linked list and are responsible for waking each other up in sequence when new work becomes available. Note that, because of async I/O, a scheduler that goes to sleep may be woken in other ways. We need to be sure that I/O-bound schedulers don't repeatedly spam other schedulers to indicate that they are sleeping.

Scheduler creation

  1. select an index in [0, MAX_SCHEDULERS),
  2. cmpxchg the value of the WorkQueueClaim at that index from 0 to 1
    2a) if the cmpxchg succeeded then the scheduler takes a handle to it, records the index and creates the scheduler
    2b) if the cmpxchg fails then increment the index and try 2 again
  3. if all indexes are occupied then scheduler creation fails
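A sketch of that claim loop in terms of today's std::sync::atomic (the 2013 runtime used its own primitives; MAX_SCHEDULERS and the starting index here are illustrative):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const MAX_SCHEDULERS: usize = 64; // illustrative bound

// One claim word per slot: 0 = free, 1 = claimed.
fn claim_work_queue(claims: &[AtomicUsize; MAX_SCHEDULERS], start: usize) -> Option<usize> {
    for offset in 0..MAX_SCHEDULERS {
        let idx = (start + offset) % MAX_SCHEDULERS;
        // Step 2: cmpxchg the claim at this index from 0 to 1.
        if claims[idx]
            .compare_exchange(0, 1, Ordering::AcqRel, Ordering::Relaxed)
            .is_ok()
        {
            return Some(idx); // 2a: record the index and create the scheduler
        }
        // 2b: already claimed, try the next index.
    }
    None // step 3: all indexes occupied, scheduler creation fails
}
```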

Work stealing

When a scheduler looks into its local work queue and there is nothing to do, it starts the following stealing process:

  1. check the local waiting_for_work flag. if it is already true then we've been through this process before: there is no work available and we are waiting for the WorkAvailable message, so go to sleep and wait. this can happen when the scheduler is woken to do I/O or to respond to other messages.
  2. otherwise, select an index in [0, MAX_SCHEDULERS),
  3. attempt to pop from the queue at that index
    3a) if success then schedule the task
    3b) if the steal fails then increment the index and try 3 again
  4. if all indexes fail then go to sleep (next section)
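The search itself, sketched with a locked VecDeque standing in for the real lock-free work-stealing deque (types and names are illustrative):

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

type Task = Box<dyn FnOnce() + Send>;

// Steps 2-4: start at a random index and walk every queue once.
// A locked VecDeque stands in for the real work-stealing deque.
fn try_steal(queues: &[Mutex<VecDeque<Task>>], random_start: usize) -> Option<Task> {
    let n = queues.len();
    for offset in 0..n {
        let idx = (random_start + offset) % n;
        // Step 3: attempt to pop (steal) from the queue at this index.
        if let Some(task) = queues[idx].lock().unwrap().pop_front() {
            return Some(task); // 3a: success, go schedule it
        }
        // 3b: empty, increment the index and try again.
    }
    None // step 4: nothing anywhere, caller goes to sleep
}
```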

Sleeping

  1. set the local waiting_for_work flag to true. This will stay set until we receive the WorkAvailable message from the SchedHandle. This is to prevent the situation where I/O (or other events) wakes up the scheduler, which then goes back to sleep and creates a new SchedHandle.
  2. create a new SchedHandle to be woken up with via the WorkAvailable message.
  3. atomically swap the SchedHandle with WorkList.sleeper and store the previous value in the local next_sleeper variable
  4. wait to be woken up.

Waking

When a scheduler pushes a task onto its queue, it may need to send a signal to jumpstart the stealing process.

  1. after a working scheduler pushes onto its queue, it atomically swaps 0 with the WorkList.sleeper SchedHandle.
  2. if sleeper is non-null then it sends the WorkAvailable message and destroys the SchedHandle, then goes back to work.
  3. in the remote scheduler, upon receipt of WorkAvailable the scheduler sets waiting_for_work to 0, then looks at its next_sleeper variable to see if there are further schedulers to wake up. if so, it sends the WorkAvailable message and sets next_sleeper to 0. the process continues until all schedulers are awake.
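Putting the sleeping and waking steps together as a sketch (a Mutex<Option<...>> stands in for the lock-free AtomicOption swap, and an mpsc channel stands in for SchedHandle and its WorkAvailable message; none of these names come from the actual runtime):

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Mutex;

enum Msg {
    WorkAvailable,
}

// Stand-in for WorkList.sleeper; the real design swaps this atomically.
struct Sleepers {
    sleeper: Mutex<Option<Sender<Msg>>>,
}

impl Sleepers {
    // Sleeping, steps 2-3 (step 1, setting waiting_for_work, is not shown):
    // publish a handle and remember whoever was there before us; we are now
    // the head of the chain of sleepers.
    fn go_to_sleep(&self) -> (Receiver<Msg>, Option<Sender<Msg>>) {
        let (tx, rx) = channel();
        let next_sleeper = self.sleeper.lock().unwrap().replace(tx);
        (rx, next_sleeper) // step 4: caller blocks on rx.recv()
    }

    // Waking, steps 1-2: after pushing work, take the sleeper (if any) and
    // send WorkAvailable; "destroying the SchedHandle" is dropping the Sender.
    fn wake_one(&self) {
        if let Some(handle) = self.sleeper.lock().unwrap().take() {
            let _ = handle.send(Msg::WorkAvailable);
        }
    }
}

// Waking, step 3: a woken scheduler (after clearing its waiting_for_work
// flag, not shown) forwards the wakeup down its own next_sleeper, if any.
fn on_work_available(next_sleeper: &mut Option<Sender<Msg>>) {
    if let Some(next) = next_sleeper.take() {
        let _ = next.send(Msg::WorkAvailable);
    }
}
```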

Issues

  • Inlining all these atomically operated words into this data structure will create very heavy cache contention. That is mostly suffered by stealers, though, so it shouldn't be so bad for workers.
  • This allows stealers to go to sleep even while there is available work to be stolen. A stealer could go through the list, find nothing, then sleep, but in the meantime another scheduler could push work onto its queue - neither party notices this situation and the work doesn't get stolen. I don't think this affects correctness - it just means that you miss some parallelism.
  • For systems under light load this will end up with all pushes onto the work queue requiring the local scheduler to wake a remote scheduler, and that's not so cheap here. Maybe this can be mitigated in some way by keeping track of the number of schedulers that are trying to steal.

@Aatch
Contributor

Aatch commented May 12, 2013

I've been working on the work-stealing deque. I have an initial implementation here: https://github.com/Aatch/rust/work-steal that seems to work (at least as a regular queue), and I'm going to see what I can do in terms of implementing the rest of the work-stealing stuff.
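For readers coming to this thread later: the same worker/stealer split is available today in the crossbeam-deque crate, so a minimal modern equivalent (not Aatch's branch, just an illustration of the Chase-Lev-style API) looks roughly like this:

```rust
// Requires the external crossbeam-deque crate (not part of std).
use crossbeam_deque::{Steal, Worker};

fn main() {
    // Owner side: a LIFO worker queue, as in the Chase-Lev design.
    let worker: Worker<u32> = Worker::new_lifo();
    let stealer = worker.stealer();

    for i in 0..4 {
        worker.push(i);
    }

    // Thief side: steal() may legitimately return Retry under contention.
    loop {
        match stealer.steal() {
            Steal::Success(task) => println!("stole {task}"),
            Steal::Retry => continue,
            Steal::Empty => break,
        }
    }
}
```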

@brson
Contributor

brson commented May 24, 2013

@brson
Contributor

brson commented May 24, 2013

Some critiques of work-stealing as applied to the Java fork/join library: http://www.coopsoft.com/ar/CalamityArticle.html#steal

@brson
Contributor

brson commented Jun 1, 2013

Even more papers that I haven't read yet!

@toddaaro
Contributor

"BWS: Balanced Work Stealing for Time-Sharing Multicores"
http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-12-1.pdf

The focus is on increasing performance and fairness in the concurrent work-stealing "jobs" case, and they use a few kernel extensions to get a solid speedup. This isn't very applicable to Rust's use case, but the comments they make on fairness and concurrency issues could be informative.

@toddaaro
Contributor

"Three Layer Cake for Shared-Memory Programming"
http://www.upcrc.illinois.edu/workshops/paraplop10/papers/paraplop10_submission_8.pdf

This paper discusses the composition of message-passing, fork-join, and SIMD parallelism. Extremely relevant to the Rust runtime design.

@huonw
Member

huonw commented Jul 30, 2013

Visiting for triage; I don't know enough about the new scheduler to be able to make a sensible comment on this at all.

@cpeterso
Contributor

"Non-blocking steal-half work queues"
http://www.cs.bgu.ac.il/~hendlerd/papers/p280-hendler.pdf

This paper presents StealHalf, a new generalization of the Arora et al. algorithm that allows workers to steal up to half of the items in a given work queue at a time, instead of stealing one. Workers don't need to steal as often, and they are more likely to find a non-empty queue when they randomly pick one, because items are distributed among more queues.
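A toy illustration of the steal-half policy over a locked deque (the paper's algorithm is non-blocking; this sketch only shows the "take up to half" idea, not the lock-free machinery):

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Steal up to half of the victim's items in one operation, instead of one.
fn steal_half<T>(victim: &Mutex<VecDeque<T>>) -> Vec<T> {
    let mut queue = victim.lock().unwrap();
    let take = (queue.len() + 1) / 2; // half, rounded up, so a queue of 1 is still stealable
    queue.drain(..take).collect()
}
```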

@alexcrichton
Member

I'm going to close this for now. There's some awesome material in this thread, but for now work stealing is implemented in the M:N schedulers using Chase-Lev deques. I think we've been pretty happy with this, and we don't appear to be rushing to completely redesign it any time soon.

There may still be bugs to shake out in work stealing, or just fine-tuning in general, but for now I think the major action items on this issue are taken care of. I'm fine with reopening if others disagree, though!

bors pushed a commit to rust-lang-ci/rust that referenced this issue May 15, 2021
…flow-expr

Avoid control flow expressions conditions to go multi line
jaisnan pushed a commit to jaisnan/rust-dev that referenced this issue Jul 29, 2024
This PR reintroduces the `Invariant` trait as a mechanism for the
specification of type safety invariants. The trait is defined in the
Kani library where we also provide `Invariant` implementations for
primitive types.

In contrast to the previous `Invariant` trait, this version doesn't
require the `Arbitrary` bound (i.e., it has the same requirements as
`Arbitrary`). This way, the user isn't required to provide an
`Arbitrary` implementation in addition to the `Invariant` one.

Related rust-lang#3095
jaisnan pushed a commit to jaisnan/rust-dev that referenced this issue Jul 29, 2024
This PR adds a `#[derive(Invariant)]` macro for structs which allows
users to automatically derive the `Invariant` implementations for any
struct. The derived implementation determines the invariant for the
struct as the conjunction of invariants of its fields. In other words,
the invariant is derived as `true && self.field1.is_safe() &&
self.field2.is_safe() && ..`.

For example, for the struct

```rs
#[derive(kani::Invariant)]
struct Point<X, Y> {
    x: X,
    y: Y,
}
```

we derive the `Invariant` implementation as

```rs
impl<X: kani::Invariant, Y: kani::Invariant> kani::Invariant for Point<X, Y> {
    fn is_safe(&self) -> bool {
        true && self.x.is_safe() && self.y.is_safe()
    }
}
```

Related rust-lang#3095
jaisnan pushed a commit to jaisnan/rust-dev that referenced this issue Jul 29, 2024
…rary` and `Invariant` macros (rust-lang#3283)

This PR enables an `#[safety_constraint(...)]` attribute helper for the
`#[derive(Arbitrary)]` and `#[derive(Invariant)]` macro.

For the `Invariant` derive macro, this allows users to derive more
sophisticated invariants for their data types by annotating individual
named fields with the `#[safety_constraint(<cond>)]` attribute, where
`<cond>` represents the predicate to be evaluated for the corresponding
field. In addition, the implementation always checks `#field.is_safe()`
for each field.

For example, let's say we are working with the `Point` type from rust-lang#3250
```rs
#[derive(kani::Invariant)]
struct Point<X, Y> {
    x: X,
    y: Y,
}
```

and we need to extend it to only allow positive values for both `x` and
`y`.
With the `[safety_constraint(...)]` attribute, we can achieve this
without explicitly writing the `impl Invariant for ...` as follows:

```rs
#[derive(kani::Invariant)]
struct PositivePoint {
    #[safety_constraint(*x >= 0)]
    x: i32,
    #[safety_constraint(*y >= 0)]
    y: i32,
}
```

For the `Arbitrary` derive macro, this allows users to derive more
sophisticated `kani::any()` generators that respect the specified
invariants. In other words, the `kani::any()` will assume any invariants
specified through the `#[safety_constraint(...)]` attribute helper.
Going back to the `PositivePoint` example, we'd expect this harness to
be successful:

```rs
#[kani::proof]
fn check_invariant_helper_ok() {
    let pos_point: PositivePoint = kani::any();
    assert!(pos_point.x >= 0);
    assert!(pos_point.y >= 0);
}
```

The PR includes tests to ensure it's working as expected, in addition to
UI tests checking for cases where the arguments provided to the macro
are incorrect. Happy to add any other cases that you feel are missing.

Related rust-lang#3095