Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CGU size heuristic for partitioning #47415

Merged
merged 5 commits into from
Jan 26, 2018

Conversation

varkor
Copy link
Member

@varkor varkor commented Jan 13, 2018

This addresses the concern of #47316 by estimating CGU size based on
the size of its MIR. Looking at the size estimate differences for a
small selection of crates, this heuristic produces different orderings,
which should more accurately reflect optimisation time. (Fixes #47316.)

r? @michaelwoerister


// Since `sort_by_key` currently recomputes the keys for each comparison,
// we can save unnecessary recomputations by storing size estimates for
// each `MonoItem`. Storing estimates for `CodegenUnit` might be preferable,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether it's worth trying to store estimates for CodegenUnit instead. I found it somewhat troublesome to use CodegenUnit as a key, but perhaps it'd be worth introducing an index for the units to use instead.

@JustAPerson
Copy link
Contributor

I was also looking at this issue.
I found another spot where cgus are sorted by "size", so perhaps this should be extracted to a method.

@varkor
Copy link
Member Author

varkor commented Jan 13, 2018

Good spot! You're right — I'll include that in the changes. (The fact that each CodegenUnit there is wrapped in an Arc makes it slightly more awkward, so I'll sort it out tomorrow.)
Sorry about the collision — I should really get into the habit of posting on an issue before experimenting with it 😬

@kennytm kennytm added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jan 14, 2018
@michaelwoerister
Copy link
Member

Thank you both for looking into this, @varkor and @JustAPerson!

Here is what I would do:

  • Make mono_item_size_estimate() a method of MonoItem so it's available everywhere.
  • Cache the size estimate for CodegenUnits somewhere in order not to recompute them again and again. Computing them in the closure passed to sort() will mean computing them multiple times for each item because sorting will compare each item more than once. You could store the size estimate in a field of CodegenUnit for example. You just have to make sure to keep it updated properly during the various stages of partitioning.

@varkor
Copy link
Member Author

varkor commented Jan 15, 2018

Okay, I was wondering whether I could get away with extending CodegenUnit to store a size estimate, but I wasn't sure if that was too large a change for one optimisation. That should make things cleaner — I shall do that!

@varkor varkor force-pushed the cgu-partition-heuristic branch 2 times, most recently from f055bd0 to f085542 Compare January 15, 2018 18:31
@JustAPerson
Copy link
Contributor

@varkor perhaps update the comments at those two sites, because now they're wrong.

@varkor
Copy link
Member Author

varkor commented Jan 15, 2018

@JustAPerson: I saw the comments, thought to myself: "I must remember to update those before committing", and then completely forgot! 😅 Thank you!

@michaelwoerister
Copy link
Member

OK, that looks like a correct implementation to me. Now that I see it though, I think it is doing more work than necessary:

  1. After inlining, the sizes of all MonoItems that have already been present before inlining are computed a second time.
  2. We are estimating the size of each monomorphized instance individually, even though the size does not depend on the concrete type arguments of the instance. That is, we are bound to compute the same size for foo<u32>, foo<Bar>, foo<()>, etc. because tcx.instance_mir() will give as the exact same MIR object for each of those.

I think (2) is more of a problem than (1). It could be solved by adding an instance_def_size_estimate query (and base MonoItem::size_estimate() on that). That would make sure that we estimate the size of each InstanceDef only once and hit the cache for subsequents calls. As a side effect, issue (1) would be much less of a problem because recomputing becomes cheaper.

Such a query could also be optimized by just assigning a fixed small size for compiler generated shims like InstanceDef::FnPtrShim, InstanceDef::Virtual, and so on -- as we already do for statics.

Would one of you be willing to implement that?

@varkor
Copy link
Member Author

varkor commented Jan 16, 2018

@JustAPerson: If you'd like to finish the change off to share the work, you're very welcome to! I guess the easiest way would be to make a PR on my branch, which would then appear here after being merged in? I'm happy to make the change if you don't want to though!

@JustAPerson
Copy link
Contributor

@varkor I appreciate the offer but I think I'm a bit busy today so go ahead :)

@varkor
Copy link
Member Author

varkor commented Jan 16, 2018

InstanceDef now caches its size estimate, so as long as each type instantiation of type-parameterised MonoItem uses the same InstanceDef, it shouldn't have to recompute the size.

@michaelwoerister
Copy link
Member

Hm, OK, interesting approach; and certainly efficient! But it doesn't feel quite right to me.

The first reason is that it's hard to uphold the condition you state: The code will copy InstanceDefs around now; but someone else might come around later and change that without noticing the detrimental effects. And it's easy to get something wrong here without noticing. For example, your code copies the Instance, caches the size estimate in the copy -- and then immediately discard it, leaving the Instance we'll be probing later still with None:
a20b94d#diff-2f7b82f74576ca9bedc54d0bcfb29ca4R31
You could get around that with a std::cell::Cell but that's asking for trouble. I try to avoid interior mutability as much as possible.

The second reason is that it kind of conflates separate concerns by putting something into InstanceDef that doesn't strictly belong there. It's a bit of a code smell.

I suggest to go the "query" route. It might be slightly less efficient but it will be more robust in the end.

If you need any help with setting up a new query, let me know!

@varkor
Copy link
Member Author

varkor commented Jan 18, 2018

@michaelwoerister: Ah, I misunderstood — I didn't realise that queries were provided with define_maps!. (This has been a good learning experience 🙂) I haven't had a chance today, but I'll try to fix this tomorrow.

@michaelwoerister
Copy link
Member

Oh sorry, yeah, it's easy to slip into rustc-internals-lingo without noticing. Yes, "query" is what we call the things in define_maps!(). Here's a rough writeup on the concept if you are interested: https://github.com/nikomatsakis/rustc-on-demand-incremental-design-doc/blob/master/0000-rustc-on-demand-and-incremental.md

@varkor varkor force-pushed the cgu-partition-heuristic branch from a20b94d to 98752ea Compare January 19, 2018 00:33
This addresses the concern of rust-lang#47316 by estimating CGU size based on
the size of its MIR. Looking at the size estimate differences for a
small selection of crates, this heuristic produces different orderings,
which should more accurately reflect optimisation time.

Fixes rust-lang#47316.
@varkor varkor force-pushed the cgu-partition-heuristic branch from 98752ea to 5c9a8b5 Compare January 19, 2018 00:51
@varkor
Copy link
Member Author

varkor commented Jan 19, 2018

I've added an instance_def_size_estimate query as you suggested. There are a couple of places I was a bit unsure about, being unfamiliar with queries in general (annotated, to check they're all right).

@@ -638,6 +638,7 @@ define_dep_nodes!( <'tcx>
[input] TargetFeaturesWhitelist,
[] TargetFeaturesEnabled(DefId),

[] InstanceDefSizeEstimate { instance_def: InstanceDef<'tcx> },
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is [] correct here? I wasn't sure what this meant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct. Leaving the brackets empty means that this is a regular query/dep-node.

@@ -761,6 +761,7 @@ pub fn force_from_dep_node<'a, 'gcx, 'lcx>(tcx: TyCtxt<'a, 'gcx, 'lcx>,
DepKind::EraseRegionsTy |
DepKind::NormalizeTy |
DepKind::SubstituteNormalizeAndTestPredicates |
DepKind::InstanceDefSizeEstimate |
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the correct location? Or should it force! something like some of the queries below?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's the correct location. Very good! :)

force_from_dep_node takes a DepNode and tries to execute the query corresponding to it. However, a DepNode only consists of a DepKind and an opaque hash value (which acts as a UUID). It is only in special cases, like when the hash value is actually a DefPathHash that we can look up in a table, that we can map from DepNode to the correct query and its arguments. For ty::Instance we don't have such a table that would make it possible to do the mapping.

Copy link
Member

@michaelwoerister michaelwoerister left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! This looks almost done. I left some comments below. I'm impressed by how well you managed to navigate our woefully under-documented query system :)

@@ -638,6 +638,7 @@ define_dep_nodes!( <'tcx>
[input] TargetFeaturesWhitelist,
[] TargetFeaturesEnabled(DefId),

[] InstanceDefSizeEstimate { instance_def: InstanceDef<'tcx> },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct. Leaving the brackets empty means that this is a regular query/dep-node.

pub fn size_estimate(&self) -> usize {
// Should only be called if `estimate_size` has previously been called.
assert!(self.size_estimate.is_some());
self.size_estimate.unwrap()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could replace this combination of assert!() and unwrap() with a call to Option::expect(). That amounts to the same thing but is a bit more concise.

@@ -140,6 +176,7 @@ impl<'tcx> HashStable<StableHashingContext<'tcx>> for CodegenUnit<'tcx> {
let CodegenUnit {
ref items,
name,
..
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We keep these destructuring lets exhaustive in order not to miss handling fields that are added later. In a case like this you can explicitly ignore the field:

let CodegenUnit {
    ref items,
    name,
    // The size estimate is not relevant to the hash
    size_estimate: _,
} = *self;

Making these exhaustive has proven to be effective for preventing errors introduced by unrelated refactorings.

@@ -761,6 +761,7 @@ pub fn force_from_dep_node<'a, 'gcx, 'lcx>(tcx: TyCtxt<'a, 'gcx, 'lcx>,
DepKind::EraseRegionsTy |
DepKind::NormalizeTy |
DepKind::SubstituteNormalizeAndTestPredicates |
DepKind::InstanceDefSizeEstimate |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's the correct location. Very good! :)

force_from_dep_node takes a DepNode and tries to execute the query corresponding to it. However, a DepNode only consists of a DepKind and an opaque hash value (which acts as a UUID). It is only in special cases, like when the hash value is actually a DefPathHash that we can look up in a table, that we can map from DepNode to the correct query and its arguments. For ty::Instance we don't have such a table that would make it possible to do the mapping.

InstanceDef::Item(def_id) => {
let mir = tcx.optimized_mir(def_id);
mir.basic_blocks().iter().map(|bb| bb.statements.len()).sum()
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should estimate the size of DropGlue via its MIR.

@michaelwoerister
Copy link
Member

This looks pretty finished to me. Going to tentatively approve. Let me know if there's still something you wanted to do, @varkor. Thanks for the great PR!

@bors r+

@bors
Copy link
Contributor

bors commented Jan 26, 2018

⌛ Testing commit 62703cf with merge 94f7e25e204d8c8fdde9f55ae8a9e08512f7d63d...

@alexcrichton
Copy link
Member

@bors: retry

  • prioritizing rollup withthis in it

@bors
Copy link
Contributor

bors commented Jan 26, 2018

⌛ Testing commit 62703cf with merge 061b57274985c173d76ab948510d43168ddc0a12...

@alexcrichton
Copy link
Member

@bors: retry

  • prioritizing rollup with this in it

@bors
Copy link
Contributor

bors commented Jan 26, 2018

⌛ Testing commit 62703cf with merge 903f08a...

bors added a commit that referenced this pull request Jan 26, 2018
…ster

Add CGU size heuristic for partitioning

This addresses the concern of #47316 by estimating CGU size based on
the size of its MIR. Looking at the size estimate differences for a
small selection of crates, this heuristic produces different orderings,
which should more accurately reflect optimisation time. (Fixes #47316.)

r? @michaelwoerister
@bors
Copy link
Contributor

bors commented Jan 26, 2018

💔 Test failed - status-appveyor

@alexcrichton
Copy link
Member

@bors
Copy link
Contributor

bors commented Jan 26, 2018

⌛ Testing commit 62703cf with merge 8952c0af0dcc0effb95544b730b6cd1e396c1feb...

@alexcrichton
Copy link
Member

@bors: retry

  • prioritizing rollup with this in it

@bors
Copy link
Contributor

bors commented Jan 26, 2018

⌛ Testing commit 62703cf with merge 1e968bad4676ef2cabe13b1c9eca697997151ada...

@alexcrichton
Copy link
Member

@bors: retry

  • prioritizing rollup with this in it

bors added a commit that referenced this pull request Jan 26, 2018
@bors bors merged commit 62703cf into rust-lang:master Jan 26, 2018
@varkor varkor deleted the cgu-partition-heuristic branch January 27, 2018 14:31
kennytm added a commit to kennytm/rust that referenced this pull request Mar 27, 2018
Add slice::sort_by_cached_key as a memoised sort_by_key

At present, `slice::sort_by_key` calls its key function twice for each comparison that is made. When the key function is expensive (which can often be the case when `sort_by_key` is chosen over `sort_by`), this can lead to very suboptimal behaviour.

To address this, I've introduced a new slice method, `sort_by_cached_key`, which has identical semantic behaviour to `sort_by_key`, except that it guarantees the key function will only be called once per element.

Where there are `n` elements and the key function is `O(m)`:
- `slice::sort_by_cached_key` time complexity is `O(m n log m n)`, compared to `slice::sort_by_key`'s `O(m n + n log n)`.
- `slice::sort_by_cached_key` space complexity remains at `O(n + m)`. (Technically, it now reserves a slice of size `n`, whereas before it reserved a slice of size `n/2`.)

`slice::sort_unstable_by_key` has not been given an analogue, as it is important that unstable sorts are in-place, which is not a property that is guaranteed here. However, this also means that `slice::sort_unstable_by_key` is likely to be slower than `slice::sort_by_cached_key` when the key function does not have negligible complexity. We might want to explore this trade-off further in the future.

Benchmarks (for a vector of 100 `i32`s):
```
# Lexicographic: `|x| x.to_string()`
test bench_sort_by_key ... bench:      112,638 ns/iter (+/- 19,563)
test bench_sort_by_cached_key ... bench:       15,038 ns/iter (+/- 4,814)

# Identity: `|x| *x`
test bench_sort_by_key ... bench:        1,346 ns/iter (+/- 238)
test bench_sort_by_cached_key ... bench:        1,839 ns/iter (+/- 765)

# Power: `|x| x.pow(31)`
test bench_sort_by_key ... bench:        3,624 ns/iter (+/- 738)
test bench_sort_by_cached_key ... bench:        1,997 ns/iter (+/- 311)

# Abs: `|x| x.abs()`
test bench_sort_by_key ... bench:        1,546 ns/iter (+/- 174)
test bench_sort_by_cached_key ... bench:        1,668 ns/iter (+/- 790)
```
(So it seems functions that are single operations do perform slightly worse with this method, but for pretty much any more complex key, you're better off with this optimisation.)

I've definitely found myself using expensive keys in the past and wishing this optimisation was made (e.g. for rust-lang#47415). This feels like both desirable and expected behaviour, at the small cost of slightly more stack allocation and minute degradation in performance for extremely trivial keys.

Resolves rust-lang#34447.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants