Always create loop pre headers #83956
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
[This comment is from a draft PR] Overall, the diffs show a size improvement. TP regression of about 0.1% to 0.6%. Perhaps because more optimizations kick in? Or more blocks to process? Even though the overall size diff is an improvement (e.g., when additional redundant block opts kick in more), there are cases where it regresses, e.g., when more loop cloning occurs.
Force-pushed from 6f1aa65 to f1d07df.
/azp run runtime-coreclr outerloop, runtime-coreclr jitstress
Azure Pipelines successfully started running 2 pipeline(s).
There are diffs due to LSRA basing its traversal order on bbNum ordering, and the bbNum order changes with pre-headers created early. There are a few small diffs because the order in which nested child loop pre-header blocks are processed during hoisting is different from before. Because the pre-headers exist early, some additional downstream optimizations kick in; e.g., I saw additional cases of redundant branch opts.
@AndyAyersMS PTAL
defExec.Reset();
preHeadersList = existingPreHeaders;
defExec.Pop(defExec.Height() - childLoopPreHeaders);
assert(defExec.Height() == childLoopPreHeaders);
fwiw, this if statement appears to be dead code. It's never hit in SPMI, anyway. And that makes sense: walking up the immediate dominators from the single loop exit block (or any loop exit block, for that matter, though we don't track non-single-exit blocks) should always reach the loop entry block. In other words, the (single) loop entry should dominate the (single) loop exit.
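For context, a minimal sketch of the kind of dominator walk being discussed (illustrative only; loopExit and loopEntry stand in for the loop table's exit and entry blocks, and bbIDom is a block's immediate dominator):

```cpp
// Walk up the immediate-dominator chain starting at the (single) loop exit.
// For a single-exit natural loop the entry dominates the exit, so this walk
// should always reach the entry block; the "didn't reach it" path is the
// code that appears to be dead.
BasicBlock* cur = loopExit;
while ((cur != nullptr) && (cur != loopEntry))
{
    cur = cur->bbIDom; // step to the immediate dominator
}
assert(cur == loopEntry); // expected to hold for single-exit loops
```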
Is it worth adding assert(false) here, running SPMI for all configurations, and getting rid of it if we don't hit it?
I ran an experiment and we actually do hit this case in a few x86 tests. We should probably augment our loop recognition to reject those loops, where the "exit" block of the loop is an EH handler. Presumably, if the loop has a normal exit as well as an EH exit, on x86 we won't get here, because we only get here for "single exit" loops. So it requires an "infinite" loop whose only exit is from the EH handler.
Opened #84222 to track.
There's an overall significant size improvement, and also a non-trivial TP regression of up to ~0.5%. I presume this is due to (1) creating pre-headers where we didn't before, (2) maintaining and processing extra blocks, and (3) the increased cost of downstream optimization phases in the presence of the additional blocks, e.g., larger bit vectors, more blocks in the dominator tree, etc.
Could you paste some of the before vs. after diffs from the "hoisting from nested loop" examples I had in #68061?
I'll look into that. Note that I locally compared diffs of this change from before removing the special hoisting pre-header handling against after that handling was mostly removed and replaced with different code that adds child loop pre-headers to the blocks to consider. There were very few diffs (on win-x64): mostly a few reorderings, because the blocks are processed in a slightly different order, so code gets hoisted in a different order; and one case where a slightly different set of things got hoisted, because we hoisted in a different order and exceeded the hoisting budget before getting to everything.
I am a bit surprised it costs this much. Maybe look into the TP costs via more fine-grained profiling with PIN?
Overall the code looks good. I left a few comments on the comments.
I think you should look more closely at the TP costs. Maybe this is exposing some poorly scaling algorithm somewhere?
The per-function PIN diffs (for the win-x64 benchmarks collection, which has a TP regression of 0.56% with this PR) are interesting: they show all kinds of effects of having more basic blocks (ignore the fgDominate change: I changed the signature, so there's a corresponding improvement at the bottom). One interesting thing about the …
In … we could also avoid repeatedly walking the full block list in the …. At the very least we should be walking the blocks in the …. Not sure if you want to fold all that in here or do it separately -- it should be a win on its own and would also minimize some of the extra costs here. That plus resizing the array stack might cut out half of the worst-case TP impact. Maybe.
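A rough sketch of what those two suggestions could look like (hypothetical shape; lpTop, lpBottom, and lpContains follow the existing loop-table conventions, and estimatedCount is a made-up up-front capacity estimate):

```cpp
// (1) Pre-size the stack so repeated pushes don't keep reallocating.
//     'estimatedCount' is a hypothetical estimate, e.g. the loop's block count.
ArrayStack<BasicBlock*> defExec(getAllocator(CMK_LoopHoist), estimatedCount);

// (2) Walk only the loop's lexical block range [lpTop, lpBottom] instead of the
//     function's entire block list, filtering to blocks actually in the loop.
for (BasicBlock* block = loop.lpTop; ; block = block->bbNext)
{
    if (loop.lpContains(block))
    {
        // ... do the per-block bookkeeping here ...
    }
    if (block == loop.lpBottom)
    {
        break;
    }
}
```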
I generated …
So, a lot of functions get bumped up to higher block count buckets.
I'm not sure why the number of constant iterator loops has changed; creating loop pre-headers happens after all loop recognition and recording (where the constant iterator determination occurs).
Probably should be done separately; possibly before this change is merged.
// loop pre-header block would be added anyway (by dominating the loop exit block), we don't
// add it here, and let it be added naturally, below.
//
// Note that all pre-headers get added first, which means they get considered for hoisting last. It is
So now the hoisting order is: the entry block, followed by the pre-headers from the inner loops. Can you write a comment giving an example of the order of pre-header hoisting for a multi-nested loop?
// preheader 1
for (....) {
    // preheader 2
    for (...) {
        // preheader 3
        for (...) {
        }
    }
}
At preheader 1, what will be the order in which the pre-headers will be considered? 1, 2, 3, or 3, 2, 1, or something else?
Note that the order does matter for the hoisting profitability heuristics.
Is there a way where we can hoist the block depending on size?
I added this comment:
// For example, consider this loop nest:
//
// for (....) {            // loop L00
//     // pre-header 1
//     for (...) {         // loop L01
//     }
//     // pre-header 2
//     for (...) {         // loop L02
//         // pre-header 3
//         for (...) {     // loop L03
//         }
//     }
// }
//
// When processing the outer loop L00 (with an assumed single exit), we will push on the defExec stack
// pre-header 2, pre-header 1, the loop exit block, any IDom tree blocks leading to the entry block,
// and finally the entry block. (Note that the child loop iteration order of a loop is from "farthest"
// from the loop "head" to "nearest".) Blocks are considered for hoisting in the opposite order.
//
// Note that pre-header 3 is not pushed, since it is not a direct child. It would have been processed
// when loop L02 was considered for hoisting.
//
// The order of pushing pre-header 1 and pre-header 2 is based on the order in the loop table (which is
// convenient). But note that it is arbitrary, because there is no guaranteed execution order amongst
// the child loops.
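To make the resulting LIFO consideration order concrete, here is a small illustrative sketch of the push/pop sequence for loop L00 above (not the actual hoisting code, just the ordering the comment describes):

```cpp
// Push order for L00 (an ArrayStack of BasicBlock*), per the comment above:
defExec.Push(preHeader2); // pre-header of direct child L02 (loop-table order)
defExec.Push(preHeader1); // pre-header of direct child L01
defExec.Push(loopExit);   // the single exit block
// ... any IDom-tree blocks on the path from the exit up to the entry ...
defExec.Push(loopEntry);  // pushed last ...

// ... so it is popped, and considered for hoisting, first. Overall pop order:
// entry, IDom-path blocks, exit, pre-header 1, pre-header 2.
while (defExec.Height() > 0)
{
    BasicBlock* block = defExec.Pop();
    // consider 'block' for hoisting
}
```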
Is there a way where we can hoist the block depending on size?
I'm not sure I understand the question. Hoisting of expressions does have various cost metrics applied. What kind of "block size" are you thinking about? Would it affect the order here, or the normal hoisting costing?
What kind of "block size" are you thinking about?
What I meant was: if inside pre-header 1 we hoisted out 2 expressions, and inside pre-header 2 we hoisted 4 expressions, should we track that and determine which block should be hoisted from first? I am also wondering if we should hoist from the inner-most pre-header first, because that's the one that gets executed more often than the outer loops' pre-headers. That way, if we hit the CSE limit, we at least would have hoisted the hot parts first. Let me know if it is still not clear and we can talk offline.
What I meant was: if inside pre-header 1 we hoisted out 2 expressions, and inside pre-header 2 we hoisted 4 expressions, should we track that and determine which block should be hoisted from first?
In the example above, pre-headers 1 and 2 are from sibling loops. How would we decide which block should be considered first? I don't think size makes sense. It would make sense to order based on (PGO or synthesized) block weights.
Any change here is independent of this PR, though.
I am also wondering if we should hoist from the inner-most pre-header first, because that's the one that gets executed more often than the outer loops' pre-headers.
We only hoist one level at a time, and from inner to outer. So it's possible expressions in L03 got hoisted to pre-header 3, then got hoisted to pre-header 2. Then, they should be considered together with (and equivalently to) the other expressions in pre-header 2, possibly using weighting, as described above.
@kunalspathak I tried all the examples listed there. There is no codegen difference for any of them between the baseline and this PR. (There is one minor case of a label difference induced by our PerfScore code.)
@AndyAyersMS @kunalspathak I've updated the PR to address the feedback, especially the comments. I added code to handle rebuilding the loop table when a pre-header block was previously added, so it still recognizes a constant initializer. If the tests pass and I get a sign-off, it's ready to merge.
No change from before. win-arm64 timed out (infra problem?)
Is there any reason why the changes in da05026 need to go in this PR?
Changes LGTM, but I'm worried about the number of methods regressed. On average, for every configuration, approximately 20% of methods regressed in code size. Part of the reason is block renumbering.
Were you able to run asmdiffs with PerfScore on a local machine to see if you notice any improvements? Hopefully the micro benchmarks would catch anything important.
Actually, those already got merged. Having them here was an accident: I cherry-picked them to use the change but didn't rebase them away (to avoid messing up comment threads). Presumably it ends up being a nop? Anyway, I rebased now.
As part of finding natural loops and creating the loop table, create a loop pre-header for every loop. This simplifies a lot of downstream phases, as the loop pre-header will be guaranteed to exist, and will already exist in the dominator tree. Introduce code to preserve an empty pre-header block through the optimization phases. Remove now unnecessary code in hoisting and elsewhere. Fixes dotnet#77033, dotnet#62665
Disallow creating pre-header after SSA is built
When the loop table is built, it looks for various loop patterns, including a guaranteed-executed, pre-loop constant initializer. This is used in loop cloning and loop unrolling. It needs to look "a little harder" in the case where we created loop pre-headers and then rebuilt the loop table (currently, this happens only due to loop unrolling of loops that contain nested loops). The new code only allows for empty pre-headers. This works since, in our current phase ordering, no hoisting has happened by the time the loop table is rebuilt. (Actually, it's currently not necessary to do this at all, since the constant initializer info is only used by cloning and loop unrolling, both of which have finished by the time the loop table is rebuilt. However, we might someday choose to rebuild the loop table after cloning and before unrolling, at which point it would be necessary.)
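A hedged sketch of what the "look a little harder" logic amounts to (the names and flag checks below illustrate the approach; they are not the actual implementation):

```cpp
// When matching the "iterVar = constant" initializer that precedes the loop,
// the block just before the loop entry may now be a pre-header we created
// ourselves. If that pre-header is empty (guaranteed at this point, since no
// hoisting has happened before the loop table is rebuilt), step over it and
// look for the initializer in the block before it.
BasicBlock* initBlock = loopEntry->bbPrev;
if ((initBlock != nullptr) && ((initBlock->bbFlags & BBF_LOOP_PREHEADER) != 0) && initBlock->isEmpty())
{
    initBlock = initBlock->bbPrev;
}
// ... continue the existing initializer pattern match against initBlock ...
```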
Force-pushed from 9c7ad36 to cb7f42f.
As with the CodeSize diffs, the PerfScore diffs show lots of differences, some improvements and some regressions. On balance, there appear to be more improvements than regressions. Regressions seem to be due mostly to block weight changes (which look better in the diffs).
As part of finding natural loops and creating the loop table, create a loop pre-header for every loop. This
simplifies a lot of downstream phases, as the loop pre-header will be guaranteed to exist, and will already
exist in the dominator tree.
Introduce code to preserve an empty pre-header block through the optimization phases.
Remove now unnecessary code in hoisting and elsewhere.
Fixes #77033, #62665
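For readers unfamiliar with the mechanics, here is a minimal sketch of what creating a pre-header involves (illustrative only; the JIT's real implementation handles EH regions, block weights, pred-list maintenance, and loop-table updates, and the fgReplaceJumpTarget argument order shown is an assumption):

```cpp
// Create an empty block immediately before the loop entry. It falls through
// into the entry and, after retargeting, will be the entry's only predecessor
// from outside the loop.
BasicBlock* preHead = fgNewBBbefore(BBJ_NONE, loopEntry, /* extendRegion */ true);
preHead->bbFlags |= BBF_LOOP_PREHEADER;

// Collect the entry's predecessors that are outside the loop, then retarget
// them to the pre-header (collected first so we don't mutate the pred list
// while iterating it). The allocator kind is illustrative.
ArrayStack<BasicBlock*> outsidePreds(getAllocator(CMK_LoopHoist));
for (BasicBlock* const pred : loopEntry->PredBlocks())
{
    if (!loop.lpContains(pred))
    {
        outsidePreds.Push(pred);
    }
}
while (outsidePreds.Height() > 0)
{
    BasicBlock* const pred = outsidePreds.Pop();
    // Hypothetical argument order: (block, newTarget, oldTarget).
    fgReplaceJumpTarget(pred, preHead, loopEntry);
}
```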