Always create loop pre headers #83956
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
[This comment is from a draft PR] Overall, the diffs show a size improvement. TP regression of about 0.1% to 0.6%. Perhaps because more optimizations kick in? Or more blocks to process? Even though the overall size diff is an improvement (e.g., when additional redundant block opts kick in more), there are cases where it regresses, e.g., when more loop cloning occurs.
Force-pushed from 6f1aa65 to f1d07df.
/azp run runtime-coreclr outerloop, runtime-coreclr jitstress
Azure Pipelines successfully started running 2 pipeline(s).
There are diffs due to LSRA basing its traversal order on bbNum ordering, and the bbNum order changes with pre-headers created early. There are a few small diffs because the order in which nested child loop pre-header blocks are processed during hoisting is different from before. Because the pre-headers exist early, some additional downstream optimizations kick in; e.g., I saw additional cases of redundant branch opts.
@AndyAyersMS PTAL
defExec.Reset();
preHeadersList = existingPreHeaders;
defExec.Pop(defExec.Height() - childLoopPreHeaders);
assert(defExec.Height() == childLoopPreHeaders);
fwiw, this if statement appears to be dead code. It's never hit in SPMI, anyway. And that makes sense: walking up the immediate dominators from the single loop exit block (or any loop exit block, for that matter, though we don't track non-single-exit blocks) should always reach the loop entry block. In other words, the (single) loop entry should dominate the (single) loop exit.
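For context, a minimal sketch of the kind of dominator walk being discussed (illustrative only; loopExit and loopEntry stand in for the loop table's exit and entry blocks, and bbIDom is a block's immediate dominator):

```cpp
// Walk up the immediate-dominator chain starting at the (single) loop exit.
// For a single-exit natural loop the entry dominates the exit, so this walk
// should always reach the entry block; the "didn't reach it" path is the
// code that appears to be dead.
BasicBlock* cur = loopExit;
while ((cur != nullptr) && (cur != loopEntry))
{
    cur = cur->bbIDom; // step to the immediate dominator
}
assert(cur == loopEntry); // expected to hold for single-exit loops
```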
Is it worth adding assert(false) here, running SPMI for all configurations, and getting rid of it if we don't hit it?
I ran an experiment and we actually do hit this case in a few x86 tests. We should probably augment our loop recognition to reject those loops, where the "exit" block of the loop is an EH handler. Presumably, if the loop has a normal exit as well as an EH exit, on x86 we won't get here, because we only get here for "single exit" loops. So it requires an "infinite" loop whose only exit is from the EH handler.
Opened #84222 to track.
There's an overall significant size improvement, and also a non-trivial TP regression of up to ~0.5%. I presume this is due to (1) creating pre-headers where we didn't before, (2) maintaining and processing extra blocks, and (3) the increased cost of downstream optimization phases in the presence of the additional blocks, e.g., larger bit vectors, more blocks in the dominator tree, etc.
Could you paste some of the before vs. after diffs from the "hoisting from nested loop" examples I had in #68061?
I'll look into that. Note that I locally compared diffs of this change from before removing the special hoisting pre-header handling against after that handling was mostly removed and replaced with different code that adds child loop pre-headers to the blocks to consider. There were very few diffs (on win-x64): mostly a few reorderings, because the blocks are processed in a slightly different order, so code gets hoisted in a different order; and one case where a slightly different set of things got hoisted, because we hoisted in a different order and exceeded the hoisting budget before getting to everything.
I am a bit surprised it costs this much. Maybe look into the TP costs via more fine-grained profiling with PIN?
Overall the code looks good. I left a few comments on the comments.
I think you should look more closely at the TP costs. Maybe this is exposing some poorly scaling algorithm somewhere?
The per-function PIN diffs (for the win-x64 benchmarks collection, which has a TP regression of 0.56% with this PR) are interesting: they show all kinds of effects of having more basic blocks (ignore the fgDominate change: I changed the signature, so there's a corresponding improvement at the bottom). One interesting thing about the …
In … we could also avoid repeatedly walking the full block list in the …. At the very least we should be walking the blocks in the …. Not sure if you want to fold all that in here or do it separately -- it should be a win on its own and would also minimize some of the extra costs here. That plus resizing the array stack might cut out half of the worst-case TP impact. Maybe.
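A rough sketch of what those two suggestions could look like (hypothetical shape; lpTop, lpBottom, and lpContains follow the existing loop-table conventions, and estimatedCount is a made-up up-front capacity estimate):

```cpp
// (1) Pre-size the stack so repeated pushes don't keep reallocating.
//     'estimatedCount' is a hypothetical estimate, e.g. the loop's block count.
ArrayStack<BasicBlock*> defExec(getAllocator(CMK_LoopHoist), estimatedCount);

// (2) Walk only the loop's lexical block range [lpTop, lpBottom] instead of the
//     function's entire block list, filtering to blocks actually in the loop.
for (BasicBlock* block = loop.lpTop; ; block = block->bbNext)
{
    if (loop.lpContains(block))
    {
        // ... do the per-block bookkeeping here ...
    }
    if (block == loop.lpBottom)
    {
        break;
    }
}
```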
I generated …
So, a lot of functions get bumped up to higher block count buckets.
I'm not sure why the number of constant iterator loops has changed; creating loop pre-headers happens after all loop recognition and recording (where the constant iterator determination occurs).
Probably should be done separately; possibly before this change is merged.
// loop pre-header block would be added anyway (by dominating the loop exit block), we don't
// add it here, and let it be added naturally, below.
//
// Note that all pre-headers get added first, which means they get considered for hoisting last. It is
So now the hoisting order is: the entry block, followed by the pre-headers from the inner loops. Can you write a comment giving an example of the order of pre-header hoisting for a multi-nested loop?
// preheader 1
for (....) {
    // preheader 2
    for (...) {
        // preheader 3
        for (...) {
        }
    }
}
At preheader 1, what will be the order in which the pre-headers will be considered? 1, 2, 3, or 3, 2, 1, or something else?
Note that the order does matter for the hoisting profitability heuristics.
Is there a way where we can hoist the block depending on size?
I added this comment:
// For example, consider this loop nest:
//
// for (....) {            // loop L00
//     // pre-header 1
//     for (...) {         // loop L01
//     }
//     // pre-header 2
//     for (...) {         // loop L02
//         // pre-header 3
//         for (...) {     // loop L03
//         }
//     }
// }
//
// When processing the outer loop L00 (with an assumed single exit), we will push on the defExec stack
// pre-header 2, pre-header 1, the loop exit block, any IDom tree blocks leading to the entry block,
// and finally the entry block. (Note that the child loop iteration order of a loop is from "farthest"
// from the loop "head" to "nearest".) Blocks are considered for hoisting in the opposite order.
//
// Note that pre-header 3 is not pushed, since it is not a direct child. It would have been processed
// when loop L02 was considered for hoisting.
//
// The order of pushing pre-header 1 and pre-header 2 is based on the order in the loop table (which is
// convenient). But note that it is arbitrary, because there is no guaranteed execution order amongst
// the child loops.
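To make the resulting LIFO consideration order concrete, here is a small illustrative sketch of the push/pop sequence for loop L00 above (not the actual hoisting code, just the ordering the comment describes):

```cpp
// Push order for L00 (an ArrayStack of BasicBlock*), per the comment above:
defExec.Push(preHeader2); // pre-header of direct child L02 (loop-table order)
defExec.Push(preHeader1); // pre-header of direct child L01
defExec.Push(loopExit);   // the single exit block
// ... any IDom-tree blocks on the path from the exit up to the entry ...
defExec.Push(loopEntry);  // pushed last ...

// ... so it is popped, and considered for hoisting, first. Overall pop order:
// entry, IDom-path blocks, exit, pre-header 1, pre-header 2.
while (defExec.Height() > 0)
{
    BasicBlock* block = defExec.Pop();
    // consider 'block' for hoisting
}
```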
Is there a way where we can hoist the block depending on size?
I'm not sure I understand the question. Hoisting of expressions does have various cost metrics applied. What kind of "block size" are you thinking about? Would it affect the order here, or the normal hoisting costing?
What kind of "block size" are you thinking about?
What I meant was: if inside pre-header 1 we hoisted out 2 expressions, and inside pre-header 2 we hoisted 4 expressions, should we track that and determine which block should be hoisted from first? I am also wondering if we should hoist from the inner-most pre-header first, because that's the one that gets executed more often than the outer loops' pre-headers. That way, if we hit the CSE limit, we at least would have hoisted the hot parts first. Let me know if it is still not clear and we can talk offline.
What I meant was: if inside pre-header 1 we hoisted out 2 expressions, and inside pre-header 2 we hoisted 4 expressions, should we track that and determine which block should be hoisted from first?
In the example above, pre-headers 1 and 2 are from sibling loops. How would we decide which block should be considered first? I don't think size makes sense. It would make sense to order based on (PGO or synthesized) block weights.
Any change here is independent of this PR, though.
I am also wondering if we should hoist from the inner-most pre-header first, because that's the one that gets executed more often than the outer loops' pre-headers.
We only hoist one level at a time, and from inner to outer. So it's possible expressions in L03 got hoisted to pre-header 3, then got hoisted to pre-header 2. Then, they should be considered together with (and equivalently to) the other expressions in pre-header 2, possibly using weighting, as described above.
@kunalspathak I tried all the examples listed there. There is no codegen difference for any of them between the baseline and this PR. (There is one minor case of a label difference induced by our PerfScore code.)
@AndyAyersMS @kunalspathak I've updated the PR to address the feedback, especially the comments. I added code to handle rebuilding the loop table when a pre-header block was previously added, so it still recognizes a constant initializer. If the tests pass and I get a sign-off, it's ready to merge.
No change from before. win-arm64 timed out (infra problem?)
Is there any reason why the changes in da05026 need to go in this PR?
Changes LGTM, but I'm worried about the number of methods regressed. On average, for every configuration, approximately 20% of methods regressed in code size. Part of the reason is block renumbering.
Were you able to run asmdiffs with PerfScore on a local machine to see if you notice any improvements? Hopefully the micro benchmarks would catch anything important.
Actually, those already got merged. Having them here was an accident: I cherry-picked them to use the change but didn't rebase them away (to avoid messing up comment threads). Presumably it ends up being a nop? Anyway, I rebased now.
As part of finding natural loops and creating the loop table, create a loop pre-header for every loop. This simplifies a lot of downstream phases, as the loop pre-header will be guaranteed to exist, and will already exist in the dominator tree. Introduce code to preserve an empty pre-header block through the optimization phases. Remove now unnecessary code in hoisting and elsewhere. Fixes dotnet#77033, dotnet#62665
Disallow creating pre-header after SSA is built
When the loop table is built, it looks for various loop patterns, including a guaranteed-executed, pre-loop constant initializer. This is used in loop cloning and loop unrolling. It needs to look "a little harder" in the case where we created loop pre-headers and then rebuilt the loop table (currently, this happens only due to loop unrolling of loops that contain nested loops). The new code only allows for empty pre-headers. This works since, in our current phase ordering, no hoisting has happened by the time the loop table is rebuilt. (Actually, it's currently not necessary to do this at all, since the constant initializer info is only used by cloning and loop unrolling, both of which have finished by the time the loop table is rebuilt. However, we might someday choose to rebuild the loop table after cloning and before unrolling, at which point it would be necessary.)
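A hedged sketch of what the "look a little harder" logic amounts to (the names and flag checks below illustrate the approach; they are not the actual implementation):

```cpp
// When matching the "iterVar = constant" initializer that precedes the loop,
// the block just before the loop entry may now be a pre-header we created
// ourselves. If that pre-header is empty (guaranteed at this point, since no
// hoisting has happened before the loop table is rebuilt), step over it and
// look for the initializer in the block before it.
BasicBlock* initBlock = loopEntry->bbPrev;
if ((initBlock != nullptr) && ((initBlock->bbFlags & BBF_LOOP_PREHEADER) != 0) && initBlock->isEmpty())
{
    initBlock = initBlock->bbPrev;
}
// ... continue the existing initializer pattern match against initBlock ...
```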
Force-pushed from 9c7ad36 to cb7f42f.
As with the CodeSize diffs, the PerfScore diffs show lots of differences, some improvements and some regressions. On balance, there appear to be more improvements than regressions. Regressions seem to be due mostly to block weight changes (which look better in the diffs).
As part of finding natural loops and creating the loop table, create a loop pre-header for every loop. This
simplifies a lot of downstream phases, as the loop pre-header will be guaranteed to exist, and will already
exist in the dominator tree.
Introduce code to preserve an empty pre-header block through the optimization phases.
Remove now unnecessary code in hoisting and elsewhere.
Fixes #77033, #62665
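For readers unfamiliar with the mechanics, here is a minimal sketch of what creating a pre-header involves (illustrative only; the JIT's real implementation handles EH regions, block weights, pred-list maintenance, and loop-table updates, and the fgReplaceJumpTarget argument order shown is an assumption):

```cpp
// Create an empty block immediately before the loop entry. It falls through
// into the entry and, after retargeting, will be the entry's only predecessor
// from outside the loop.
BasicBlock* preHead = fgNewBBbefore(BBJ_NONE, loopEntry, /* extendRegion */ true);
preHead->bbFlags |= BBF_LOOP_PREHEADER;

// Collect the entry's predecessors that are outside the loop, then retarget
// them to the pre-header (collected first so we don't mutate the pred list
// while iterating it). The allocator kind is illustrative.
ArrayStack<BasicBlock*> outsidePreds(getAllocator(CMK_LoopHoist));
for (BasicBlock* const pred : loopEntry->PredBlocks())
{
    if (!loop.lpContains(pred))
    {
        outsidePreds.Push(pred);
    }
}
while (outsidePreds.Height() > 0)
{
    BasicBlock* const pred = outsidePreds.Pop();
    // Hypothetical argument order: (block, newTarget, oldTarget).
    fgReplaceJumpTarget(pred, preHead, loopEntry);
}
```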