[LoopUnroll] Clamp PartialThreshold for large LoopMicroOpBufferSize #67657
Conversation
@llvm/pr-subscribers-llvm-transforms

Changes

The znver3/znver4 scheduler models are outliers, specifying very large LoopMicroOpBufferSizes at 512, while typical values for other subtargets are on the order of ~50. Even if this information is micro-architecturally correct (*), this does not mean that we want to runtime unroll all loops to a size that completely fills the loop buffer. Unless this is the single hot loop in the entire application, the massive code size increase will bust the micro-op and instruction caches.

Protect against this by clamping to the default PartialThreshold of 150, which is the same as the default full-unroll threshold and half the aggressive full-unroll threshold. Allowing more partial unrolling than full unrolling is certainly nonsensical.

(*) I strongly doubt that this is actually correct -- I believe this may derive from an incorrect reading of Agner Fog's micro-architecture guide. The number 4096 that was originally used here is the size of the general micro-op cache, not that of a loop buffer. A separate loop buffer is not listed for the Zen microarchitecture. Comparing this to the listing for Skylake, it has a 1536 micro-op buffer, but only a 64 micro-op loopback buffer, with a note that it's rarely fully utilized. Our scheduling model specifies LoopMicroOpBufferSize of 50 in that case.

Patch is 65.57 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/67657.diff

2 Files Affected:
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index c11d558a73e9d09..91daeee19acad34 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -575,7 +575,13 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
if (PartialUnrollingThreshold.getNumOccurrences() > 0)
MaxOps = PartialUnrollingThreshold;
else if (ST->getSchedModel().LoopMicroOpBufferSize > 0)
- MaxOps = ST->getSchedModel().LoopMicroOpBufferSize;
+ // Upper bound by the default PartialThreshold, which is the same as
+ // the default full-unroll Threshold. Even if the loop micro-op buffer
+ // is very large, this does not mean that we want to unroll all loops
+ // to that length, as it would increase code size beyond the limits of
+ // what unrolling normally allows.
+ MaxOps = std::min(ST->getSchedModel().LoopMicroOpBufferSize,
+ UP.PartialThreshold);
else
return;
diff --git a/llvm/test/Transforms/LoopUnroll/X86/znver3.ll b/llvm/test/Transforms/LoopUnroll/X86/znver3.ll
index 30389062a096788..467c57906d88827 100644
--- a/llvm/test/Transforms/LoopUnroll/X86/znver3.ll
+++ b/llvm/test/Transforms/LoopUnroll/X86/znver3.ll
@@ -9,8 +9,8 @@ define i32 @test(ptr %ary) "target-cpu"="znver3" {
; CHECK-NEXT: entry:
; CHECK-NEXT: br label [[FOR_BODY:%.*]]
; CHECK: for.body:
-; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT_127:%.*]], [[FOR_BODY]] ]
-; CHECK-NEXT: [[SUM:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[SUM_NEXT_127:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT_31:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT: [[SUM:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[SUM_NEXT_31:%.*]], [[FOR_BODY]] ]
; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV]]
; CHECK-NEXT: [[VAL:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
; CHECK-NEXT: [[SUM_NEXT:%.*]] = add nsw i32 [[VAL]], [[SUM]]
@@ -137,396 +137,12 @@ define i32 @test(ptr %ary) "target-cpu"="znver3" {
; CHECK-NEXT: [[INDVARS_IV_NEXT_30:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 31
; CHECK-NEXT: [[ARRAYIDX_31:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_30]]
; CHECK-NEXT: [[VAL_31:%.*]] = load i32, ptr [[ARRAYIDX_31]], align 4
-; CHECK-NEXT: [[SUM_NEXT_31:%.*]] = add nsw i32 [[VAL_31]], [[SUM_NEXT_30]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_31:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 32
-; CHECK-NEXT: [[ARRAYIDX_32:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_31]]
-; CHECK-NEXT: [[VAL_32:%.*]] = load i32, ptr [[ARRAYIDX_32]], align 4
-; CHECK-NEXT: [[SUM_NEXT_32:%.*]] = add nsw i32 [[VAL_32]], [[SUM_NEXT_31]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_32:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 33
-; CHECK-NEXT: [[ARRAYIDX_33:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_32]]
-; CHECK-NEXT: [[VAL_33:%.*]] = load i32, ptr [[ARRAYIDX_33]], align 4
-; CHECK-NEXT: [[SUM_NEXT_33:%.*]] = add nsw i32 [[VAL_33]], [[SUM_NEXT_32]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_33:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 34
-; CHECK-NEXT: [[ARRAYIDX_34:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_33]]
-; CHECK-NEXT: [[VAL_34:%.*]] = load i32, ptr [[ARRAYIDX_34]], align 4
-; CHECK-NEXT: [[SUM_NEXT_34:%.*]] = add nsw i32 [[VAL_34]], [[SUM_NEXT_33]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_34:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 35
-; CHECK-NEXT: [[ARRAYIDX_35:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_34]]
-; CHECK-NEXT: [[VAL_35:%.*]] = load i32, ptr [[ARRAYIDX_35]], align 4
-; CHECK-NEXT: [[SUM_NEXT_35:%.*]] = add nsw i32 [[VAL_35]], [[SUM_NEXT_34]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_35:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 36
-; CHECK-NEXT: [[ARRAYIDX_36:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_35]]
-; CHECK-NEXT: [[VAL_36:%.*]] = load i32, ptr [[ARRAYIDX_36]], align 4
-; CHECK-NEXT: [[SUM_NEXT_36:%.*]] = add nsw i32 [[VAL_36]], [[SUM_NEXT_35]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_36:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 37
-; CHECK-NEXT: [[ARRAYIDX_37:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_36]]
-; CHECK-NEXT: [[VAL_37:%.*]] = load i32, ptr [[ARRAYIDX_37]], align 4
-; CHECK-NEXT: [[SUM_NEXT_37:%.*]] = add nsw i32 [[VAL_37]], [[SUM_NEXT_36]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_37:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 38
-; CHECK-NEXT: [[ARRAYIDX_38:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_37]]
-; CHECK-NEXT: [[VAL_38:%.*]] = load i32, ptr [[ARRAYIDX_38]], align 4
-; CHECK-NEXT: [[SUM_NEXT_38:%.*]] = add nsw i32 [[VAL_38]], [[SUM_NEXT_37]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_38:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 39
-; CHECK-NEXT: [[ARRAYIDX_39:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_38]]
-; CHECK-NEXT: [[VAL_39:%.*]] = load i32, ptr [[ARRAYIDX_39]], align 4
-; CHECK-NEXT: [[SUM_NEXT_39:%.*]] = add nsw i32 [[VAL_39]], [[SUM_NEXT_38]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_39:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 40
-; CHECK-NEXT: [[ARRAYIDX_40:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_39]]
-; CHECK-NEXT: [[VAL_40:%.*]] = load i32, ptr [[ARRAYIDX_40]], align 4
-; CHECK-NEXT: [[SUM_NEXT_40:%.*]] = add nsw i32 [[VAL_40]], [[SUM_NEXT_39]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_40:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 41
-; CHECK-NEXT: [[ARRAYIDX_41:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_40]]
-; CHECK-NEXT: [[VAL_41:%.*]] = load i32, ptr [[ARRAYIDX_41]], align 4
-; CHECK-NEXT: [[SUM_NEXT_41:%.*]] = add nsw i32 [[VAL_41]], [[SUM_NEXT_40]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_41:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 42
-; CHECK-NEXT: [[ARRAYIDX_42:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_41]]
-; CHECK-NEXT: [[VAL_42:%.*]] = load i32, ptr [[ARRAYIDX_42]], align 4
-; CHECK-NEXT: [[SUM_NEXT_42:%.*]] = add nsw i32 [[VAL_42]], [[SUM_NEXT_41]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_42:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 43
-; CHECK-NEXT: [[ARRAYIDX_43:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_42]]
-; CHECK-NEXT: [[VAL_43:%.*]] = load i32, ptr [[ARRAYIDX_43]], align 4
-; CHECK-NEXT: [[SUM_NEXT_43:%.*]] = add nsw i32 [[VAL_43]], [[SUM_NEXT_42]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_43:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 44
-; CHECK-NEXT: [[ARRAYIDX_44:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_43]]
-; CHECK-NEXT: [[VAL_44:%.*]] = load i32, ptr [[ARRAYIDX_44]], align 4
-; CHECK-NEXT: [[SUM_NEXT_44:%.*]] = add nsw i32 [[VAL_44]], [[SUM_NEXT_43]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_44:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 45
-; CHECK-NEXT: [[ARRAYIDX_45:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_44]]
-; CHECK-NEXT: [[VAL_45:%.*]] = load i32, ptr [[ARRAYIDX_45]], align 4
-; CHECK-NEXT: [[SUM_NEXT_45:%.*]] = add nsw i32 [[VAL_45]], [[SUM_NEXT_44]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_45:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 46
-; CHECK-NEXT: [[ARRAYIDX_46:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_45]]
-; CHECK-NEXT: [[VAL_46:%.*]] = load i32, ptr [[ARRAYIDX_46]], align 4
-; CHECK-NEXT: [[SUM_NEXT_46:%.*]] = add nsw i32 [[VAL_46]], [[SUM_NEXT_45]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_46:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 47
-; CHECK-NEXT: [[ARRAYIDX_47:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_46]]
-; CHECK-NEXT: [[VAL_47:%.*]] = load i32, ptr [[ARRAYIDX_47]], align 4
-; CHECK-NEXT: [[SUM_NEXT_47:%.*]] = add nsw i32 [[VAL_47]], [[SUM_NEXT_46]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_47:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 48
-; CHECK-NEXT: [[ARRAYIDX_48:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_47]]
-; CHECK-NEXT: [[VAL_48:%.*]] = load i32, ptr [[ARRAYIDX_48]], align 4
-; CHECK-NEXT: [[SUM_NEXT_48:%.*]] = add nsw i32 [[VAL_48]], [[SUM_NEXT_47]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_48:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 49
-; CHECK-NEXT: [[ARRAYIDX_49:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_48]]
-; CHECK-NEXT: [[VAL_49:%.*]] = load i32, ptr [[ARRAYIDX_49]], align 4
-; CHECK-NEXT: [[SUM_NEXT_49:%.*]] = add nsw i32 [[VAL_49]], [[SUM_NEXT_48]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_49:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 50
-; CHECK-NEXT: [[ARRAYIDX_50:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_49]]
-; CHECK-NEXT: [[VAL_50:%.*]] = load i32, ptr [[ARRAYIDX_50]], align 4
-; CHECK-NEXT: [[SUM_NEXT_50:%.*]] = add nsw i32 [[VAL_50]], [[SUM_NEXT_49]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_50:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 51
-; CHECK-NEXT: [[ARRAYIDX_51:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_50]]
-; CHECK-NEXT: [[VAL_51:%.*]] = load i32, ptr [[ARRAYIDX_51]], align 4
-; CHECK-NEXT: [[SUM_NEXT_51:%.*]] = add nsw i32 [[VAL_51]], [[SUM_NEXT_50]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_51:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 52
-; CHECK-NEXT: [[ARRAYIDX_52:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_51]]
-; CHECK-NEXT: [[VAL_52:%.*]] = load i32, ptr [[ARRAYIDX_52]], align 4
-; CHECK-NEXT: [[SUM_NEXT_52:%.*]] = add nsw i32 [[VAL_52]], [[SUM_NEXT_51]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_52:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 53
-; CHECK-NEXT: [[ARRAYIDX_53:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_52]]
-; CHECK-NEXT: [[VAL_53:%.*]] = load i32, ptr [[ARRAYIDX_53]], align 4
-; CHECK-NEXT: [[SUM_NEXT_53:%.*]] = add nsw i32 [[VAL_53]], [[SUM_NEXT_52]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_53:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 54
-; CHECK-NEXT: [[ARRAYIDX_54:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_53]]
-; CHECK-NEXT: [[VAL_54:%.*]] = load i32, ptr [[ARRAYIDX_54]], align 4
-; CHECK-NEXT: [[SUM_NEXT_54:%.*]] = add nsw i32 [[VAL_54]], [[SUM_NEXT_53]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_54:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 55
-; CHECK-NEXT: [[ARRAYIDX_55:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_54]]
-; CHECK-NEXT: [[VAL_55:%.*]] = load i32, ptr [[ARRAYIDX_55]], align 4
-; CHECK-NEXT: [[SUM_NEXT_55:%.*]] = add nsw i32 [[VAL_55]], [[SUM_NEXT_54]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_55:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 56
-; CHECK-NEXT: [[ARRAYIDX_56:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_55]]
-; CHECK-NEXT: [[VAL_56:%.*]] = load i32, ptr [[ARRAYIDX_56]], align 4
-; CHECK-NEXT: [[SUM_NEXT_56:%.*]] = add nsw i32 [[VAL_56]], [[SUM_NEXT_55]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_56:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 57
-; CHECK-NEXT: [[ARRAYIDX_57:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_56]]
-; CHECK-NEXT: [[VAL_57:%.*]] = load i32, ptr [[ARRAYIDX_57]], align 4
-; CHECK-NEXT: [[SUM_NEXT_57:%.*]] = add nsw i32 [[VAL_57]], [[SUM_NEXT_56]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_57:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 58
-; CHECK-NEXT: [[ARRAYIDX_58:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_57]]
-; CHECK-NEXT: [[VAL_58:%.*]] = load i32, ptr [[ARRAYIDX_58]], align 4
-; CHECK-NEXT: [[SUM_NEXT_58:%.*]] = add nsw i32 [[VAL_58]], [[SUM_NEXT_57]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_58:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 59
-; CHECK-NEXT: [[ARRAYIDX_59:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_58]]
-; CHECK-NEXT: [[VAL_59:%.*]] = load i32, ptr [[ARRAYIDX_59]], align 4
-; CHECK-NEXT: [[SUM_NEXT_59:%.*]] = add nsw i32 [[VAL_59]], [[SUM_NEXT_58]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_59:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 60
-; CHECK-NEXT: [[ARRAYIDX_60:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_59]]
-; CHECK-NEXT: [[VAL_60:%.*]] = load i32, ptr [[ARRAYIDX_60]], align 4
-; CHECK-NEXT: [[SUM_NEXT_60:%.*]] = add nsw i32 [[VAL_60]], [[SUM_NEXT_59]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_60:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 61
-; CHECK-NEXT: [[ARRAYIDX_61:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_60]]
-; CHECK-NEXT: [[VAL_61:%.*]] = load i32, ptr [[ARRAYIDX_61]], align 4
-; CHECK-NEXT: [[SUM_NEXT_61:%.*]] = add nsw i32 [[VAL_61]], [[SUM_NEXT_60]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_61:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 62
-; CHECK-NEXT: [[ARRAYIDX_62:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_61]]
-; CHECK-NEXT: [[VAL_62:%.*]] = load i32, ptr [[ARRAYIDX_62]], align 4
-; CHECK-NEXT: [[SUM_NEXT_62:%.*]] = add nsw i32 [[VAL_62]], [[SUM_NEXT_61]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_62:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 63
-; CHECK-NEXT: [[ARRAYIDX_63:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_62]]
-; CHECK-NEXT: [[VAL_63:%.*]] = load i32, ptr [[ARRAYIDX_63]], align 4
-; CHECK-NEXT: [[SUM_NEXT_63:%.*]] = add nsw i32 [[VAL_63]], [[SUM_NEXT_62]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_63:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 64
-; CHECK-NEXT: [[ARRAYIDX_64:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_63]]
-; CHECK-NEXT: [[VAL_64:%.*]] = load i32, ptr [[ARRAYIDX_64]], align 4
-; CHECK-NEXT: [[SUM_NEXT_64:%.*]] = add nsw i32 [[VAL_64]], [[SUM_NEXT_63]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_64:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 65
-; CHECK-NEXT: [[ARRAYIDX_65:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_64]]
-; CHECK-NEXT: [[VAL_65:%.*]] = load i32, ptr [[ARRAYIDX_65]], align 4
-; CHECK-NEXT: [[SUM_NEXT_65:%.*]] = add nsw i32 [[VAL_65]], [[SUM_NEXT_64]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_65:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 66
-; CHECK-NEXT: [[ARRAYIDX_66:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_65]]
-; CHECK-NEXT: [[VAL_66:%.*]] = load i32, ptr [[ARRAYIDX_66]], align 4
-; CHECK-NEXT: [[SUM_NEXT_66:%.*]] = add nsw i32 [[VAL_66]], [[SUM_NEXT_65]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_66:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 67
-; CHECK-NEXT: [[ARRAYIDX_67:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_66]]
-; CHECK-NEXT: [[VAL_67:%.*]] = load i32, ptr [[ARRAYIDX_67]], align 4
-; CHECK-NEXT: [[SUM_NEXT_67:%.*]] = add nsw i32 [[VAL_67]], [[SUM_NEXT_66]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_67:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 68
-; CHECK-NEXT: [[ARRAYIDX_68:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_67]]
-; CHECK-NEXT: [[VAL_68:%.*]] = load i32, ptr [[ARRAYIDX_68]], align 4
-; CHECK-NEXT: [[SUM_NEXT_68:%.*]] = add nsw i32 [[VAL_68]], [[SUM_NEXT_67]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_68:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 69
-; CHECK-NEXT: [[ARRAYIDX_69:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_68]]
-; CHECK-NEXT: [[VAL_69:%.*]] = load i32, ptr [[ARRAYIDX_69]], align 4
-; CHECK-NEXT: [[SUM_NEXT_69:%.*]] = add nsw i32 [[VAL_69]], [[SUM_NEXT_68]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_69:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 70
-; CHECK-NEXT: [[ARRAYIDX_70:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_69]]
-; CHECK-NEXT: [[VAL_70:%.*]] = load i32, ptr [[ARRAYIDX_70]], align 4
-; CHECK-NEXT: [[SUM_NEXT_70:%.*]] = add nsw i32 [[VAL_70]], [[SUM_NEXT_69]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_70:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 71
-; CHECK-NEXT: [[ARRAYIDX_71:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_70]]
-; CHECK-NEXT: [[VAL_71:%.*]] = load i32, ptr [[ARRAYIDX_71]], align 4
-; CHECK-NEXT: [[SUM_NEXT_71:%.*]] = add nsw i32 [[VAL_71]], [[SUM_NEXT_70]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_71:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 72
-; CHECK-NEXT: [[ARRAYIDX_72:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_71]]
-; CHECK-NEXT: [[VAL_72:%.*]] = load i32, ptr [[ARRAYIDX_72]], align 4
-; CHECK-NEXT: [[SUM_NEXT_72:%.*]] = add nsw i32 [[VAL_72]], [[SUM_NEXT_71]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_72:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 73
-; CHECK-NEXT: [[ARRAYIDX_73:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_72]]
-; CHECK-NEXT: [[VAL_73:%.*]] = load i32, ptr [[ARRAYIDX_73]], align 4
-; CHECK-NEXT: [[SUM_NEXT_73:%.*]] = add nsw i32 [[VAL_73]], [[SUM_NEXT_72]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_73:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 74
-; CHECK-NEXT: [[ARRAYIDX_74:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_73]]
-; CHECK-NEXT: [[VAL_74:%.*]] = load i32, ptr [[ARRAYIDX_74]], align 4
-; CHECK-NEXT: [[SUM_NEXT_74:%.*]] = add nsw i32 [[VAL_74]], [[SUM_NEXT_73]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_74:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 75
-; CHECK-NEXT: [[ARRAYIDX_75:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_74]]
-; CHECK-NEXT: [[VAL_75:%.*]] = load i32, ptr [[ARRAYIDX_75]], align 4
-; CHECK-NEXT: [[SUM_NEXT_75:%.*]] = add nsw i32 [[VAL_75]], [[SUM_NEXT_74]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_75:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 76
-; CHECK-NEXT: [[ARRAYIDX_76:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_75]]
-; CHECK-NEXT: [[VAL_76:%.*]] = load i32, ptr [[ARRAYIDX_76]], align 4
-; CHECK-NEXT: [[SUM_NEXT_76:%.*]] = add nsw i32 [[VAL_76]], [[SUM_NEXT_75]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_76:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 77
-; CHECK-NEXT: [[ARRAYIDX_77:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_76]]
-; CHECK-NEXT: [[VAL_77:%.*]] = load i32, ptr [[ARRAYIDX_77]], align 4
-; CHECK-NEXT: [[SUM_NEXT_77:%.*]] = add nsw i32 [[VAL_77]], [[SUM_NEXT_76]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_77:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 78
-; CHECK-NEXT: [[ARRAYIDX_78:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_77]]
-; CHECK-NEXT: [[VAL_78:%.*]] = load i32, ptr [[ARRAYIDX_78]], align 4
-; CHECK-NEXT: [[SUM_NEXT_78:%.*]] = add nsw i32 [[VAL_78]], [[SUM_NEXT_77]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_78:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 79
-; CHECK-NEXT: [[ARRAYIDX_79:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_78]]
-; CHECK-NEXT: [[VAL_79:%.*]] = load i32, ptr [[ARRAYIDX_79]], align 4
-; CHECK-NEXT: [[SUM_NEXT_79:%.*]] = add nsw i32 [[VAL_79]], [[SUM_NEXT_78]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_79:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 80
-; CHECK-NEXT: [[ARRAYIDX_80:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_79]]
-; CHECK-NEXT: [[VAL_80:%.*]] = load i32, ptr [[ARRAYIDX_80]], align 4
-; CHECK-NEXT: [[SUM_NEXT_80:%.*]] = add nsw i32 [[VAL_80]], [[SUM_NEXT_79]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_80:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 81
-; CHECK-NEXT: [[ARRAYIDX_81:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_80]]
-; CHECK-NEXT: [[VAL_81:%.*]] = load i32, ptr [[ARRAYIDX_81]], align 4
-; CHECK-NEXT: [[SUM_NEXT_81:%.*]] = add nsw i32 [[VAL_81]], [[SUM_NEXT_80]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_81:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 82
-; CHECK-NEXT: [[ARRAYIDX_82:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_81]]
-; CHECK-NEXT: [[VAL_82:%.*]] = load i32, ptr [[ARRAYIDX_82]], align 4
-; CHECK-NEXT: [[SUM_NEXT_82:%.*]] = add nsw i32 [[VAL_82]], [[SUM_NEXT_81]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_82:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 83
-; CHECK-NEXT: [[ARRAYIDX_83:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_82]]
-; CHECK-NEXT: [[VAL_83:%.*]] = load i32, ptr [[ARRAYIDX_83]], align 4
-; CHECK-NEXT: [[SUM_NEXT_83:%.*]] = add nsw i32 [[VAL_83]], [[SUM_NEXT_82]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_83:%.*]] = add nuw nsw i6...
[truncated]
I've always struggled with the idea of LoopMicroOpBufferSize in general. It appears to be used just as a way of throttling unrolls, to avoid the potential cost of an unnecessary compare+branch. In which case, shouldn't we be trying to better estimate the cost of the loop control flow, based on branch distance etc.? That way we avoid estimation issues for CPUs without a dedicated loopback buffer etc.
I believe the primary purpose of LoopMicroOpBufferSize here is to make sure that we don't runtime-unroll past the loop buffer size: Doing so could mean that a (non-unrolled) loop that previously used the loop buffer may no longer do so after unrolling, which would be a clear pessimization. So I think it should primarily serve as an upper bound on runtime unrolling. However, in practice we take it as the desired target, i.e. try to unroll all loops that we can to the loop buffer size. I'm not sure this part really makes sense, at least for the kind of wide out-of-order cores we're talking about here (see also #42332 (comment)). Possibly the more correct fix here would be to set LoopMicroOpBufferSize=0 for these subtargets and disable runtime unrolling entirely. My current proposed fix is just a very conservative starting point.
Can you explain in more detail how we would estimate the cost/benefit of runtime unrolling? It's not really clear what we can do here beyond the fact that the compare and branch go away, as we generally don't expect runtime unrolling to result in any simplification beyond that.
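For concreteness, here is a minimal sketch (in C++ rather than LLVM IR, with invented function names) of what runtime unrolling by a factor of 4 does to the kind of reduction loop in the test above. The only per-iteration work it removes is a share of the induction-variable compare and branch; every load and add still executes:

```cpp
#include <cstddef>

// Original loop: one compare+branch per element.
int sum_simple(const int *ary, size_t n) {
  int sum = 0;
  for (size_t i = 0; i < n; ++i)
    sum += ary[i];
  return sum;
}

// After runtime unrolling by 4: one compare+branch per 4 elements,
// plus a remainder loop for trip counts not divisible by 4.
int sum_unrolled(const int *ary, size_t n) {
  int sum = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    sum += ary[i];
    sum += ary[i + 1];
    sum += ary[i + 2];
    sum += ary[i + 3];
  }
  for (; i < n; ++i) // epilogue for the remaining 0-3 elements
    sum += ary[i];
  return sum;
}
```

That matches the point above: beyond amortizing the loop's control flow, runtime unrolling is not expected to enable further simplification, which is what makes its cost/benefit hard to estimate more precisely.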
Not that this patch should address it, but on Intel x86, if the loop has a highly predictable iteration count in the ~[24, 150] range (24 = the loop starts using the LSD at about this iteration count, 150 = max depth of the branch predictor), using the loop micro-op buffer can be a pessimization. This is because the LSD exits via a branch miss, so a loop that would be perfectly predicted running from the DSB gets a branch miss added to it.
Any suggestions on how to move forward with this?
I think this is a good generic limit to put in place so the numbers are compatible with what the loop transforms expect. I'm still hoping we can come up with some solution on #91340 for znver3/4 specifically, but that's being hindered by a lack of test hardware.
…erSize (#67657)"

This reverts commit f0b3654. This commit triggers UB by reading an uninitialized variable. `UP.PartialThreshold` is used uninitialized in `getUnrollingPreferences()` when it is called from `LoopVectorizationPlanner::executePlan()`. In this case the `UP` variable is created on the stack and its fields are not initialized.

```
==8802==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x557c0b081b99 in llvm::BasicTTIImplBase<llvm::X86TTIImpl>::getUnrollingPreferences(llvm::Loop*, llvm::ScalarEvolution&, llvm::TargetTransformInfo::UnrollingPreferences&, llvm::OptimizationRemarkEmitter*) llvm-project/llvm/include/llvm/CodeGen/BasicTTIImpl.h
    #1 0x557c0b07a40c in llvm::TargetTransformInfo::Model<llvm::X86TTIImpl>::getUnrollingPreferences(llvm::Loop*, llvm::ScalarEvolution&, llvm::TargetTransformInfo::UnrollingPreferences&, llvm::OptimizationRemarkEmitter*) llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:2277:17
    #2 0x557c0f5d69ee in llvm::TargetTransformInfo::getUnrollingPreferences(llvm::Loop*, llvm::ScalarEvolution&, llvm::TargetTransformInfo::UnrollingPreferences&, llvm::OptimizationRemarkEmitter*) const llvm-project/llvm/lib/Analysis/TargetTransformInfo.cpp:387:19
    #3 0x557c0e6b96a0 in llvm::LoopVectorizationPlanner::executePlan(llvm::ElementCount, unsigned int, llvm::VPlan&, llvm::InnerLoopVectorizer&, llvm::DominatorTree*, bool, llvm::DenseMap<llvm::SCEV const*, llvm::Value*, llvm::DenseMapInfo<llvm::SCEV const*, void>, llvm::detail::DenseMapPair<llvm::SCEV const*, llvm::Value*>> const*) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:7624:7
    #4 0x557c0e6e4b63 in llvm::LoopVectorizePass::processLoop(llvm::Loop*) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:10253:13
    #5 0x557c0e6f2429 in llvm::LoopVectorizePass::runImpl(llvm::Function&, llvm::ScalarEvolution&, llvm::LoopInfo&, llvm::TargetTransformInfo&, llvm::DominatorTree&, llvm::BlockFrequencyInfo*, llvm::TargetLibraryInfo*, llvm::DemandedBits&, llvm::AssumptionCache&, llvm::LoopAccessInfoManager&, llvm::OptimizationRemarkEmitter&, llvm::ProfileSummaryInfo*) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:10344:30
    #6 0x557c0e6f2f97 in llvm::LoopVectorizePass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:10383:9
    [...]

Uninitialized value was created by an allocation of 'UP' in the stack frame
    #0 0x557c0e6b961e in llvm::LoopVectorizationPlanner::executePlan(llvm::ElementCount, unsigned int, llvm::VPlan&, llvm::InnerLoopVectorizer&, llvm::DominatorTree*, bool, llvm::DenseMap<llvm::SCEV const*, llvm::Value*, llvm::DenseMapInfo<llvm::SCEV const*, void>, llvm::detail::DenseMapPair<llvm::SCEV const*, llvm::Value*>> const*) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:7623:3
```
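For readers unfamiliar with this bug class, a minimal sketch of the pattern the MemorySanitizer report describes is below. This is not the actual LLVM code; the struct and function names are stand-ins, but the shape is the same: a plain-struct out-parameter created on the stack whose field is read before anything has written it.

```cpp
#include <algorithm>

// Stand-in for TargetTransformInfo::UnrollingPreferences: plain fields,
// no default member initializers, so a stack instance is indeterminate.
struct UnrollPrefs {
  unsigned PartialThreshold; // uninitialized unless the caller sets it
};

void getUnrollingPrefs(UnrollPrefs &UP, unsigned LoopMicroOpBufferSize) {
  // UB if UP.PartialThreshold was never written: this reads an
  // indeterminate value. It mirrors the std::min() added by the patch.
  unsigned MaxOps = std::min(LoopMicroOpBufferSize, UP.PartialThreshold);
  (void)MaxOps;
}

int main() {
  UnrollPrefs UP;             // fields left uninitialized, as in executePlan()
  getUnrollingPrefs(UP, 512); // MemorySanitizer flags the min() read
}
```

Value-initializing the struct (`UnrollPrefs UP{};`) or populating defaults before calling into the target hook would avoid the uninitialized read; this sketch does not claim which approach the reland used.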
The znver3/znver4 scheduler models are outliers, specifying very large LoopMicroOpBufferSizes at 512, while typical values for other subtargets are on the order of ~50. Even if this information is micro-architecturally correct (*), this does not mean that we want to runtime unroll all loops to a size that completely fills the loop buffer. Unless this is the single hot loop in the entire application, the massive code size increase will bust the micro-op and instruction caches.
Protect against this by clamping to the default PartialThreshold of 150, which is the same as the default full-unroll threshold and half the aggressive full-unroll threshold. Allowing more partial unrolling than full unrolling is certainly nonsensical.
(*) I strongly doubt that this is actually correct -- I believe this may derive from an incorrect reading of Agner Fog's micro-architecture guide. The number 4096 that was originally used here is the size of the general micro-op cache, not that of a loop buffer. A separate loop buffer is not listed for the Zen microarchitecture. Comparing this to the listing for Skylake, it has a 1536 micro-op buffer, but only a 64 micro-op loopback buffer, with a note that it's rarely fully utilized. Our scheduling model specifies LoopMicroOpBufferSize of 50 in that case.
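As a rough back-of-the-envelope check against the znver3.ll test changes above (an estimate, not LLVM's actual cost model): the test's loop body is about 4 micro-ops per iteration, and the unroll count is the micro-op budget divided by the body size, rounded down to a power of two. That reproduces the change from 128 unrolled iterations (suffix `_127`) to 32 (suffix `_31`):

```cpp
#include <cstdio>

// Budget / body size, rounded down to a power of two. Exact cost modeling
// in LLVM differs; this only explains the 128 -> 32 change in the test.
unsigned unrollCount(unsigned budgetUOps, unsigned bodyUOps) {
  unsigned count = budgetUOps / bodyUOps;
  unsigned pow2 = 1;
  while (pow2 * 2 <= count)
    pow2 *= 2;
  return pow2;
}

int main() {
  printf("before: %u\n", unrollCount(512, 4)); // LoopMicroOpBufferSize=512 -> 128
  printf("after:  %u\n", unrollCount(150, 4)); // clamped to 150 -> 32
}
```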