[RFC] Proper lowering of constant alloca operations #1060

Open

Lancern opened this issue Nov 5, 2024 · 13 comments
Labels
enhancement (New feature or request), IR design (Considerations around the design of ClangIR)

Comments

@Lancern
Member

Lancern commented Nov 5, 2024

This is related to #866.

Since #892, ClangIR lowers constant local variables in C/C++ to cir.alloca operations with a const flag. The presence of the const flag implies:

  • The variable must be initialized by a following cir.store operation, and
  • All cir.load operations that load the cir.alloca result must produce the value stored by that cir.store operation.

An obvious optimization here is to eliminate all such loads and replace the loaded values with the stored initial value. LLVM already implements similar optimizations, but we need to tweak the generated LLVM IR to teach LLVM to apply them. I'm proposing several approaches that could enable these optimizations in LLVM, and I hope we can choose the one that best fits our needs.
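To make the goal concrete, here is a minimal before/after sketch of the rewrite we would like to enable, using the CIR syntax from the examples below (%init stands for the incoming initializer value and @use is just a placeholder callee):

// before: a load from a const alloca
%0 = cir.alloca !s32i, !cir.ptr<!s32i>, ["var", init, const]
cir.store %init, %0 : !s32i, !cir.ptr<!s32i>
%1 = cir.load %0 : !cir.ptr<!s32i>, !s32i
cir.call @use(%1) : (!s32i) -> ()

// after: the loaded value is replaced with the stored initial value
%0 = cir.alloca !s32i, !cir.ptr<!s32i>, ["var", init, const]
cir.store %init, %0 : !s32i, !cir.ptr<!s32i>
cir.call @use(%init) : (!s32i) -> ()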

Approach 1: use the llvm.invariant.start intrinsic

The first approach uses the llvm.invariant.start and llvm.invariant.end intrinsics. This pair of intrinsics tells the optimizer that a specified memory location never changes within the region bounded by the intrinsics. With this approach, the following CIR:

cir.func @test(%init: !s32i) {
  %0 = cir.alloca !s32i, !cir.ptr<!s32i>, ["var", init, const]
  cir.store %init, %0 : !s32i, !cir.ptr<!s32i>

  // example uses of %0
  %1 = cir.load %0 : !cir.ptr<!s32i>, !s32i
  cir.call @clobber(%0) : (!cir.ptr<!s32i>) -> ()
  %2 = cir.load %0 : !cir.ptr<!s32i>, !s32i

  cir.return
}

would generate the following LLVM IR:

define dso_local void @test(i32 %init) {
  %1 = alloca i32, align 4
  store i32 %init, ptr %1, align 4
  %inv = call ptr @llvm.invariant.start(i64 4, ptr %1)

  ; example uses of %1
  %2 = load i32, ptr %1, align 4
  call void @clobber(ptr %1)
  %3 = load i32, ptr %1, align 4

  call void @llvm.invariant.end(ptr %inv, i64 4, ptr %1)
  ret void
}

In theory, the optimizer should at least be able to fold %3 into %2. Unfortunately, it refuses to optimize when the llvm.invariant.end intrinsic call is present, see https://godbolt.org/z/5dMv7T77e. To bypass this limitation, simply remove the call to llvm.invariant.end, and the optimizer works as expected.
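For reference, here is roughly the shape of the result once the llvm.invariant.end call is dropped (a sketch, not the exact output from the godbolt link): GVN can rewrite %3 to reuse %2 because the memory is known to be invariant across the call to @clobber.

define dso_local void @test(i32 %init) {
  %1 = alloca i32, align 4
  store i32 %init, ptr %1, align 4
  %inv = call ptr @llvm.invariant.start(i64 4, ptr %1)
  %2 = load i32, ptr %1, align 4
  call void @clobber(ptr %1)
  ; the second load is gone; any uses of %3 are rewritten to use %2
  ret void
}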

Approach 2: use the !invariant.load metadata

A load instruction can have !invariant.load metadata attached. The LLVM language reference says:

If a load instruction tagged with the !invariant.load metadata is executed, the memory location referenced by the load has to contain the same value at all points in the program where the memory location is dereferenceable; otherwise, the behavior is undefined.

With this approach, the CIR snippet listed earlier would emit the following LLVM IR:

define dso_local void @test(i32 %init) {
  %1 = alloca i32, align 4
  store i32 %init, ptr %1, align 4

  ; example uses of %1
  %2 = load i32, ptr %1, align 4, !invariant.load !0
  call void @clobber(ptr %1)
  %3 = load i32, ptr %1, align 4, !invariant.load !0

  ret void
}

!0 = !{}

The optimizer could then fold both load instructions to just %init, see https://godbolt.org/z/Exnh85zhx.
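For illustration, the end state under this approach looks roughly like the following (again a sketch rather than the exact godbolt output): both loads are rewritten to %init and then deleted as dead code, while the alloca and store survive because %1 still escapes into @clobber.

define dso_local void @test(i32 %init) {
  %1 = alloca i32, align 4
  store i32 %init, ptr %1, align 4
  call void @clobber(ptr %1)
  ret void
}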

It's worth mentioning here that the !invariant.load metadata is already supported by the MLIR LLVMIR dialect.

Approach 3: use the !invariant.group metadata

A load or store instruction can have !invariant.group metadata attached. Unlike !invariant.load, !invariant.group only requires that every load and store carrying the metadata reads or writes the same value as long as they operate on the same pointer. With this approach, the CIR snippet listed earlier would emit the following LLVM IR:

define dso_local void @test(i32 %init) {
  %1 = alloca i32, align 4
  store i32 %init, ptr %1, align 4, !invariant.group !0

  ; example uses of %1
  %2 = load i32, ptr %1, align 4, !invariant.group !0
  call void @clobber(ptr %1)
  %3 = load i32, ptr %1, align 4, !invariant.group !0

  ret void
}

!0 = !{}

The optimizer could then fold both load instructions to just %init, see https://godbolt.org/z/8MsxcoqTY.

Constant local variables in inner scopes

Let's consider a slightly more complex example:

void test(std::vector<int> vec) {
  for (const int item : vec)
    do_something(item);
}

Upon each iteration, the local variable item reuses the same memory location. But ideally we would like to still teach LLVM that item is constant during a single iteration. The second approach is infeasible since the value in the memory location changes between iterations. Thus only the first approach and the third approach are suitable for such a case.

The first approach would emit code like this:

define dso_local void @test() {
  %item = alloca i32, align 4
  ; ...
loop.body:
  store i32 %loop.ind, ptr %item, align 4
  %inv = call ptr @llvm.invariant.start(i64 4, ptr %item)
  
  ; loop body goes here, an example load instruction below
  %1 = load i32, ptr %item, align 4
  
  call void @llvm.invariant.end(ptr %inv, i64 4, ptr %item)
  br label %loop.header
}

The third approach would emit code like this:

define dso_local void @test() {
  %item = alloca i32, align 4
  ; ...
loop.body:
  %item.0 = phi ptr [ %item, %0 ], [ %item.launder, %loop.body ]
  store i32 %loop.ind, ptr %item.0, align 4, !invariant.group !0
  
  ; loop body goes here, an example load instruction below
  %1 = load i32, ptr %item.0, align 4, !invariant.group !0
  
  %item.launder = call ptr @llvm.launder.invariant.group(ptr %item.0)
  br label %loop.body
}

!0 = !{}

The call to the llvm.launder.invariant.group intrinsic makes sure that each iteration creates a "distinct invariant group". Without this intrinsic call, the optimizer could assume that the load and store instructions would load and store the same value across all iterations of the loop.
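To make the hazard concrete, imagine the loop body unrolled by two without any laundering, so that all four tagged accesses share one invariant group on the very same pointer %item (%ind.0 and %ind.1 are placeholder names for the induction values of two consecutive iterations). The optimizer would then be allowed to fold the second load into the first one even though a different value was stored in between, which is exactly the miscompile we need to avoid:

  store i32 %ind.0, ptr %item, align 4, !invariant.group !0
  %a = load i32, ptr %item, align 4, !invariant.group !0
  store i32 %ind.1, ptr %item, align 4, !invariant.group !0
  ; this load may be folded to %a, yielding the stale value %ind.0
  %b = load i32, ptr %item, align 4, !invariant.group !0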

So what do you think about these 3 lowering approaches? Or do you know any other approaches that this proposal does not mention?

@bcardosolopes
Member

Very nice write-up, thanks for considering different options.

But ideally we would like to still teach LLVM that item is constant during a single iteration.

Neat!

So what do you think about these 3 lowering approaches? Or do you know any other approaches that this proposal does not mention?

Option (3) looks more appealing to me since (1) has the weird end behavior. Do you know if there's a trend in LLVM to pursue one direction more than the other? That could be a good deciding criterion, but both look good to me.

This is a good use of a CIR abstraction that allows us to give extra information to LLVM, cool to see work in this direction!

On a tangential subject, should this be done in -O1? Or do we want to emit this even in -O0?

@Lancern
Member Author

Lancern commented Nov 6, 2024

Do you know if there's a trend in LLVM to pursue one direction more than the other? That could be a good deciding criterion, but both look good to me.

This is a good point. Actually, the metadata and intrinsics involved in approach 3 are still experimental, while the llvm.invariant.start intrinsics have been around for years. Considering that the first approach still has such a strange behavior regarding llvm.invariant.end after years of development, I would prefer approach 3 too.

On a tangential subject, should this be done in -O1? Or do we want to emit this even in -O0?

Since this is technically an optimization, I think we should only emit this stuff at -O1 or higher.

@bcardosolopes added the enhancement (New feature or request) and IR design (Considerations around the design of ClangIR) labels Nov 6, 2024
@bcardosolopes
Member

Great, happy to see PRs in either direction. Should we add a -cc1 flag to control this? It could be enabled by default but a no version would be cool (this would be handy for later experiments to see how much perf % we get out of it).

Considering that the first approach still has such a strange behavior regarding llvm.invariant.end after years of development

One more idea: ask on LLVM discourse if folks know why this happens?

@Lancern
Member Author

Lancern commented Nov 7, 2024

Should we add a -cc1 flag to control this?

Sounds good to me!

One more idea: ask on LLVM discourse if folks know why this happens?

Great idea. Opened a thread here: https://discourse.llvm.org/t/why-llvm-invariant-end-intrinsic-would-prevent-optimizations/82984

@Lancern
Member Author

Lancern commented Nov 7, 2024

Great idea. Opened a thread here: https://discourse.llvm.org/t/why-llvm-invariant-end-intrinsic-would-prevent-optimizations/82984

nikic has responded:

In practice, LLVM only supports open-ended invariant.start. We should really remove the invariant.end intrinsic to not confuse people.

The invariant.group metadata (and related intrinsics) are the modern way to represent “temporary” invariant-ness, because they can be more efficiently processed.

So I believe !invariant.group and its related intrinsics (i.e. the 3rd approach in this post) are what we should stick to.

@bcardosolopes
Member

@Lancern way to go! thanks for the extra research

@orbiri
Collaborator

orbiri commented Nov 8, 2024

Hey, jumping in since we've been working a lot on memory optimizations downstream.

Firstly, I understand that the reason Mem2Reg and SROA don't act here is that the local variable is escaping, so it is indeed a cool optimization to use the const information in CIR 🙌

Having said that, is there a reason to avoid applying this super small optimization in CIR? From experience, using complex LLVM metadata always turns out to be a long and sometimes painful path. The example above looks almost trivial - RAUW of load operations. I admit that I haven't thought thoroughly about it, so maybe you can give an example where it is not that trivial? :)

In addition, it seems like it could be super cool to inline already at the CIR level and use the const values as part of the cost model when deciding whether or not inlining is worthwhile! But I guess this is something for a bit further in the future.

@bcardosolopes
Member

@orbiri those are good points, thanks for jumping in. My take is that we should aim at all angles because they are orthogonal, i.e. if we can leverage LLVM opts by providing extra info, we should do it. This shouldn't preclude work on CIR specific opts though.

@Lancern
Member Author

Lancern commented Dec 9, 2024

#1159 started emitting !invariant.group metadata for loads and stores to function-level allocas, as shown in Approach 3. This leaves us with local constants in scopes nested inside the function-level scope, as shown in the "Constant local variables in inner scopes" section. #1159 temporarily removes the const flag on these allocas when hoisting them so we don't emit wrong code for now.

@Lancern
Member Author

Lancern commented Dec 11, 2024

Here's a short draft of how I plan to deal with nested constant allocas. Let's reconsider the C++ code listed in the top message:

void test(std::vector<int> vec) {
  for (const int item : vec)
    do_something(item);
}

Now I'm planning to lower it to the following LLVM IR:

define dso_local void @test() {
  %item = alloca i32, align 4
  ; ...
loop.body:
  %item.launder = call ptr @llvm.launder.invariant.group(ptr %item)
  store i32 %loop.ind, ptr %item.launder, align 4, !invariant.group !0
  
  ; loop body goes here, an example load instruction below
  %1 = load i32, ptr %item.launder, align 4, !invariant.group !0
  
  br label %loop.body
}

!0 = !{}

Note the difference in where we put the call to @llvm.launder.invariant.group. I found the new code is much easier to codegen and may produce shorter LLVM code. During alloca hoisting, we could simply replace nested constant allocas with an intrinsic call to @llvm.launder.invariant.group, and the job is done.

This leaves us one final corner case. Consider the following code:

while (const int x = condition()) {
  // ...
}

Currently the alloca for x is emitted outside of the while-loop, even though x is re-created and destroyed upon each iteration of the loop. I haven't come up with a solution to resolve this, and to make the majority of cases work I'm planning to hack the CIRGen code for x and temporarily remove the const flag there until we work out a solution.

@bcardosolopes
Member

bcardosolopes commented Dec 12, 2024

I found the new code is much easier to codegen and may produce shorter LLVM code. During alloca hoisting, we could simply replace nested constant allocas with an intrinsic call to @llvm.launder.invariant.group, and the job is done.

Super neat, yay! Will that remove the workaround for good?

Currently the alloca for x is emitted outside of the while-loop

But we have a surrounding scope, right? Would it help in any way if we start tagging scopes with their loop type? Meaning, scope if, scope for, etc?

and to make the majority of cases work I'm planning to hack the CIRGen code for x and temporarily remove the const flag there until we work out a solution.

I'd prefer if CIRGen isn't touched much; it should be as fast as possible, and you might be removing more consts than you'd desire (assuming you use getParentOfType)? Can we instead keep some of this logic in the current workaround? Meaning, keep applying the workaround for anything we don't know for sure we can keep?

@bcardosolopes
Member

Btw, does the problem with while go away with fixes for #1218?

@Lancern
Member Author

Lancern commented Dec 15, 2024

Btw, does the problem with while go away with fixes for #1218?

No. #1218 moves variables in loop bodies into a scope within the loop body, but that is not suitable for condition variables. A condition variable must be placed in a scope that encloses both the while condition and the while body. I don't see an easy way to fix this.
