Why is the code size of catch_unwind so large ? #64224

gnzlbg · 2019-09-06T14:54:10Z

While filing #64222 I noticed that we generate more code than C++ for catch_unwind. That did not feel right, since C++'s catch can do much more than Rust's catch_unwind, e.g., filtering different types of exceptions, etc.

MWE: C++ (https://gcc.godbolt.org/z/z_dgPg):

extern "C" void foo();

int bar() {
    try {
        foo();
        return 42;
    } catch(...) {
        return 13;
    }
}

generates

bar(): # @bar()
  push rbx
  mov ebx, 42
  call foo
  mov eax, ebx
  pop rbx
  ret
  mov rdi, rax
  call __cxa_begin_catch
  call __cxa_end_catch
  mov ebx, 13
  mov eax, ebx
  pop rbx
  ret

while Rust (https://gcc.godbolt.org/z/4sbc6k):

#![feature(unwind_attributes)]

extern "C" {
    // can unwind:
    #[unwind(allow)] fn foo(); 
}

pub unsafe fn bar() -> i32 {
    std::panic::catch_unwind(|| { foo(); 42 }).unwrap_or(13)
}

generates

example::bar:
  push rbp
  push r14
  push rbx
  sub rsp, 32
  mov qword ptr [rsp + 16], 0
  mov qword ptr [rsp + 24], 0
  lea rdi, [rip + std::panicking::try::do_call]
  lea rsi, [rsp + 12]
  lea rdx, [rsp + 16]
  lea rcx, [rsp + 24]
  call qword ptr [rip + __rust_maybe_catch_panic@GOTPCREL]
  test eax, eax
  je .LBB2_1
  mov rdi, -1
  call qword ptr [rip + std::panicking::update_panic_count@GOTPCREL]
  mov r14, qword ptr [rsp + 16]
  mov rbx, qword ptr [rsp + 24]
  mov rdi, r14
  call qword ptr [rbx]
  mov rsi, qword ptr [rbx + 8]
  mov ebp, 13
  test rsi, rsi
  je .LBB2_5
  mov rdx, qword ptr [rbx + 16]
  mov rdi, r14
  call qword ptr [rip + __rust_dealloc@GOTPCREL]
  jmp .LBB2_5
.LBB2_1:
  mov ebp, dword ptr [rsp + 12]
.LBB2_5:
  mov eax, ebp
  add rsp, 32
  pop rbx
  pop r14
  pop rbp
  ret
  mov rbp, rax
  mov rdi, r14
  mov rsi, rbx
  call alloc::alloc::box_free
  mov rdi, rbp
  call _Unwind_Resume@PLT
  ud2

This appears to be a constant overhead every time catch_unwind is used (e.g. see https://gcc.godbolt.org/z/bAvN24). Maybe we are inlining too much ?

Mark-Simulacrum · 2019-12-22T00:49:38Z

I think the first example with try/catch is a bit misleading. std::panic::catch_unwind essentially implements something like the try_run function here. This generates assembly that is relatively similar to the Rust assembly.

One notable difference between try/catch and std::panic::catch_unwind is that the first provides an option to not actually capture the exception (like your code illustrates). That is something that Rust today cannot do -- which is quite unfortunate, as that is a nonzero codesize burden due to needing to deallocate the Box. If we forget it explicitly instead, then we get somewhat nicer assembly: https://gcc.godbolt.org/z/vs6URt.
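To make the "forget it explicitly" variant concrete, here is a minimal sketch on stable Rust. The helper name `run_or_leak` and the `fallback` parameter are made up for illustration and do not come from the linked godbolt example. The payload is leaked on purpose, so this only demonstrates the trade-off and is not a recommendation:

```rust
use std::panic::{self, UnwindSafe};

/// Runs `f`; if it panics, leaks the panic payload and returns `fallback`.
/// `run_or_leak` is a hypothetical helper, not a std API.
pub fn run_or_leak<F>(f: F, fallback: i32) -> i32
where
    F: FnOnce() -> i32 + UnwindSafe,
{
    match panic::catch_unwind(f) {
        Ok(value) => value,
        Err(payload) => {
            // Deliberate leak: the destructor of the Box<dyn Any + Send>
            // never runs, so no drop/dealloc code is needed on this path.
            std::mem::forget(payload);
            fallback
        }
    }
}
```

Calling `run_or_leak(|| 42, 13)` yields 42, while a closure that panics yields 13 and leaks the boxed payload.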

I'm not sure it matters too much in practice as I'd guess that catch_unwind which ignores the exception itself is somewhat rare, though it might be worth trying to figure out a way that we can thread that information through (it would probably need a second catch_unwind function -- otherwise, I don't think we can thread the information through into the function in any way, just due to its return type).

Separately, C++ also can get away without the separate do_call function/symbol that Rust currently uses. This is because we currently thread through a function pointer to the lambda, whereas C++ gets away without this overhead due to being able to jump directly to a separate basic block from its equivalent of a try intrinsic. I think this would be quite hard to replicate in Rust; we need essentially something like a naked function or so (but that's not quite right either), and perhaps to inform LLVM that said function will always be called, i.e., it can be generated directly into the instruction stream, vs. a call instruction to jump to it. I think it is not implausible that we could better express this if __rust_maybe_catch_panic was not a library-implemented function, but rather an intrinsic.
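For readers unfamiliar with the pattern, the function-pointer threading described above looks roughly like the sketch below. It is an illustration only: `Slot`, `call_erased`, `opaque_try` and `run_indirect` are hypothetical names, and `opaque_try` merely stands in for an entry point the optimizer cannot see through. It does not catch anything.

```rust
use std::mem::ManuallyDrop;

/// Holds the closure before the call and its result afterwards.
union Slot<F, R> {
    f: ManuallyDrop<F>,
    r: ManuallyDrop<R>,
}

/// Type-erased trampoline: recovers the closure from `data`, runs it,
/// and stores the result back into the same slot.
fn call_erased<F: FnOnce() -> R, R>(data: *mut u8) {
    unsafe {
        let slot = &mut *(data as *mut Slot<F, R>);
        let f = ManuallyDrop::take(&mut slot.f);
        slot.r = ManuallyDrop::new(f());
    }
}

/// Stand-in for an opaque runtime entry point (an assumption for this
/// sketch). It only sees a function pointer and a data pointer, so the
/// closure body cannot be inlined into the caller through it.
#[inline(never)]
fn opaque_try(call: fn(*mut u8), data: *mut u8) {
    call(data)
}

/// Runs `f` through the indirection and returns its result.
pub fn run_indirect<F: FnOnce() -> R, R>(f: F) -> R {
    let mut slot: Slot<F, R> = Slot { f: ManuallyDrop::new(f) };
    opaque_try(call_erased::<F, R>, &mut slot as *mut Slot<F, R> as *mut u8);
    // `call_erased` has overwritten the slot with the result by now.
    unsafe { ManuallyDrop::into_inner(slot.r) }
}
```

Every closure type gets its own `call_erased` instantiation, which is the extra function/symbol mentioned above.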

I'm not sure it's worth optimizing this, though, as catch_unwind is much more rarely called than try/catch in C++, I expect. I have opened a PR (#67502) that slightly optimizes the ABI of __rust_maybe_catch_panic which reduces the stack space utilization and simplifies the code a little; this likely has essentially no runtime impact though.

gnzlbg · 2019-12-23T09:10:31Z

> I'm not sure it matters too much in practice as I'd guess that catch_unwind which ignores the exception itself is somewhat rare

Good point. Would there be a way to insert the forget automatically if the result is not used?

> and perhaps to inform LLVM that said function will always be called

When would this function not be unconditionally called with the current codegen ? (i.e. is this an LLVM bug, in which LLVM is not recognizing that this function is unconditionally called?).

Mark-Simulacrum · 2019-12-23T11:29:19Z

Well, the forget is a memory leak, so I doubt you'd want to auto insert it.

Unfortunately I don't think we can really blame LLVM here - we're indirecting through an extern "C" function so LLVM basically can't inline anything here (which would be needed, I imagine, to see that the function is always called). Furthermore, it's not even the first thing to be called in the non-aborting case.

We do likely have a way out - move the whole catch function to an intrinsic. Then we'd likely get much better results, particularly with -Cpanic=abort, and it wouldn't even be all that hard I imagine to do this.

I continue to be unsure that it's worth it. I do recall that Servo mentioned that catch_unwind used to be a performance problem for them; I'm not sure if it still is. Maybe @Manishearth knows, or can find out?

Manishearth · 2019-12-23T21:03:29Z

I don't really recall much about that.

Mark-Simulacrum · 2019-12-23T22:23:17Z

Hm, maybe @jdm can say more here (they filed #34727 way back). We got catch_unwind down to a "pretty fast" function there, though.

#64222 is basically the same as this issue: we can't optimize out the unwind code path in the "does not unwind" case because we don't tell Rust that that's the case.

jdm · 2019-12-24T00:41:01Z

I have no data showing that catch_unwind is currently a performance problem, if that's the question being asked.

gnzlbg · 2019-12-24T08:28:09Z

The problem being reported here is that Rust's catch_unwind generates 3x as much code as a C++ try/catch in one particular case at least - for the data, see the first post of this issue. Even for cases like this one suggested by @Mark-Simulacrum above, we still generate 2x as much code as C++.

Amanieu · 2019-12-24T18:13:05Z

Could we get better codegen if we let __rust_maybe_catch_panic get inlined? LLVM should then be able to move most of the work into the catch path and improve the performance of the fast path. I'm not sure if this will help with code size though.

Mark-Simulacrum · 2019-12-24T18:44:17Z

__rust_maybe_catch_panic is what dispatches between panic=abort and panic=unwind currently; inlining it is possible but not entirely trivial. If we care about code size here, I have a number of ideas I can explore, but I would like to avoid spending the time unless we can find someone who has at least some level of "I care" about this :)

Amanieu · 2019-12-25T20:22:58Z

Here's what I got by inlining __rust_maybe_catch_panic: https://gcc.godbolt.org/z/AS2i4a

example::bar:
  push r14
  push rbx
  sub rsp, 24
  mov ebx, 42
  call qword ptr [rip + foo@GOTPCREL]
.LBB2_6:
  mov eax, ebx
  add rsp, 24
  pop rbx
  pop r14
  ret
  mov rbx, qword ptr [rax + 256]
  mov r14, qword ptr [rax + 264]
  mov qword ptr [rax + 256], 0
  mov rdi, rax
  call qword ptr [rip + _Unwind_DeleteException@GOTPCREL]
  mov qword ptr [rsp + 8], rbx
  mov qword ptr [rsp + 16], r14
  test rbx, rbx
  je .LBB2_2
  mov edi, -1
  call qword ptr [rip + update_panic_count@GOTPCREL]
  mov ebx, 13
  jmp .LBB2_6
.LBB2_2:
  lea rdi, [rip + .L__unnamed_1]
  lea rdx, [rip + .L__unnamed_2]
  mov esi, 43
  call qword ptr [rip + core::panicking::panic@GOTPCREL]
  ud2
  mov rbx, rax
  lea rdi, [rsp + 8]
  call core::ptr::real_drop_in_place
  mov rdi, rbx
  call _Unwind_Resume@PLT
  ud2

You'll note that the fast path is now almost as efficient as the C++ one (we're just allocating more stack space), and I'm sure we can reduce the slow path by forcing it out to a separate function.

I think we should make catch_unwind independent of the panic runtime and simply always catch unwinding. With inlining LLVM will be able to eliminate the catch path when compiling with panic=abort.

Mark-Simulacrum · 2019-12-26T16:05:19Z

The inlining (and some further work) worked out to produce a PR that @Amanieu and I independently came up with, though we've decided to utilize my previous PR as a base (#67502). That PR is marked as resolving this issue, as well as #64222. Interested parties can take a look at the implementation there which we're still iterating on.

gnzlbg · 2019-12-27T10:22:56Z

@Mark-Simulacrum that sounds great.

I'm not sure how this issue gets resolved, could you post the codegen of the OP example w/o the patch ? Does that PR produce code that's sufficiently close to what C++ produces ?

Mark-Simulacrum · 2019-12-27T11:49:06Z

The PR description had a godbolt link and includes two sets of assembly with the same code as from the godbolt link.

gnzlbg · 2019-12-28T13:02:47Z

So how does it do for this snippet?

#![feature(unwind_attributes)]

extern "C" {
    // can unwind:
    #[unwind(allow)] fn foo(); 
}

pub unsafe fn bar() -> i32 {
    std::panic::catch_unwind(|| { foo(); 42 }).unwrap_or(13)
}

?

Mark-Simulacrum · 2019-12-28T13:15:13Z

I don't really have the time to try out lots of snippets; I expect that to be on par, but slightly worse -- the lack of a mem::forget will mean that you have a destructor for the Box<Any> which we can't avoid without adding an alternative to catch_unwind (e.g., catch_unwind_no_capture, with a better name)

gnzlbg · 2019-12-28T13:32:31Z

I'm just trying to understand how that PR closes this issue, which is not straightforward from the contents of that PR (I understand how it closes the other issue though).

Mark-Simulacrum · 2019-12-28T14:01:47Z

That PR makes the code size as small as we can, to my knowledge; I've kicked off a try build so that it's easier to test that PR - if there are still suboptimal examples I'd be happy to look at fixing them.

I personally suspect that the remainder of the suboptimal examples cannot be fixed without a new, separate function which takes a separate closure or does not capture the exception at all - one that at least conceptually could go in libcore (aside from being platform specific implementation wise).

Amanieu · 2019-12-28T14:05:07Z

@gnzlbg Here's the asm output with that PR:

_ZN4test3bar17h0bd71e8ae5322970E:
	push	r15
	push	r14
	push	rbx
	mov	ebx, 42
	call	qword ptr [rip + foo@GOTPCREL]
.LBB1_4:
	mov	eax, ebx
	pop	rbx
	pop	r14
	pop	r15
	ret
.LBB1_1:
	mov	rdi, rax
	call	qword ptr [rip + _ZN3std9panicking3try7cleanup17hee23cc19e10d1537E@GOTPCREL]
	mov	r14, rax
	mov	r15, rdx
	mov	rdi, rax
	call	qword ptr [rdx]
	mov	rsi, qword ptr [r15 + 8]
	mov	ebx, 13
	test	rsi, rsi
	je	.LBB1_4
	mov	rdx, qword ptr [r15 + 16]
	mov	rdi, r14
	call	qword ptr [rip + __rust_dealloc@GOTPCREL]
	jmp	.LBB1_4
.LBB1_5:
	mov	rbx, rax
	mov	rdi, r14
	mov	rsi, r15
	call	_ZN5alloc5alloc8box_free17h849e57dccc1d906aE
	mov	rdi, rbx
	call	_Unwind_Resume@PLT
	ud2

Notes:

LBB1_1 is invoked if callq *foo@GOTPCREL(%rip) unwinds.
LBB1_5 is invoked if callq *(%rdx) unwinds.

Amanieu · 2019-12-28T15:47:57Z

We're still not reaching code-size parity with C++ for 2 reasons:

The drop_in_place call for the Box<dyn Any + Send> has been inlined into the function. Ideally LLVM would realize that this is a cold path and avoid inlining to reduce code size. I don't think there is much we can do about this here.
The drop code for Box<dyn Any + Send> calls the actual destructor through a function pointer in the trait. Since drops are allowed to unwind, we need to handle this. Note that this is not a double-panic since by this point we are outside the catch_panic call.
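The second point can be reproduced on stable Rust. In the sketch below, `Bomb` and `payload_drop_unwinds` are made-up names: the panic payload's own destructor panics, so dropping the result of the inner catch_unwind unwinds again, and an outer catch_unwind observes that as an ordinary panic rather than an abort.

```rust
use std::panic;

/// Hypothetical payload type whose destructor panics.
struct Bomb;

impl Drop for Bomb {
    fn drop(&mut self) {
        panic!("payload destructor panicked");
    }
}

/// Returns true if dropping a caught panic payload unwound.
pub fn payload_drop_unwinds() -> bool {
    let outer = panic::catch_unwind(|| {
        // The inner catch_unwind returns Err(Box<Bomb>) without dropping it.
        let inner = panic::catch_unwind(|| {
            panic::panic_any(Bomb);
        });
        // Dropping the payload runs Bomb::drop, which panics. No panic is
        // in flight at this point, so this is a fresh unwind.
        drop(inner);
    });
    outer.is_err()
}
```

This is why the compiler has to emit a landing pad around the indirect destructor call in the assembly above.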

gnzlbg · 2020-01-02T15:10:25Z

> The drop_in_place call for the Box<dyn Any + Send> has been inlined into the function. Ideally LLVM would realize that this is a cold path and avoid inlining to reduce code size. I don't think there is much we can do about this here.

Could we maybe wrap this in a #[cold] function here and call that instead?

> The drop code for Box<dyn Any + Send> calls the actual destructor through a function pointer in the trait. Since drops are allowed to unwind, we need to handle this. Note that this is not a double-panic since by this point we are outside the catch_panic call.

This makes sense, I don't think there is a way to handle this any better either.

Mark-Simulacrum · 2020-01-02T15:15:56Z

The drop isn't controlled by us (occurs in the unwrap_or); we've already marked the unwind path as cold (and I even tried a "likely" intrinsic, but that didn't help either).

Amanieu · 2020-01-02T15:32:04Z

With a separate drop_box function:

#![feature(unwind_attributes)]

extern "C" {
    #[unwind(allow)]
    fn foo();
}

#[cold]
fn drop_box(b: Box<dyn std::any::Any + Send>) {
    drop(b);
}

pub unsafe fn bar() -> i32 {
    std::panic::catch_unwind(|| {
        foo();
        42
    })
    .unwrap_or_else(|e| {
        drop_box(e);
        13
    })
}

_ZN4test8drop_box17hee2c4bc13b7e934bE:
	push	r15
	push	r14
	push	rbx
	mov	rbx, rsi
	mov	r14, rdi
	call	qword ptr [rsi]
	mov	rsi, qword ptr [rbx + 8]
	test	rsi, rsi
	je	.LBB1_4
	mov	rdx, qword ptr [rbx + 16]
	mov	rdi, r14
	pop	rbx
	pop	r14
	pop	r15
	jmp	qword ptr [rip + __rust_dealloc@GOTPCREL]
.LBB1_4:
	pop	rbx
	pop	r14
	pop	r15
	ret
.LBB1_3:
	mov	r15, rax
	mov	rdi, r14
	mov	rsi, rbx
	call	_ZN5alloc5alloc8box_free17h849e57dccc1d906aE
	mov	rdi, r15
	call	_Unwind_Resume@PLT
	ud2

_ZN4test3bar17h0bd71e8ae5322970E:
	push	rbx
	mov	ebx, 42
	call	qword ptr [rip + foo@GOTPCREL]
.LBB2_2:
	mov	eax, ebx
	pop	rbx
	ret
.LBB2_1:
	mov	rdi, rax
	call	qword ptr [rip + _ZN3std9panicking3try7cleanup17hee23cc19e10d1537E@GOTPCREL]
	mov	rdi, rax
	mov	rsi, rdx
	call	_ZN4test8drop_box17hee2c4bc13b7e934bE
	mov	ebx, 13
	jmp	.LBB2_2

bar now has 13 instructions, just like the C++ version.

gnzlbg · 2020-01-02T16:19:26Z

Thanks @Amanieu, that's exactly what I had in mind. I don't see a simple way of doing this by default without making catch_unwind an intrinsic like @Mark-Simulacrum mentioned above :/

Maaaybe we could provide a specialized impl in liballoc of Drop for these particular boxes.. :

default impl<T> Drop for Box<T> { ... }
impl Drop for Box<dyn Any + Send> {
    #[cold] fn drop(&mut self) { ... }
}

but I'm not sure whether that will have the desired impact, and also whether that would be worth doing even if it did. It would not only impact catch_unwind but also all these Boxes which are used everywhere, e.g., through std::thread::Result<T> and possibly others.

An intrinsic for catch_unwind sounds like a better path forward.

Mark-Simulacrum · 2020-01-02T16:22:29Z

An intrinsic for catch_unwind wouldn't help I think? Or at least I don't see exactly what you mean by that. I think what could help is fn catch_unwind_ref(try: impl FnOnce() -> R, catch: impl FnOnce(&Exception) -> E) -> Result<R, E> where Exception is Clone (or can get a Box<dyn Any> out or so). That's obviously a much more complex API though.
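A rough sketch of what such an API could look like from the caller's side, layered on top of the existing std::panic::catch_unwind. Because it is built on catch_unwind, it does not deliver the code-size benefit discussed here; it only shows the shape. The assumptions: the exception is exposed as `&(dyn Any + Send)` instead of a dedicated `Exception` type, the first parameter is named `body` because `try` is a reserved keyword in Rust 2018 and later, and an UnwindSafe bound is added to satisfy catch_unwind.

```rust
use std::any::Any;
use std::panic::{self, UnwindSafe};

/// Hypothetical API sketch: the handler only borrows the payload and
/// decides what error value to produce from it.
pub fn catch_unwind_ref<R, E>(
    body: impl FnOnce() -> R + UnwindSafe,
    handler: impl FnOnce(&(dyn Any + Send)) -> E,
) -> Result<R, E> {
    match panic::catch_unwind(body) {
        Ok(value) => Ok(value),
        // The payload is still owned and dropped here; a real
        // implementation would need compiler or runtime support to
        // avoid materializing the Box at all.
        Err(payload) => Err(handler(&*payload)),
    }
}
```

A handler such as `|p| p.downcast_ref::<&str>().map(|s| s.to_string())` can then extract a string message without taking ownership of the payload.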

Optimize catch_unwind to match C++ try/catch

This refactors the implementation of catching unwinds to allow LLVM to inline the "try" closure directly into the happy path, avoiding indirection. This means that the catch_unwind implementation is (after this PR) zero-cost unless a panic is thrown.

https://rust.godbolt.org/z/cZcUSB is an example of the current codegen in a simple case. Notably, the codegen is *exactly the same* if `-Cpanic=abort` is passed, which is clearly not great.

This PR, on the other hand, generates the following assembly:

```asm
# -Cpanic=unwind:
  push rbx
  mov ebx,0x2a
  call QWORD PTR [rip+0x1c53c] # <happy>
  mov eax,ebx
  pop rbx
  ret
  mov rdi,rax
  call QWORD PTR [rip+0x1c537] # cleanup function call
  call QWORD PTR [rip+0x1c539] # <unfortunate>
  mov ebx,0xd
  mov eax,ebx
  pop rbx
  ret

# -Cpanic=abort:
  push rax
  call QWORD PTR [rip+0x20a1] # <happy>
  mov eax,0x2a
  pop rcx
  ret
```

Fixes #64224, and resolves #64222.

Mark-Simulacrum mentioned this issue Dec 22, 2019

Optimize catch_unwind to match C++ try/catch #67502

Merged

bors closed this as completed in be055d9 Mar 14, 2020
