Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Box::new copies generated Future unless call is wrapped #100932

Closed
jkarneges opened this issue Aug 23, 2022 · 14 comments
Closed

Box::new copies generated Future unless call is wrapped #100932

jkarneges opened this issue Aug 23, 2022 · 14 comments
Labels
A-layout Area: Memory layout of types C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@jkarneges
Copy link
Contributor

Sometimes boxing a generated Future causes the value to be (unnecessarily?) copied. This can be worked around by wrapping the call to Box::new.

Code:

use std::future::ready;
use std::mem;

struct Foo {}

impl Drop for Foo {
    fn drop(&mut self) {
        println!("foo");
    }
}

async fn func1() {
    let _f = Foo {};
    let mut buf = [0u8; 16_384];
    buf[0] = 42;
    ready(()).await;
    println!("{}", buf[0]);
}

fn _box_new<T>(x: T) -> Box<T> {
    Box::new(x)
}

fn take<T>(v: T) {
    println!("{}", mem::size_of_val(&v));
}

fn main() {
    take(Box::new(func1()));
    //take(_box_new(func1()));
}

Asm contains a ~16k memcpy:

$ rustc -O --edition 2021 --emit asm boxcopybug.rs && grep -B 3 memcpy boxcopybug.s
        leaq    32(%rsp), %rsi
        movl    $16386, %edx
        movq    %rax, %rdi
        callq   *memcpy@GOTPCREL(%rip)

In main, if you comment out the first line and uncomment the second, the memcpy goes away, even though the code is nearly identical. I'd expect both ways to compile down the same.

Box::pin is similarly affected.

The behavior is present in the latest stable and nightly.

rustc --version --verbose:

rustc 1.63.0 (4b91a6ea7 2022-08-08)
binary: rustc
commit-hash: 4b91a6ea7258a947e59c6522cd5898e7c0a6a88f
commit-date: 2022-08-08
host: x86_64-unknown-linux-gnu
release: 1.63.0
LLVM version: 14.0.5
@jkarneges jkarneges added the C-bug Category: This is a bug. label Aug 23, 2022
@guswynn
Copy link
Contributor

guswynn commented Aug 24, 2022

moves are actually implemented as memcpy's under the hood! Boxing the future requires you move the value from the stack into the allocation on the heap. In practice, I haven't ever seen a move cause a meaningful performance problem, unless the type is huge (futures CAN be huge). There were some RFC's at one point to add the ability to guarantee a value is constructed in place on the heap, but the problem-space is complex! I wonder if --release optimizes anything away

@est31
Copy link
Member

est31 commented Aug 24, 2022

Thanks for this, I've been looking for small examples where Box::new is not optimized out by LLVM's optimizer. I knew they existed but I didn't find a concrete example. You can box directly via two nightly features, box x or #[rustc_box] Box:new(x), this should be memcpy-free. But I don't know how long any of the two will still exist.

@est31
Copy link
Member

est31 commented Aug 24, 2022

For more, see #49733

@jkarneges
Copy link
Contributor Author

I haven't ever seen a move cause a meaningful performance problem, unless the type is huge (futures CAN be huge)

Yeah it seems like a Rust best practice is to always work with smallish structs and basically assume everything might get copied when moved, and hope that such copying won't be meaningful. Unfortunately, generated futures often make working with big structs unavoidable.

Interestingly, I discovered this issue while attempting to avoid moving futures. Most runtime spawn functions (for example, tokio::spawn) are generic and take a non-boxed value (i.e. they have a signature like spawn<T: Future>(fut: T)). Such functions then box the future internally. I wondered if maybe moving a whole future by value into the spawn function could potentially be slow compared to having the caller box the future first and then changing the spawn function to take a box, so I tested that. My expectation was that this would probably have no impact because the optimizer was already avoiding copies, but maybe I'd get lucky my change might turn out to be faster. I did not at all expect that this would be slower, and so I went on a hunt to find out why.

It is really counterintuitive because you'd think immediately boxing a future would be a good best practice, i.e. it would be the best chance of avoiding copies, when actually immediate boxing risks adding a copy. Passing the future by value to a function that then boxes the value is how you save the copy.

@veber-alex
Copy link
Contributor

veber-alex commented Aug 24, 2022

On nightly with -Copt-level=3 instead of -O (which is opt-level=2) there is no memcpy.
Though it's still strange that putting the Box::new in a wrapper functions actually changes anything.

@jkarneges
Copy link
Contributor Author

jkarneges commented Aug 24, 2022

The memcpy is still present for me with opt-level=3.

EDIT: sorry, that was on stable. On nightly I can confirm that opt-level=3 does optimize away the memcpy. So maybe this issue is fixed.

@est31
Copy link
Member

est31 commented Aug 24, 2022

it's still strange that putting the Box::new in a wrapper functions actually changes anything.

Generally, optimizers employ a large amount of heuristics which are unreliable. It's not uncommon that something like this causes changes in their output.

On nightly I can confirm that opt-level=3 does optimize away the memcpy.

If I had to guess, I would point at the LLVM 15 upgrade (#99464). LLVM keeps changing their optimizer all the time so it's possible that it now detects this case, whether unintentionally or not.

@eholk
Copy link
Contributor

eholk commented Aug 24, 2022

@rustbot label +A-async-await

@rustbot rustbot added the A-async-await Area: Async & Await label Aug 24, 2022
@tmandry tmandry added I-slow Issue: Problems and improvements with respect to performance of generated code. A-layout Area: Memory layout of types labels Aug 25, 2022
@tmandry
Copy link
Member

tmandry commented Aug 29, 2022

opt-level=3 can be quite aggressive, and I think we should be able to optimize this away at opt-level=2 at least.

@eholk
Copy link
Contributor

eholk commented Aug 29, 2022

Do we know that opt-level=2 doesn't catch this on nightly still? If there was an LLVM upgrade, maybe that caught this case as well.

@eholk
Copy link
Contributor

eholk commented Aug 29, 2022

We talked about this in the wg-async triage meeting. It sounds like this is a more general bug that async code happens to be more likely to expose, so we'll remove the async label.

@rustbot label -A-async-await

@rustbot rustbot removed the A-async-await Area: Async & Await label Aug 29, 2022
@jkarneges
Copy link
Contributor Author

opt-level=2 on nightly does not catch it

@jkarneges
Copy link
Contributor Author

On nightly I can confirm that opt-level=3 does optimize away the memcpy. So maybe this issue is fixed.

I've confirmed this issue is fixed in 1.65.0-beta.1 (2a65764 2022-09-19).

@jkarneges
Copy link
Contributor Author

And... confirmed the fix made it to stable! (1.65.0)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-layout Area: Memory layout of types C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code.
Projects
None yet
Development

No branches or pull requests

7 participants