Inlined function duplication across complex branches when extern "Rust" is used with LTO and opt-level="s" #102295

Open
cr1901 opened this issue Sep 26, 2022 · 0 comments
Labels
A-LTO Area: Link-time optimization (LTO) I-heavy Issue: Problems and improvements with respect to binary size of generated code. O-Arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state O-msp430 T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

Contributor

cr1901 commented Sep 26, 2022

Context

The example code linked and described here is an MCVE. See the Background For "Real" Applications section for details.

  • Consider a Rust binary which calls a function free(f) within its main(). free() takes a closure f with a branch (?) as input, and in turn calls f and then a function called release().
  • The Rust binary has a feature called use-extern-cs. When disabled, the body of both free() and release() are provided by an external crate called critical. When enabled, the free() function is provided by the main binary instead of critical, and the release() function is marked as extern "Rust" in the main binary's source file.
  • Within the critical crate, the release() function may or may not be marked as #[inline]. This is controlled by the critical/inline feature.
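The shape described above can be sketched on a host machine as follows. This is only an illustration, not the code in the linked repository: the counter, the `checked_quotient` helper, and the arithmetic inside the closure are hypothetical stand-ins, and the counter increment replaces the nop-filled body that the MCVE's release() uses.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical counter so that calls to `release` are observable on a host.
pub static RELEASES: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for the critical crate's `release()`. The `#[inline]` attribute
/// models the `critical/inline` feature being enabled.
#[inline]
pub fn release() {
    RELEASES.fetch_add(1, Ordering::SeqCst);
}

/// Calls the closure, then calls `release()` exactly once,
/// regardless of which path the closure took.
pub fn free<R>(f: impl FnOnce() -> R) -> R {
    let r = f();
    release();
    r
}

/// Hypothetical caller: the closure contains a branch because `?`
/// returns early from the closure when `x` is zero.
pub fn checked_quotient(x: u32) -> Option<u32> {
    free(|| {
        let q = 100u32.checked_div(x)?;
        Some(q + 1)
    })
}
```

At the source level there is a single call to release() after the closure returns. The issue is about the optimized output containing the body of release once per branch of the closure instead.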

Instructions

  1. If testing msp430, make sure the msp430-elf-gcc toolchain is installed. Optionally install just for convenience.

  2. git clone https://github.com/cr1901/msp430-size. Use commit b8ef905 specifically.

    Despite the name of the repo, this code works for thumbv6m-none-eabi as well; the behavior appears to be arch-agnostic.

  3. Make sure a nightly Rust toolchain is installed (for -Zbuild-std=core).

  4. Run the following command:

    cargo +nightly rustc --manifest-path=./test-cases/Cargo.toml --target=$TARGET --release -Zbuild-std=core --example=critical --features=$FEATURES -- --emit=obj=target/$TARGET/release/examples/critical.o,llvm-ir=target/$TARGET/release/examples/critical.ll,asm=target/$TARGET/release/examples/critical.s

    where:

    • $TARGET: either msp430-none-elf or thumbv6m-none-eabi.
    • $FEATURES: empty, use-extern-cs, critical/inline, or use-extern-cs,critical/inline
  5. Examine the emitted LLVM IR and assembly files directly, and the object/ELF files with objdump, and look for a run of ten consecutive nops, which may appear once or several times. Each nop sled represents one copy of the body of release.
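Counting the sleds by eye is error-prone, so a small helper along these lines may be useful. It is a hypothetical sketch that is not part of the repository. It assumes the disassembly spells the instruction as a whitespace-separated `nop` token and that the sled instructions sit on consecutive lines, which may not hold for every target or objdump configuration.

```rust
/// Counts nop sleds in assembly or objdump text.
///
/// A line counts as a nop if any whitespace-separated token equals "nop"
/// (case-insensitive). Each maximal run of consecutive nop lines
/// contributes `run_length / sled_len` sleds, so two sleds placed
/// back to back are still counted as two.
pub fn count_nop_sleds(asm: &str, sled_len: usize) -> usize {
    assert!(sled_len > 0, "sled_len must be non-zero");
    let mut sleds = 0;
    let mut run = 0;
    for line in asm.lines() {
        let is_nop = line
            .split_whitespace()
            .any(|tok| tok.eq_ignore_ascii_case("nop"));
        if is_nop {
            run += 1;
        } else {
            // The run ended: convert it to whole sleds and start over.
            sleds += run / sled_len;
            run = 0;
        }
    }
    // Account for a run that reaches the end of the input.
    sleds + run / sled_len
}
```

For example, the contents of critical.s could be read with `std::fs::read_to_string` and passed in with a `sled_len` of 10. A result of 1 matches the expected behavior, and a result of 2 indicates the duplication.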

Expected Behavior

The body of release appears once for the single call to free(), regardless of which combinations of features are enabled (including none).

Actual Behavior

The body of release appears twice in the single call to free() for all combinations of features, except for --features=critical/inline.

Other Hints

  • Sometimes I don't need the #[inline] attribute to prevent release's body from being duplicated. However, I could not translate this behavior well from my real application to MCVE. One way that I found works is to remove the extern "Rust" fn release() declaration, and paste the critical::internal::release() impl directly in the main source file.
  • The extern "Rust" declaration seems to prevent #[inline] hints from working at all.
  • If rustc decides to duplicate release, sometimes rustc will inline one call of release into free, but not the other.
  • release duplication appears in the LLVM files emitted by rustc.

Background For "Real" Applications

The embedded Rust community has started to standardize around a pluggable critical-section crate. The critical-section crate by necessity marks some functions as extern "Rust" and defers to other crates to define them. Specifically, the critical_section::free(f) function takes a closure f() and calls in order (args omitted):

  1. extern "Rust" acquire()
  2. f()
  3. extern "Rust" release()
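A minimal sketch of this call sequence follows, runnable on a host. The symbol names `_demo_cs_acquire` and `_demo_cs_release`, the `u8` token, and the `provider` module are assumptions made for illustration and are not the critical-section crate's actual API. The sketch uses the `unsafe extern` and `#[unsafe(no_mangle)]` spellings accepted by Rust 1.82 and later; code from the time of this issue would write plain `extern "Rust"` and `#[no_mangle]`.

```rust
// Declarations only: the bodies are resolved at link time, which is how
// the "pluggable" crate defers to whichever crate provides them.
unsafe extern "Rust" {
    fn _demo_cs_acquire() -> u8;
    fn _demo_cs_release(token: u8);
}

/// Calls acquire, then the closure, then release, in that order.
pub fn free<R>(f: impl FnOnce() -> R) -> R {
    // SAFETY: both symbols are defined in `provider` below with
    // signatures that match the declarations above.
    let token = unsafe { _demo_cs_acquire() };
    let r = f();
    unsafe { _demo_cs_release(token) };
    r
}

/// Hypothetical provider. In a real application this would live in a
/// separate crate and mask or restore interrupts instead of counting.
pub mod provider {
    use std::sync::atomic::{AtomicU8, Ordering};

    /// Nesting depth, used here only to make the calls observable.
    pub static DEPTH: AtomicU8 = AtomicU8::new(0);

    #[unsafe(no_mangle)]
    pub fn _demo_cs_acquire() -> u8 {
        // Returns the previous depth as the token.
        DEPTH.fetch_add(1, Ordering::SeqCst)
    }

    #[unsafe(no_mangle)]
    pub fn _demo_cs_release(token: u8) {
        // Restores the depth recorded by the matching acquire.
        DEPTH.store(token, Ordering::SeqCst);
    }
}
```

Because free() sees only declarations for the two extern functions, the optimizer can inline their bodies only after the crates are merged, which is why the LTO settings matter for this report.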

The crate doesn't define any new functionality for embedded Rust applications; it rather changes how existing functionality (critical sections) is implemented. In principle, the crate should be drop-in to existing embedded Rust applications.

When I transitioned some embedded Rust code to use the critical-section crate, I noticed marked size increases in the .text section (1992 bytes => 2048+ bytes, which no longer fits) due to new overhead from how critical_section::free(f) is inlined in my main application's functions. Specifically, if the closure f passed to critical_section::free(f) has a sufficiently complex branch, rustc will duplicate the body of release across both sides of the branch, even when lto="fat" and opt-level="s".

Calling critical_section::free() is essential for sharing non-atomic data between interrupts/threads in a bare-metal application. To minimize interrupt latency and maximize the amount of work that can be done, the size/speed overhead of these calls should be kept as small as possible. I don't understand why Rust is unable to inline calls to critical_section::free(f) without duplicating the body of release (when lto="fat" and codegen-units=1 are enabled) in any of the following scenarios:

  1. acquire(), release(), and free() are all provided inline by the main binary.
  2. acquire(), release(), and free() are all provided by the same crate (via use statements, with no extern "Rust").
  3. free() is provided by one crate (via use), acquire() and release() are provided by another (via use).
  4. free() is provided by one crate (via use), extern "Rust" acquire() and extern "Rust" release() are provided by another crate.

For the MCVE the body of release is exaggerated; actual size difference will vary depending on application. From my own testing, real thumbv6m-none-eabi applications have the duplication, but on average are affected less than msp430-none-elf.

@jyn514 jyn514 added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. I-heavy Issue: Problems and improvements with respect to binary size of generated code. A-LTO Area: Link-time optimization (LTO) O-msp430 O-Arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state labels Apr 26, 2023