Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows allocator is incompatible with passing heap allocations across DLL libraries with static std #131468

Closed
danakj opened this issue Oct 9, 2024 · 5 comments · Fixed by #131888
Labels
C-bug Category: This is a bug. O-windows Operating system: Windows T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@danakj
Copy link
Contributor

danakj commented Oct 9, 2024

Given two DLLs that contain statically-linked Rust code:

  • One DLL calls Rust to allocate a heap pointer
  • It passes the pointer to the second DLL
  • The second DLL calls Rust to free the heap pointer

The Windows allocation code will crash due to the HEAP global variable being null in the second DLL:

// SAFETY: because `ptr` has been successfully allocated with this allocator,
// `HEAP` must have been successfully initialized.
let heap = unsafe { get_process_heap() };

The allocator code assumes that the allocation already happened in the same DLL.

On other build targets, this setup works fine.

it's not clear why the Rust stdlib goes to all the trouble with the HEAP global variable in the first place. It could just call GetProcessHeap() on each call to HeapAlloc() and HeapFree(). Could we get rid of the HEAP variable?

How did I run into this? You might wonder.

This build setup probably looks weird, but it is a fundamental (bad) thing that Chromium uses for debug developer builds, which we call our component build. We mark subsets of the build graph as components and then everything inside the component is static-linked as usual, but the components are a cluster of DLLs. This breaks global variables and is generally problematic, you might think, and you're right. But it's also deemed to be important for making linking times on Windows debug builds reasonable.

Chromium bug: https://crbug.com/372425130

@danakj danakj added the C-bug Category: This is a bug. label Oct 9, 2024
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Oct 9, 2024
@danakj danakj changed the title Windows allocator is incompatible with passing heap-allocations across DLL libraries with static std Windows allocator is incompatible with passing heap allocations across DLL libraries with static std Oct 9, 2024
@Fulgen301
Copy link
Contributor

Fulgen301 commented Oct 9, 2024

At least on Windows 7 and Windows 11, GetProcessHeap reads the ProcessHeap field from the PEB, so the global variable is not a performance improvement. The code was introduced in #83065, however without mentioning why the global atomic variable was an improvement over calling GetProcessHeap directly. citing performance improvements in local testing; I haven't measured it, but I don't see why avoiding one call into kernelbase.dll that contains 2 - 3 mov instructions on x86 would measurably improve performance in a code path that proceeds to allocate or free memory?

@jieyouxu jieyouxu added T-libs Relevant to the library team, which will review and decide on the PR/issue. O-windows Operating system: Windows labels Oct 10, 2024
@bjorn3
Copy link
Member

bjorn3 commented Oct 10, 2024

If two DLL's statically link the standard library, you should not use the rust abi between them. This is not the only place where there is global mutable state in rust. For example for panics when catching we explicitly check that the underlying exception was thrown by the same copy of the standard library to prevent UB in case of both copies being from a different rustc version. And even aside from that check having multiple copies of the panic count can easily result in a panic turning into an abort if it was caught by a different copy of the standard library.

Would it be possible to dynamically link both DLL's against the same libstd.dll? That avoids duplication of the global mutable state of the standard library.

@ChrisDenton
Copy link
Member

citing performance improvements in local testing; I haven't measured it, but I don't see why avoiding one call into kernelbase.dll that contains 2 - 3 mov instructions on x86 would measurably improve performance in a code path that proceeds to allocate or free memory?

At least theoretically, a function call can be more expensive for the happy path (which is just accessing/testing an atomic) even if that function returns a static. Whether or not it's measurable compared to allocating, I don't know, I haven't measured it. Admittedly I wouldn't have assumed so but apparently at least one person has measured improvements.

If two DLL's statically link the standard library, you should not use the rust abi between them.

Yeah, we can walk back this alleged optimization (personally I'm open to that) but the general problem will remain.

@danakj
Copy link
Contributor Author

danakj commented Oct 10, 2024

Yeah, we can walk back this alleged optimization (personally I'm open to that) but the general problem will remain.

Right, I fully acknowledge it's a general problem, and I'm not expecting this to generally be fine. The chromuim component build has all kinds of issues caused by globals in various libraries. I'd just like to make this code not crash in this configuration.

Would it be possible to dynamically link both DLL's against the same libstd.dll

One thing Chromium could do is to turn std into a "component" in this build, thus making it a shared library in that build configuration. There's other things we could do downstream to avoid this crash. The optimization and complexity here caused by the HEAP variable just didn't look needed, so I thought it might be nice to remove it upstream while this is top of mind.

@jieyouxu jieyouxu removed the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Oct 21, 2024
@danakj
Copy link
Contributor Author

danakj commented Oct 23, 2024

For what it's worth, I benchmarked removing the HEAP variable with a really trivial and probably silly micro-benchmark.

Here's my main.rs that I ran cargo bench against.

#![feature(test)]
extern crate test;

fn main() {
    println!("Hello, world!");
}

#[cfg(test)]
mod tests {
    use test::Bencher;

    #[bench]
    fn bench_alloc(b: &mut Bencher) {
        b.iter(|| alloc_test());
    }

    fn alloc_test() {
        let x: Box<i32> = Box::new(0);
        test::black_box(x);
    }
}

I also did it again with a Box<[i32; 9001]> to compare a larger allocation and Box<[std::mem::MaybeUninit<i32>; 9001]> to compare with uninitialized memory.

The HEAP variable seems to save about 2ns of time, for what that's worth, while an allocation/free can take more than 100ns (on my virtualized Windows machine). The time is within the noise when allocating 9001 i32s and not initializing them. Once memory initialization is involved, the time difference is completely buried within the noise.

Results on HEAD:

# One i32
test tests::bench_alloc ... bench:          42.76 ns/iter (+/- 0.44)
test tests::bench_alloc ... bench:          42.96 ns/iter (+/- 0.37)
test tests::bench_alloc ... bench:          42.91 ns/iter (+/- 0.73)
test tests::bench_alloc ... bench:          42.64 ns/iter (+/- 1.61)
test tests::bench_alloc ... bench:          43.95 ns/iter (+/- 0.76)
test tests::bench_alloc ... bench:          42.87 ns/iter (+/- 0.42)
test tests::bench_alloc ... bench:          43.29 ns/iter (+/- 0.38)
test tests::bench_alloc ... bench:          42.81 ns/iter (+/- 1.64)
test tests::bench_alloc ... bench:          43.13 ns/iter (+/- 0.69)

# 9001 i32s
test tests::bench_alloc ... bench:         529.47 ns/iter (+/- 5.57)
test tests::bench_alloc ... bench:         511.10 ns/iter (+/- 5.51)
test tests::bench_alloc ... bench:         514.95 ns/iter (+/- 15.19)
test tests::bench_alloc ... bench:         523.20 ns/iter (+/- 6.20)
test tests::bench_alloc ... bench:         510.23 ns/iter (+/- 18.15)
test tests::bench_alloc ... bench:         510.80 ns/iter (+/- 19.90)
test tests::bench_alloc ... bench:         511.86 ns/iter (+/- 8.72)
test tests::bench_alloc ... bench:         513.21 ns/iter (+/- 6.13)
test tests::bench_alloc ... bench:         512.11 ns/iter (+/- 18.64)

# 9001 MaybeUninit<i32>s
test tests::bench_alloc ... bench:         108.39 ns/iter (+/- 3.70)
test tests::bench_alloc ... bench:         105.99 ns/iter (+/- 3.28)
test tests::bench_alloc ... bench:         105.88 ns/iter (+/- 2.33)
test tests::bench_alloc ... bench:         106.11 ns/iter (+/- 2.00)
test tests::bench_alloc ... bench:         106.48 ns/iter (+/- 5.75)
test tests::bench_alloc ... bench:         108.20 ns/iter (+/- 4.89)
test tests::bench_alloc ... bench:         105.87 ns/iter (+/- 4.56)
test tests::bench_alloc ... bench:         105.80 ns/iter (+/- 1.76)
test tests::bench_alloc ... bench:         106.16 ns/iter (+/- 5.27)

Results with patch to remove HEAP variable:

# One i32
test tests::bench_alloc ... bench:          45.71 ns/iter (+/- 4.74)
test tests::bench_alloc ... bench:          45.73 ns/iter (+/- 2.51)
test tests::bench_alloc ... bench:          46.03 ns/iter (+/- 1.15)
test tests::bench_alloc ... bench:          46.21 ns/iter (+/- 1.10)
test tests::bench_alloc ... bench:          45.99 ns/iter (+/- 2.93)
test tests::bench_alloc ... bench:          47.92 ns/iter (+/- 0.60)
test tests::bench_alloc ... bench:          46.52 ns/iter (+/- 3.34)
test tests::bench_alloc ... bench:          45.66 ns/iter (+/- 1.39)
test tests::bench_alloc ... bench:          45.88 ns/iter (+/- 0.40)

# 9001 i32s
test tests::bench_alloc ... bench:         511.37 ns/iter (+/- 8.81)
test tests::bench_alloc ... bench:         514.31 ns/iter (+/- 12.93)
test tests::bench_alloc ... bench:         515.96 ns/iter (+/- 11.38)
test tests::bench_alloc ... bench:         514.65 ns/iter (+/- 15.67)
test tests::bench_alloc ... bench:         516.57 ns/iter (+/- 10.78)
test tests::bench_alloc ... bench:         519.76 ns/iter (+/- 15.14)
test tests::bench_alloc ... bench:         520.05 ns/iter (+/- 22.08)
test tests::bench_alloc ... bench:         519.02 ns/iter (+/- 13.15)
test tests::bench_alloc ... bench:         515.42 ns/iter (+/- 10.17)

# 9001 MaybeUninit<i32>s
test tests::bench_alloc ... bench:         112.31 ns/iter (+/- 2.30)
test tests::bench_alloc ... bench:         112.67 ns/iter (+/- 2.01)
test tests::bench_alloc ... bench:         112.79 ns/iter (+/- 1.99)
test tests::bench_alloc ... bench:         112.94 ns/iter (+/- 2.14)
test tests::bench_alloc ... bench:         112.83 ns/iter (+/- 4.61)
test tests::bench_alloc ... bench:         113.45 ns/iter (+/- 2.95)
test tests::bench_alloc ... bench:         112.41 ns/iter (+/- 4.21)
test tests::bench_alloc ... bench:         115.05 ns/iter (+/- 3.59)
test tests::bench_alloc ... bench:         112.69 ns/iter (+/- 1.84)

I used this patch to remove HEAP and built std.

Here's the .cargo/config.toml file I used, to point at the Chromium toolchain stuff. I used static CRT because that's what we use in our optimized builds. I presume that's the one that matters.

[build]
rustflags = "-Clinker=C:\\src\\c\\src\\third_party\\llvm-build\\Release+Asserts\\bin\\lld-link.exe -Clink-arg=/winsysroot:C:\\src\\c\\src\\third_party\\depot_tools\\win_toolchain\\vs_files\\7393122652 -Ctarget-feature=+crt-static"

I have C:\\src\\c\\src\\third_party\\rust-toolchain\\bin at the front of my PATH environment variable to use the Chromium rust toolchain, which is the one I built with and without the above patch.

@bors bors closed this as completed in 775f6d8 Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug. O-windows Operating system: Windows T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants