rust 1.77.1 fails to build on aarch64-unknown-netbsd with stack exhaustion #123551
Comments
@he32 Odd. How much memory does your emulated aarch64 have access to? |
My qemu-emulated arm64 system has 8GB allocated, and it emulates 4 CPU cores. This build was done with a concurrency of 3. However, that says nothing about what the default thread stack size is on this system. The default process stack limit is 8MB, but this build is run with "unlimited" stack size (and data and virtual size -- rust is a pig), so it is possible we're running into the hard limit for the stack size. I've looked at #122002 and applied it to 1.71.1 and I'm currently re-trying the build with that applied, though I'm not very hopeful it will make a difference. It seems that on NetBSD/aarch64 the maximum process stack is 64MB, ref. vmparam.h's
Though an experiment in the shell says something slightly different:
Turns out the difference is due to address space layout randomization slop:
If I read the code correctly (not a guarantee), the default thread stack size is inherited from the process resource limit. |
WG-prioritization assigning priority @rustbot label -I-prioritize +P-low +regression-from-stable-to-stable |
Hmm. This is somewhat concerning and should not be happening on an aarch64 system, but I don't know if it's a problem on a native aarch64 system. |
Happens on native 64-core aarch64 system as well, w/ rust 1.77.2. |
In the meantime I have tried to use the cross-built (from amd64 targeting aarch64) rust compiler to build the
and the stack backtrace has 48 entries total. So ... this may not stem from the same underlying issue, or it might. It should, though, provide a test that can be replicated on other aarch64 systems relatively easily. So what is the state of testing of rust on other aarch64 targets? |
rustc passes the test suite on every commit for all the tests that are not specifically ignored for aarch64-unknown-linux-gnu, and while there are certainly a few of those they are not terribly numerous. |
Can the stack overflow error be convinced to report exactly what parameters (stack pointer, stack bounds, guard page addresses, whatever) led it to conclude the stack overflowed? And is there some way to determine whether this rustc thread that crashed is the process's main thread or a non-main thread created with pthread_create? I wouldn't be surprised if there were still something wrong with the stack guard detection logic after #122002. The resolution the PR settled on sounded fine but I didn't do anything to test it myself. It might be worthwhile to verify with a Rust program that the full range of stack space, from base to soft rlimit, can be written to and read from without crashing. |
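The kind of check suggested above could be sketched roughly as follows. This is only an illustrative probe, not something taken from rustc or libstd: the names `touch_frames`, `probe`, and `FRAME_BYTES` are made up here, and it exercises a spawned thread with an explicitly requested stack size rather than the main thread's rlimit-derived stack. It also touches only a conservative fraction of the stack, because the real per-frame overhead is not known in advance.

```rust
use std::hint::black_box;
use std::thread;

/// Size of the buffer each recursion level places on the stack (assumed page-sized).
const FRAME_BYTES: usize = 4096;

/// Recurse `remaining` levels; each level writes a stack buffer and reads it back.
/// Returns how many levels saw their buffer intact.
fn touch_frames(remaining: usize) -> usize {
    if remaining == 0 {
        return 0;
    }
    let mut buf = [0u8; FRAME_BYTES];
    let tag = (remaining % 251) as u8;
    // black_box keeps the optimizer from eliding the buffer or the writes.
    for b in black_box(&mut buf).iter_mut() {
        *b = tag;
    }
    let below = touch_frames(remaining - 1);
    // Reading the buffer after the recursive call keeps this frame alive
    // while the deeper frames run (so this is not a tail call).
    let intact = black_box(&buf).iter().all(|&b| b == tag);
    below + usize::from(intact)
}

/// Spawn a thread with `stack_bytes` of stack and touch roughly `touch_bytes` of it.
fn probe(stack_bytes: usize, touch_bytes: usize) -> usize {
    let frames = touch_bytes / FRAME_BYTES;
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(move || touch_frames(frames))
        .expect("failed to spawn probe thread")
        .join()
        .expect("probe thread panicked")
}

fn main() {
    // Touch 2 MiB of an 8 MiB stack; raise the second argument to probe deeper.
    let touched = probe(8 * 1024 * 1024, 2 * 1024 * 1024);
    println!("touched {touched} frames of {FRAME_BYTES} bytes each");
}
```

Raising `touch_bytes` step by step towards `stack_bytes` would show how close to the requested size the stack remains usable before the guard page is hit; a crash well short of the requested size would point at the stack setup or guard detection rather than at genuine exhaustion.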
@he32 I wonder if trying a newer gdb, either with build.sh -V MKCROSSGDB=yes tools, or from pkgsrc devel/gdb, might help to examine the core dump? |
thank you for the confirmed repro, by the way! that's weird. |
Hmm. I feel like I would hesitate before adding quite so much code to libstd, though maybe I'm just not imagining how slim it could be made. While I've expressed my thoughts about good diagnostics requiring some effort, there is still a limit to what should be done for everyone implicitly. However, rustc is its own program, and thus has no concerns like "accommodate smaller programs that don't want a lot of chaff in their binary". It is already not a small program. It is in fact several hundred megabytes of program. A few more bytes won't hurt much as long as they're actually useful from time to time (and not in the middle of a hot path, so don't affect icache too much). So on some platforms it has its own signal handler enabled, which tries to be much more informative. However, that handler uses rust/compiler/rustc_driver_impl/src/lib.rs Lines 90 to 98 in 4bc39f0
|
As for the "which thread is it" question, it is a spawned thread via our threadpool builder: rust/compiler/rustc_interface/src/util.rs Lines 84 to 117 in d371d17
|
From testing on a VM on a native aarch64 system with NetBSD 10.0, it seems the stack exhaustion issue started with #120188. After setting Since NetBSD does support TLS (and I believe x86_64-unknown-netbsd is fine?), I'd assume reverting the change for NetBSD wouldn't constitute a solution. Hopefully it helps narrow down where the issue is occurring though. |
This is an important clue. NetBSD on aarch64 has a known bug in its TLS implementation. There is a bug report with a patch, and with the patch applied the problem is no longer reproducible for me. @riastradh can we please just commit the patch without a test case since more and more things are breaking without it? Can |
I added a test case, verified it crashes in the releng testbed, and committed the fix, so if anyone wants to try with an ld.elf_so built with src/libexec/ld.elf_so/arch/aarch64/rtld_start.S rev. 1.6, that would be helpful to determine whether this bug was the culprit. (@he32?)
(This will almost certainly be 11.0 and 10.1, and possibly 9.5 if there is one.) |
@snowkat Thank you for diagnosing this!
No. Even if you can find some rude hack to allow it, it will be fairly deeply flawed. The Rust compiler lacks a way for people to conveniently compile for a specific OS version, so we define targets in terms of the minimum version we support. This has even led to the somewhat odd case of having a "windows" and a "win7" target. There are two realistic options here: remove As the on-file maintainer, I am inclined to defer to what @he32 prefers here. @riastradh Thank you for committing the fix. |
(We don't actually know yet whether what I committed fixes the issue. It could be a red herring. All I know is that it fixed the issue that we saw in Firefox. To be confirmed by a new build.) I suggest disabling has_thread_local on the aarch64-unknown-netbsd target (or aarch64--netbsd, which is what the GNU platform triple normally is, not sure whether that discrepancy will make a difference), if the setting can't reasonably be conditional on the OS version. A small performance penalty (not even a regression since Rust didn't use TLS before on aarch64--netbsd) for some niche use cases is better than breaking the build for everyone on all released versions of NetBSD. |
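For anyone wanting a small program that exercises thread-local storage on an affected system, something along these lines might serve as a starting point. It is only a sketch: `COUNTER`, `bump`, and `run_threads` are names invented for this example, and it is not the test case committed to NetBSD. As far as I understand it, the `has_thread_local` target setting decides whether std's `thread_local!` is backed by native ELF TLS or by an OS-key fallback, so the same source exercises whichever path the target selects.

```rust
use std::cell::Cell;
use std::thread;

thread_local! {
    // Each thread gets its own independent copy, starting at 0.
    static COUNTER: Cell<u32> = Cell::new(0);
}

/// Increment this thread's counter `times` times and return the final value.
fn bump(times: u32) -> u32 {
    COUNTER.with(|c| {
        for _ in 0..times {
            c.set(c.get() + 1);
        }
        c.get()
    })
}

/// Spawn `n` threads; thread `i` bumps its own counter `i` times.
fn run_threads(n: u32) -> Vec<u32> {
    let handles: Vec<_> = (1..=n)
        .map(|i| thread::spawn(move || bump(i)))
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}

fn main() {
    // With working TLS, thread i reports exactly i, since no thread
    // ever sees another thread's counter.
    println!("{:?}", run_threads(4));
}
```

If the counters leak between threads or the program crashes, that would suggest a problem in the TLS path being used, though a clean run of such a small program does not prove the absence of the bug discussed here.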
I agree it's best to disable |
Also set has_thread_local to false for NetBSD/aarch64*, due to NetBSD PR#58154, ref. comments in rust-lang/rust#123551 Verification remains.
All of our targets are defined by the full tuple. There is no such thing as a target, according to the Rust compiler, that does not have its target definition depend on both the operating system and the architecture. In particular, the spec for this target is defined here: rust/compiler/rustc_target/src/spec/targets/aarch64_unknown_netbsd.rs Lines 15 to 21 in 8bfcae7
That line: ..base::netbsd::opts() fills in the remaining fields with the NetBSD defaults, and will only fill in fields that were not explicitly passed as part of the constructor. |
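The struct update syntax described above can be illustrated with a simplified stand-in. Note that `TargetOptions`, `netbsd_opts`, and `aarch64_netbsd_opts` below are hypothetical and far smaller than rustc's real types, and the default values are assumptions chosen for illustration; the point is only that fields named explicitly win, and `..` fills in the rest from the OS-level defaults.

```rust
/// Hypothetical, heavily simplified stand-in for rustc's target options.
#[derive(Debug, Clone, PartialEq)]
struct TargetOptions {
    os: String,
    has_thread_local: bool,
    max_atomic_width: Option<u64>,
}

/// Hypothetical OS-wide defaults, playing the role of `base::netbsd::opts()`.
fn netbsd_opts() -> TargetOptions {
    TargetOptions {
        os: "netbsd".to_string(),
        has_thread_local: true, // assumed default for this illustration
        max_atomic_width: None,
    }
}

/// Hypothetical per-architecture options: override two fields, inherit the rest.
fn aarch64_netbsd_opts() -> TargetOptions {
    TargetOptions {
        // Explicit fields take precedence over the base defaults.
        has_thread_local: false,
        max_atomic_width: Some(128),
        // Every field not named above is taken from the NetBSD defaults.
        ..netbsd_opts()
    }
}

fn main() {
    println!("{:?}", aarch64_netbsd_opts());
}
```

This is why a setting such as `has_thread_local` can be turned off for one architecture without touching the shared NetBSD defaults that other architectures inherit.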
> I agree it's best to disable `has_thread_local`, with a note
> that it should be re-enabled with a suitable minimum OS
> version bump at a future date. If we can avoid disabling it
> for x86 that would be preferable but we can also conditionally
> enable it per platform in the package manager for NetBSD's
> vendor builds of rust for those users who would like to have
> the feature.
I don't think we are prepared to require a not-yet-released
NetBSD version 11.0, 10.1 or 9.5 or a pre-release of any of those
for working rust on aarch64*, while we continue at least in name
to support 9.0 and onwards in general for pkgsrc.
So in order to attempt to get a working new rust on aarch64* for
NetBSD, I have committed
NetBSD/pkgsrc-wip@a90cd31
I know, two functionally disparate changes in one commit is
frowned on, but at least this is what I'm running with at the
moment; verification of a native build will need to wait till
after the weekend if no one else beats me to it.
Hm, I probably would need to re-build 1.78.0 for aarch64* with
that change applied as well, and re-upload the corresponding
aarch64* bits. Or version those...
|
@he32 Yep, that looks correct! Please upstream the patch once you have verified it works! Or by August 22 even if you don't. |
OK, I have a first indication of success: the cross-compiled Next on the plan is to build 1.78.0 with this fix applied, and try to build rust 1.79.0 natively on |
NetBSD aarch64 has a bug in the thread-local storage implementation, ref. http://gnats.netbsd.org/58154, and this is responsible for the build misery experienced earlier, ref. rust-lang/rust#123551 Therefore: turn it off for now on arm64. Ideally one could check whether the specific OS version has the fix or not, but e.g. __NetBSD_Version__ isn't easily available here that I know, and this goes against the pattern that OS version should not matter as long as it's >= supported version. So until the fix for NetBSD PR#58154 has propagated to NetBSD releases, and the old ones are no longer supported for pkgsrc (which is going to take a while), this is what we do.
I tried building rust 1.77.1 using the internal LLVM on an emulated aarch64 system, as part of an effort of putting a new rust version through its testing cycle to keep it working on our various NetBSD platforms.
I expected to see this happen: The build should complete
Instead, this happened: The build of 1.77.1 fails with stack exhaustion. As a contrast, 1.76.0 succeeds on this same host.
Meta
rustc --version --verbose:
As this is in the middle of the build, it's a little unclear which version of rustc is running at this point, though indications are that it's 1.77.1 in one of the bootstrap stages. Trying to get the bootstrap compiler to run from the CLI is also proving challenging:
I hesitate to run x.py build -vvv, both because it probably requires its own settings / environment variables, and for fear that it will turn into a multi-hour endeavour.
The build log ends with:
Unfortunately, gdb decided not to play ball with this one: