Segmentation fault when formatting u128 on aarch64 GNU/Linux #102196

Closed
prestontimmons opened this issue Sep 23, 2022 · 11 comments
Labels
C-bug Category: This is a bug. O-Arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state S-needs-repro Status: This issue has no reproduction and needs a reproduction to make progress.

Comments

@prestontimmons

Hello, we've noticed segmentation faults when running Rust binaries compiled on aarch64 GNU/Linux. We've seen this occur in multiple libraries that format or print SystemTime.

Architecture:

uname -a

5.10.135-122.509.amzn2.aarch64 #1 SMP Thu Aug 11 22:41:14 UTC 2022 aarch64 GNU/Linux

Reproducible example:

fn main() {
    let millis: u128 = 87329875;
    println!("{}", millis);
}

The segmentation fault occurs when fmt_u128 is called.
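For context on why code that prints a SystemTime can end up formatting a u128: `Duration::as_millis` returns a `u128`, so a timestamp in milliseconds goes through the same decimal formatting as the reproducer above. The following is only a minimal sketch of that pattern; the helper name `millis_since_epoch` is hypothetical and is not taken from any of the affected libraries.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical helper: milliseconds since the Unix epoch.
/// `Duration::as_millis` returns a `u128`.
fn millis_since_epoch() -> u128 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is set before the Unix epoch")
        .as_millis()
}

fn main() {
    let millis: u128 = millis_since_epoch();
    // Decimal `Display` formatting of a `u128`, the same kind of call
    // as in the reproducer above.
    println!("{}", millis);
}
```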

I tested this on 1.62.0 and nightly:

/builds/scratch# cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.23s
     Running `target/debug/scratch`
Segmentation fault (core dumped)

# rustc --version --verbose
rustc 1.62.0 (a8314ef7d 2022-06-27)
binary: rustc
commit-hash: a8314ef7d0ec7b75c336af2c9857bfaf43002bfc
commit-date: 2022-06-27
host: aarch64-unknown-linux-gnu
release: 1.62.0
LLVM version: 14.0.5
# cargo +nightly run
    Finished dev [unoptimized + debuginfo] target(s) in 0.24s
     Running `target/debug/scratch`
Segmentation fault (core dumped)

# rustc +nightly --version --verbose
rustc 1.66.0-nightly (e7119a030 2022-09-22)
binary: rustc
commit-hash: e7119a0300b87a3d670408ee8e847c6821b3ae80
commit-date: 2022-09-22
host: aarch64-unknown-linux-gnu
release: 1.66.0-nightly
LLVM version: 15.0.0

The segmentation fault does not occur in release mode:

# cargo run --release
    Finished release [optimized] target(s) in 0.23s
     Running `target/release/scratch`
87329875

It also does not occur if opt-level is set to greater than 0:

[profile.dev]
opt-level = 1

It also does not occur on Darwin aarch64:

uname -a

Darwin TC-4000660 21.6.0 Darwin Kernel Version 21.6.0: Wed Aug 10 14:28:23 PDT 2022; root:xnu-8020.141.5~2/RELEASE_ARM64_T6000 arm64

Meta

Valgrind traceback:

Backtrace

# valgrind target/debug/scratch
==5157== Memcheck, a memory error detector
==5157== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==5157== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==5157== Command: target/debug/scratch
==5157== 
==5157== Invalid read of size 4
==5157==    at 0x112474: alternate (mod.rs:1893)
==5157==    by 0x112474: core::fmt::Formatter::pad_integral (mod.rs:1366)
==5157==    by 0x111BBB: core::fmt::num::fmt_u128 (num.rs:641)
==5157==    by 0x112347: core::fmt::write (mod.rs:1202)
==5157==    by 0x15D5FB: write_fmt<std::io::stdio::StdoutLock> (mod.rs:1679)
==5157==    by 0x15D5FB: <&std::io::stdio::Stdout as std::io::Write>::write_fmt (stdio.rs:715)
==5157==    by 0x15E133: write_fmt (stdio.rs:689)
==5157==    by 0x15E133: print_to<std::io::stdio::Stdout> (stdio.rs:1017)
==5157==    by 0x15E133: std::io::stdio::_print (stdio.rs:1030)
==5157==    by 0x10CDCB: scratch::main (main.rs:3)
==5157==    by 0x10CEA3: core::ops::function::FnOnce::call_once (function.rs:251)
==5157==    by 0x11B3AB: std::sys_common::backtrace::__rust_begin_short_backtrace (backtrace.rs:122)
==5157==    by 0x17925F: std::rt::lang_start::{{closure}} (rt.rs:166)
==5157==    by 0x15B38B: call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (function.rs:286)
==5157==    by 0x15B38B: do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panicking.rs:464)
==5157==    by 0x15B38B: try<i32, &(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (panicking.rs:428)
==5157==    by 0x15B38B: catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panic.rs:137)
==5157==    by 0x15B38B: {closure#2} (rt.rs:148)
==5157==    by 0x15B38B: do_call<std::rt::lang_start_internal::{closure_env#2}, isize> (panicking.rs:464)
==5157==    by 0x15B38B: try<isize, std::rt::lang_start_internal::{closure_env#2}> (panicking.rs:428)
==5157==    by 0x15B38B: catch_unwind<std::rt::lang_start_internal::{closure_env#2}, isize> (panic.rs:137)
==5157==    by 0x15B38B: std::rt::lang_start_internal (rt.rs:148)
==5157==    by 0x17922B: std::rt::lang_start (rt.rs:165)
==5157==    by 0x10CE07: main (in /builds/scratch/target/debug/scratch)
==5157==  Address 0x31 is not stack'd, malloc'd or (recently) free'd
==5157== 
==5157== 
==5157== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==5157==  Access not within mapped region at address 0x57
==5157==    at 0x112474: alternate (mod.rs:1893)
==5157==    by 0x112474: core::fmt::Formatter::pad_integral (mod.rs:1366)
==5157==    by 0x111BBB: core::fmt::num::fmt_u128 (num.rs:641)
==5157==    by 0x112347: core::fmt::write (mod.rs:1202)
==5157==    by 0x15D5FB: write_fmt<std::io::stdio::StdoutLock> (mod.rs:1679)
==5157==    by 0x15D5FB: <&std::io::stdio::Stdout as std::io::Write>::write_fmt (stdio.rs:715)
==5157==    by 0x15E133: write_fmt (stdio.rs:689)
==5157==    by 0x15E133: print_to<std::io::stdio::Stdout> (stdio.rs:1017)
==5157==    by 0x15E133: std::io::stdio::_print (stdio.rs:1030)
==5157==    by 0x10CDCB: scratch::main (main.rs:3)
==5157==    by 0x10CEA3: core::ops::function::FnOnce::call_once (function.rs:251)
==5157==    by 0x11B3AB: std::sys_common::backtrace::__rust_begin_short_backtrace (backtrace.rs:122)
==5157==    by 0x17925F: std::rt::lang_start::{{closure}} (rt.rs:166)
==5157==    by 0x15B38B: call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (function.rs:286)
==5157==    by 0x15B38B: do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panicking.rs:464)
==5157==    by 0x15B38B: try<i32, &(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (panicking.rs:428)
==5157==    by 0x15B38B: catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panic.rs:137)
==5157==    by 0x15B38B: {closure#2} (rt.rs:148)
==5157==    by 0x15B38B: do_call<std::rt::lang_start_internal::{closure_env#2}, isize> (panicking.rs:464)
==5157==    by 0x15B38B: try<isize, std::rt::lang_start_internal::{closure_env#2}> (panicking.rs:428)
==5157==    by 0x15B38B: catch_unwind<std::rt::lang_start_internal::{closure_env#2}, isize> (panic.rs:137)
==5157==    by 0x15B38B: std::rt::lang_start_internal (rt.rs:148)
==5157==    by 0x17922B: std::rt::lang_start (rt.rs:165)
==5157==    by 0x10CE07: main (in /builds/scratch/target/debug/scratch)
==5157==  If you believe this happened as a result of a stack
==5157==  overflow in your program's main thread (unlikely but
==5157==  possible), you can try to increase the size of the
==5157==  main thread stack using the --main-stacksize= flag.
==5157==  The main thread stack size used in this run was 10485760.
==5157== 
==5157== HEAP SUMMARY:
==5157==     in use at exit: 1,109 bytes in 4 blocks
==5157==   total heap usage: 9 allocs, 5 frees, 2,997 bytes allocated
==5157== 
==5157== LEAK SUMMARY:
==5157==    definitely lost: 0 bytes in 0 blocks
==5157==    indirectly lost: 0 bytes in 0 blocks
==5157==      possibly lost: 0 bytes in 0 blocks
==5157==    still reachable: 1,109 bytes in 4 blocks
==5157==         suppressed: 0 bytes in 0 blocks
==5157== Rerun with --leak-check=full to see details of leaked memory
==5157== 
==5157== For lists of detected and suppressed errors, rerun with: -s
==5157== ERROR SUMMARY: 2 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault

@prestontimmons prestontimmons added the C-bug Category: This is a bug. label Sep 23, 2022
@thomcc thomcc added the O-Arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state label Sep 25, 2022
@Noratrieb
Member

I can't reproduce this segfault inside an arm64 docker container on an x86_64 host, so this seems to require a real machine and doesn't reproduce under QEMU.
Linux 8ae19ce495c5 5.10.102.1-microsoft-standard-WSL2 #1 SMP Wed Mar 2 00:30:59 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

@saethlin
Member

Thus far I cannot reproduce this issue. Perhaps because I'm on

Linux alarm 5.19.8-1-aarch64-ARCH #1 SMP PREEMPT Thu Sep 8 18:20:33 MDT 2022 aarch64 GNU/Linux

Is the above output from a graviton2 instance?

@thomcc
Member

thomcc commented Sep 25, 2022

Are you using mold as your linker by any chance? Seems somewhat similar to #101247.

@prestontimmons
Author

Thanks for looking into this.

  1. I also have not been able to reproduce this on x86_64 or in an emulated docker running on x86_64.

  2. Yes, it is a graviton2 instance using 5.10.135-122.509.amzn2.aarch64.

  3. No, this is using the default linker. mold has not been added.

I did some more testing and found an interesting result. When using cargo run directly on the host, the segmentation fault does not occur, but I see it consistently in the docker runner that runs on the host (this is part of our CI). The docker image is based on rust:1.62-slim-bullseye.

I'll dig deeper and find a more specific setup that reproduces it.

@Dirreke
Contributor

Dirreke commented Sep 19, 2023

I ran into a similar issue on the csky arch when using println!.

  1. A small u128 prints the wrong result.
let a = 0_u128;
println!("{a}"); // 14082568811966739713
let a = 1_u128;
println!("{a}"); // 14082568811966739714
let a = 10_u128.pow(18);
println!("{a}"); // 15082568811966739713
let a = 10_u128.pow(19);
println!("{a}"); // 140825688119667397140000000421709631291
  2. A large u128 causes a segmentation fault.
let a = 2_u128.pow(84);
println!("{a}"); // segmentation fault
  3. The u128 arithmetic itself is correct, and the other format types print correctly.
let a = 2_u128.pow(84);
println!("{:b}", a); // 1000000000000000000000000000000000000000000000000000000000000000000000000000000000000
let a = (0_u128 + 1_u128) as u64;
println!("{:b}", a); // 1

I'm working on migrating code to the csky arch, which is a niche arch. I added csky support to Rust in #113658 and to libc in rust-lang/libc#3301.

I'm not sure what caused this issue. It is similar to yours. It confused me, and I don't know whether it is just my ignorance in my PR or an error in some other code.
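One way to narrow down observations like the ones above is to compare the decimal `Display` output against a hand-rolled conversion that uses only `u128` division and remainder. This is a diagnostic sketch under the assumption that `/` and `%` on `u128` behave correctly on the target (as point 3 suggests); the names `manual_decimal` and `display_matches` are hypothetical.

```rust
/// Hand-rolled decimal conversion that uses only `u128` `/` and `%`,
/// and never calls the `Display` impl for `u128`.
fn manual_decimal(mut n: u128) -> String {
    if n == 0 {
        return "0".to_string();
    }
    let mut digits = Vec::new();
    while n > 0 {
        // `n % 10` is always in 0..=9, so the cast cannot truncate.
        digits.push(b'0' + (n % 10) as u8);
        n /= 10;
    }
    digits.reverse();
    String::from_utf8(digits).expect("ASCII digits are valid UTF-8")
}

/// Returns true when the standard `Display` output agrees with the
/// hand-rolled conversion.
fn display_matches(n: u128) -> bool {
    format!("{n}") == manual_decimal(n)
}

fn main() {
    let samples = [0_u128, 1, 10_u128.pow(18), 10_u128.pow(19), 2_u128.pow(84)];
    for n in samples {
        // Print the hand-rolled result first, so it is visible even if
        // the `Display` path crashes on an affected target.
        println!("manual: {}", manual_decimal(n));
        println!("display matches: {}", display_matches(n));
    }
}
```

On an unaffected target every line should report a match. A mismatch or a crash only on the `display_matches` line would suggest the problem is in the decimal formatting path rather than in the basic arithmetic.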


@saethlin saethlin added the E-needs-mcve Call for participation: This issue has a repro, but needs a Minimal Complete and Verifiable Example label Sep 19, 2023
@saethlin
Member

saethlin commented Sep 19, 2023

Nobody ever came up with a reproducer for the original report. I just spun up a few graviton instances and tried again, and I couldn't reproduce the originally reported crash.

I'm sure we could help out if you can come up with a reproducer that doesn't require owning some niche hardware. Is there an emulator people can run?

Failing that, I'd try reporting this problem to your local expert on your arch. I strongly suspect that whatever is going on here is not too Rust-specific. This is probably an LLVM or linker problem, so anyone who can reproduce the problem and is experienced with low-level debugging could really help us out here by identifying what has gone wrong with the codegen. If this happens without optimizations, it's probably fairly localized. For example, it would help if someone could point out "The instructions look good up until this one, at which point it makes no sense. The executable should contain these instructions instead."
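For anyone who can reproduce this and wants to inspect the generated instructions, a small wrapper kept out of line may make the relevant call easier to find in a disassembly. This is only a sketch: the function name `format_u128` is hypothetical, and `std::hint::black_box` requires Rust 1.66 or later.

```rust
use std::hint::black_box;

/// Hypothetical wrapper, kept out of line so the call into the `u128`
/// `Display` impl sits behind its own symbol.
#[inline(never)]
pub fn format_u128(n: u128) -> String {
    n.to_string()
}

fn main() {
    // `black_box` discourages the compiler from constant-folding the input.
    let n = black_box(87329875_u128);
    println!("{}", format_u128(n));
}
```

Built with the dev profile, the binary can then be disassembled (for example with `objdump -d`) and searched for the symbol containing `format_u128` to follow the call chain from there.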

@Dirreke
Contributor

Dirreke commented Oct 20, 2023

Fixed it by llvm/llvm-project#69732.

@kpreid
Contributor

kpreid commented Dec 25, 2023

Triage: Relabeling issues which don't have a runnable reproduction (as opposed to having a non-minimized one) to the new label S-needs-repro.
@rustbot label +S-needs-repro -E-needs-mcve

@rustbot rustbot added S-needs-repro Status: This issue has no reproduction and needs a reproduction to make progress. and removed E-needs-mcve Call for participation: This issue has a repro, but needs a Minimal Complete and Verifiable Example labels Dec 25, 2023
@workingjubilee workingjubilee added the O-csky Target: glaCSKY above covers over me~ label May 29, 2024
@Noratrieb
Member

@Dirreke what's the status here? Your linked LLVM PR that was supposed to fix this was closed with "this is an error in the codegen of rust", which implies that there is something that needs to be fixed here. Is this true?

@Noratrieb Noratrieb removed the O-csky Target: glaCSKY above covers over me~ label Nov 9, 2024
@Noratrieb
Member

@saethlin kindly reminded me that this issue has nothing to do with CSKY, as it was found on AArch64. Therefore I'm closing this, as it does not have a reproduction; people tried to get one and failed.

@Noratrieb Noratrieb closed this as not planned Won't fix, can't repro, duplicate, stale Nov 9, 2024
@Dirreke
Contributor

Dirreke commented Nov 10, 2024

@Dirreke what's the status here? Your linked LLVM PR that was supposed to fix this was closed with "this is an error in the codegen of rust", which implies that there is something that needs to be fixed here. Is this true?

Thanks. The issue that occurred on csky has been fixed.
