Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlocks on NetBSD around fork() #76600

Closed
he32 opened this issue Sep 11, 2020 · 15 comments
Closed

Deadlocks on NetBSD around fork() #76600

he32 opened this issue Sep 11, 2020 · 15 comments
Labels
C-bug Category: This is a bug. O-netbsd Operating system: NetBSD T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@he32
Copy link
Contributor

he32 commented Sep 11, 2020

I do realize this is likely to be ignored, but I just cannot let go of bringing it up anyway...

It appears to me that rust in certain circumstances is relying on undefined behavior, in that it is in general a multi-threaded program, and will in many cases do fork(), and perform many non-trivial tasks between fork() and exec(). Some of these things have manifested themselves in run-time problems observed on NetBSD: (detected) deadlocks in ld.elf_so (resulting in abort), deadlocks related to malloc() manifesting as hangs etc. In NetBSD, the fork() man page contains this passage:

     Child processes of a threaded program have additional restrictions, a
     child must only call functions that are async-signal-safe.  Very few
     functions are asynchronously safe (the list may be found in sigaction(2))
     and applications should make sure they call exec(3) as soon as possible.

A Linux/Debian man page for fork(2) also contains a similar passage:

       *  After a fork() in a multithreaded program, the child can safely call
          only async-signal-safe functions (see signal-safety(7))  until  such
          time as it calls execve(2).

Even though I don't point to offending code here, I have it on good authority that this is the root cause of the issues we have been observing on NetBSD. Is there any chance that the rust code could have a make-over so as to not rely on undefined behavior in this aspect?

@jonas-schievink
Copy link
Contributor

We do consider UB in the standard library a bug, but we can only fix it if you point out where the issue is. "rewrite the std::command module" is unlikely to be of much help.

@tmiasko
Copy link
Contributor

tmiasko commented Sep 11, 2020

There is one known async signal safety issue in the standard library, the use of pthread_mutex_unlock after fork #64718.

EDIT: There is one more issue, Command is using execvp which isn't asnyc-signal-safe, unlike variants which don't need to know PATH.

@LeSeulArtichaut LeSeulArtichaut added T-libs Relevant to the library team, which will review and decide on the PR/issue. C-bug Category: This is a bug. labels Sep 11, 2020
@tmiasko
Copy link
Contributor

tmiasko commented Sep 13, 2020

I can reproduce a deadlock on NetBSD; not including a reproducer, since it seems any trivial one will do. Backtrace generally shows je_jemalloc_prefork in the deadlocked child process.

The problem persisted after I patched calls to functions that aren't signal safe (removed the call to pthread_mutex_unlock, replaced execvp with execv, removed the call to pthread_sigmask - not listed as async signal safe in NetBSD manual).

The problem persisted after I rewrote the test case in C:

#include <stdio.h>
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define N 4

static pthread_t threads[N];

static void spawn() {
  pid_t p;
  p = fork();
  assert(p != -1);

  if (p == 0) {
    _exit(0);
  } else {
    pid_t r;
    int wstatus;
    r = waitpid(p, &wstatus, 0);
    assert(r != -1);
    assert(r == p);
  }
}

static void *run(void *arg) {
  for (int i=0; i != 10; ++i) {
    spawn();
  }
  return NULL;
}

int main() {
  int r;
  for (int i=0; i != N; ++i) {
    r = pthread_create(threads +i, NULL, run, NULL);
    assert(r == 0);
  }
  for (int i=0; i != N; ++i) {
    r = pthread_join(threads[i], NULL);
    assert(r == 0);
  }
  return 0;
}

@RalfJung RalfJung changed the title Rust should not rely on undefined behaviour Deadlocks on NetBSD around fork() Sep 13, 2020
@RalfJung RalfJung added the O-netbsd Operating system: NetBSD label Sep 13, 2020
@RalfJung
Copy link
Member

The problem persisted after I rewrote the test case in C:

So, doesn't this mean that this is not a Rust bug but a NetBSD bug?

I can reproduce a deadlock on NetBSD; not including a reproducer, since it seems any trivial one will do.

It would still be helpful to have one. Little details often matter surprisingly much.

@RalfJung
Copy link
Member

EDIT: There is one more issue, Command is using execvp which isn't asnyc-signal-safe, unlike variants which don't need to know PATH.

What is that based on? I checked the manpage and couldn't find information about signal safety of the various exec* methods. But using them after fork is one of their main usecases so this is rather surprising.

@jonas-schievink
Copy link
Contributor

man 7 signal-safety does not list execvp, so it is not async signal safe

@tmiasko
Copy link
Contributor

tmiasko commented Sep 13, 2020

Backtrace of child process after deadlock:

#0  0x00007f7fe3a0c3aa in ___lwp_park60 () from /usr/libexec/ld.elf_so
#1  0x00007f7fe3a01595 in _rtld_shared_enter () from /usr/libexec/ld.elf_so
#2  0x00007f7fe3a00b9b in _rtld_bind () from /usr/libexec/ld.elf_so
#3  0x00007f7fe3a007fd in _rtld_bind_start () from /usr/libexec/ld.elf_so
#4  0x0000000000000246 in ?? ()
#5  0x000075686ce431fa in _lwp_ctl () from /usr/lib/libc.so.12
#6  0x000075686cf06cf9 in je_jemalloc_prefork () from /usr/lib/libc.so.12
#7  0x000075686d6f3000 in ?? ()
#8  0x0000000000000001 in ?? ()
#9  0x0000000000400b89 in spawn () at t.c:20
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Potentially related NetBSD bug report https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=49816

The posix_spawn appears to work fine, and the Rust standard library already uses it on Linux, FreeBDS and macOS. Adding bindings to libc and updating the cfg directives in std might be easiest way forward, regardless of the root cause.

@he32
Copy link
Contributor Author

he32 commented Sep 13, 2020

Backtrace of child process after deadlock:

#0  0x00007f7fe3a0c3aa in ___lwp_park60 () from /usr/libexec/ld.elf_so
#1  0x00007f7fe3a01595 in _rtld_shared_enter () from /usr/libexec/ld.elf_so
#2  0x00007f7fe3a00b9b in _rtld_bind () from /usr/libexec/ld.elf_so
#3  0x00007f7fe3a007fd in _rtld_bind_start () from /usr/libexec/ld.elf_so
#4  0x0000000000000246 in ?? ()
#5  0x000075686ce431fa in _lwp_ctl () from /usr/lib/libc.so.12
#6  0x000075686cf06cf9 in je_jemalloc_prefork () from /usr/lib/libc.so.12
#7  0x000075686d6f3000 in ?? ()
#8  0x0000000000000001 in ?? ()
#9  0x0000000000400b89 in spawn () at t.c:20
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

This matches pretty well with one of the backtraces I've seen with routinator from NLnetLabs. This was observed with NetBSD 9.0_RC1. However, after upgrading to recent 9.0_STABLE, in particular getting this change from src/doc/CHANGES-9.1 into the system:

lib/libc/gen/pthread_atfork.c                   1.13,1.14
libexec/ld.elf_so/rtld.c                        1.204,1.205
libexec/ld.elf_so/rtld.h                        1.139,1.140
libexec/ld.elf_so/symbols.map                   1.3,1.4

        Introduce intermediate locking for fork, so that the dynamic
        linker is in a consistent state.
        [chs, ticket #907]

the hangs I had observed earlier have so far not re-surfaced with this particular program.

So... Both this and the C-based reproducer earlier appears to indicate that this may be a problem in NetBSD and not in rust. However, I think the C-based reproducer above has a different root cause. On my 9.0/amd64 test system, it didn't result in a hang with N=4, but did with N=400, resulting in one zombie, and a number of threads either in "wait" or in "parked" states.

@tmiasko
Copy link
Contributor

tmiasko commented Sep 13, 2020

I updated NetBSD from 9.0 to latest daily snapshot. The previous problems didn't reproduce so far, but I encountered another problem and reduced it to a program in C. Attaching / detaching debugger continues the execution, so that seems to be some form of missed notification.

C source
#include <stdio.h>
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define N 4

static pthread_t threads[N];

static void spawn() {
  pid_t p;
  p = fork();
  assert(p != -1);

  if (p == 0) {
    _exit(0);
  } else {
    pid_t r;
    int wstatus;
    r = waitpid(p, &wstatus, 0);
    assert(r != -1);
    assert(r == p);
  }
}

static void *run(void *arg) {
  int r;
  pthread_attr_t attr;
  pthread_t current = pthread_self();
  r = pthread_getattr_np(current, &attr);
  assert(r == 0);
  r = pthread_attr_destroy(&attr);
  assert(r == 0);

  for (int i=0; i != 10; ++i) {
    spawn();
  }
  return NULL;
}

int main() {
  int r;
  for (int i=0; i != N; ++i) {
    r = pthread_create(threads +i, NULL, run, NULL);
    assert(r == 0);
  }
  for (int i=0; i != N; ++i) {
    r = pthread_join(threads[i], NULL);
    assert(r == 0);
  }
  return 0;
}
Backtrace
Thread 2 (LWP 8923 of process 8923):
#0  0x00007bedce64556a in _lwp_wait () from /usr/lib/libc.so.12
#1  0x00007bedcee0c754 in pthread_join () from /usr/lib/libpthread.so.1
#2  0x0000000000400e8c in main () at ./w.c:51

Thread 1 (LWP 7395 of process 8923):
#0  0x00007bedce6a88ba in ___lwp_park60 () from /usr/lib/libc.so.12
#1  0x00007bedcee09791 in ?? () from /usr/lib/libpthread.so.1
#2  0x00007bedce6d6e1a in je_malloc_mutex_lock_slow () from /usr/lib/libc.so.12
#3  0x00007bedce70e601 in je_arena_choose_hard () from /usr/lib/libc.so.12
#4  0x00007bedce6b594f in je_tsd_tcache_data_init () from /usr/lib/libc.so.12
#5  0x00007bedce6b5b99 in je_tsd_tcache_enabled_data_init () from /usr/lib/libc.so.12
#6  0x00007bedce6b1c19 in je_tsd_fetch_slow () from /usr/lib/libc.so.12
#7  0x00007bedce70fa9c in calloc () from /usr/lib/libc.so.12
#8  0x00007bedcee0af73 in ?? () from /usr/lib/libpthread.so.1
#9  0x00007bedcee0b066 in pthread_attr_get_np () from /usr/lib/libpthread.so.1
#10 0x00007bedcee0b8eb in pthread_getattr_np () from /usr/lib/libpthread.so.1
#11 0x0000000000400d92 in run (arg=0x0) at ./w.c:33
#12 0x00007bedcee0bee0 in ?? () from /usr/lib/libpthread.so.1
#13 0x00007bedce6924e0 in ?? () from /usr/lib/libc.so.12
#14 0x0000000000200000 in ?? ()
#15 0x0000000000000000 in ?? ()

EDIT: This one can be reproduced without fork.

EDIT: Similar problems discussed on current-users mailing list:

@jyn514
Copy link
Member

jyn514 commented Sep 14, 2020

Both this and the C-based reproducer earlier appears to indicate that this may be a problem in NetBSD and not in rust
I encountered another problem and reduced it to a program in C.

Should this be closed, then?

@joshtriplett
Copy link
Member

man 7 signal-safety does not list execvp, so it is not async signal safe

To the best of my knowledge, execvp is only considered "not async signal safe" because all functions that read the environment aren't safe against concurrent modifications: if you change PATH concurrently from another thread, you have no guarantees whether it'll use the old PATH, the new, or some combination thereof. execvp is listed as MT-Safe env, meaning "it's multi-thread safe as long as you don't expect environment variables to do any synchronization for you".

@he32
Copy link
Contributor Author

he32 commented Sep 18, 2020

@tmiasko I had someone run your C program on NetBSD-current (not just NetBSD-9 stable), and there apparently the problem with stuck threads / processes was not reproducible, so there is hope that this bug won't be there when NetBSD 10 is released.

@tmiasko
Copy link
Contributor

tmiasko commented Sep 19, 2020

I can reproduce the last issue on daily snapshot from 2020-09-18, usually after just a few seconds of executing program in the loop. If you do reproduce it at some point, I would appreciate if you could report this upstream, since I generally don't use NetBSD.

Simplified reproducer, since it turns out neither fork nor pthread_getattr_np is required, except for the fact that latter is allocating memory:

#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define N 4

static pthread_t threads[N];

static void *run(void *arg) {
        return malloc(1024);
}

int main() {
        for (int i = 0; i != N; ++i) assert(pthread_create(&threads[i], NULL, run, NULL) == 0);
        for (int i = 0; i != N; ++i) assert(pthread_join(threads[i], NULL) == 0);
}

@jyn514
Copy link
Member

jyn514 commented Sep 19, 2020

I'm going to close this, please report it upstream as a NetBSD issue. It's neither related to rust nor fork().

@jyn514 jyn514 closed this as completed Sep 19, 2020
@he32
Copy link
Contributor Author

he32 commented Sep 19, 2020

@tmiasko I reported http://gnats.netbsd.org/55670 -- this is apparently caused by concurrency bugs in our "jemalloc" implementation in the netbsd-9 code base. Reportedly this is fixed in NetBSD-current, and will be in NetBSD 10.0 when that comes out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug. O-netbsd Operating system: NetBSD T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

7 participants