Deadlocks on NetBSD around fork() #76600

he32 · 2020-09-11T09:57:15Z

I do realize this is likely to be ignored, but I just cannot let go of bringing it up anyway...

It appears to me that rust in certain circumstances is relying on undefined behavior, in that it is in general a multi-threaded program, and will in many cases do fork(), and perform many non-trivial tasks between fork() and exec(). Some of these things have manifested themselves in run-time problems observed on NetBSD: (detected) deadlocks in ld.elf_so (resulting in abort), deadlocks related to malloc() manifesting as hangs etc. In NetBSD, the fork() man page contains this passage:

     Child processes of a threaded program have additional restrictions, a
     child must only call functions that are async-signal-safe.  Very few
     functions are asynchronously safe (the list may be found in sigaction(2))
     and applications should make sure they call exec(3) as soon as possible.

A Linux/Debian man page for fork(2) also contains a similar passage:

       *  After a fork() in a multithreaded program, the child can safely call
          only async-signal-safe functions (see signal-safety(7))  until  such
          time as it calls execve(2).

Even though I don't point to offending code here, I have it on good authority that this is the root cause of the issues we have been observing on NetBSD. Is there any chance that the rust code could have a make-over so as to not rely on undefined behavior in this aspect?

The text was updated successfully, but these errors were encountered:

jonas-schievink · 2020-09-11T10:14:38Z

We do consider UB in the standard library a bug, but we can only fix it if you point out where the issue is. "rewrite the std::command module" is unlikely to be of much help.

tmiasko · 2020-09-11T17:14:01Z

There is one known async signal safety issue in the standard library, the use of pthread_mutex_unlock after fork #64718.

EDIT: There is one more issue, Command is using execvp which isn't asnyc-signal-safe, unlike variants which don't need to know PATH.

tmiasko · 2020-09-13T11:27:51Z

I can reproduce a deadlock on NetBSD; not including a reproducer, since it seems any trivial one will do. Backtrace generally shows je_jemalloc_prefork in the deadlocked child process.

The problem persisted after I patched calls to functions that aren't signal safe (removed the call to pthread_mutex_unlock, replaced execvp with execv, removed the call to pthread_sigmask - not listed as async signal safe in NetBSD manual).

The problem persisted after I rewrote the test case in C:

#include <stdio.h>
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define N 4

static pthread_t threads[N];

static void spawn() {
  pid_t p;
  p = fork();
  assert(p != -1);

  if (p == 0) {
    _exit(0);
  } else {
    pid_t r;
    int wstatus;
    r = waitpid(p, &wstatus, 0);
    assert(r != -1);
    assert(r == p);
  }
}

static void *run(void *arg) {
  for (int i=0; i != 10; ++i) {
    spawn();
  }
  return NULL;
}

int main() {
  int r;
  for (int i=0; i != N; ++i) {
    r = pthread_create(threads +i, NULL, run, NULL);
    assert(r == 0);
  }
  for (int i=0; i != N; ++i) {
    r = pthread_join(threads[i], NULL);
    assert(r == 0);
  }
  return 0;
}

RalfJung · 2020-09-13T12:17:00Z

The problem persisted after I rewrote the test case in C:

So, doesn't this mean that this is not a Rust bug but a NetBSD bug?

I can reproduce a deadlock on NetBSD; not including a reproducer, since it seems any trivial one will do.

It would still be helpful to have one. Little details often matter surprisingly much.

RalfJung · 2020-09-13T12:23:10Z

EDIT: There is one more issue, Command is using execvp which isn't asnyc-signal-safe, unlike variants which don't need to know PATH.

What is that based on? I checked the manpage and couldn't find information about signal safety of the various exec* methods. But using them after fork is one of their main usecases so this is rather surprising.

jonas-schievink · 2020-09-13T12:27:21Z

man 7 signal-safety does not list execvp, so it is not async signal safe

tmiasko · 2020-09-13T13:38:30Z

Backtrace of child process after deadlock:

#0  0x00007f7fe3a0c3aa in ___lwp_park60 () from /usr/libexec/ld.elf_so
#1  0x00007f7fe3a01595 in _rtld_shared_enter () from /usr/libexec/ld.elf_so
#2  0x00007f7fe3a00b9b in _rtld_bind () from /usr/libexec/ld.elf_so
#3  0x00007f7fe3a007fd in _rtld_bind_start () from /usr/libexec/ld.elf_so
#4  0x0000000000000246 in ?? ()
#5  0x000075686ce431fa in _lwp_ctl () from /usr/lib/libc.so.12
#6  0x000075686cf06cf9 in je_jemalloc_prefork () from /usr/lib/libc.so.12
#7  0x000075686d6f3000 in ?? ()
#8  0x0000000000000001 in ?? ()
#9  0x0000000000400b89 in spawn () at t.c:20
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Potentially related NetBSD bug report https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=49816

The posix_spawn appears to work fine, and the Rust standard library already uses it on Linux, FreeBDS and macOS. Adding bindings to libc and updating the cfg directives in std might be easiest way forward, regardless of the root cause.

he32 · 2020-09-13T14:03:00Z

Backtrace of child process after deadlock:

#0  0x00007f7fe3a0c3aa in ___lwp_park60 () from /usr/libexec/ld.elf_so
#1  0x00007f7fe3a01595 in _rtld_shared_enter () from /usr/libexec/ld.elf_so
#2  0x00007f7fe3a00b9b in _rtld_bind () from /usr/libexec/ld.elf_so
#3  0x00007f7fe3a007fd in _rtld_bind_start () from /usr/libexec/ld.elf_so
#4  0x0000000000000246 in ?? ()
#5  0x000075686ce431fa in _lwp_ctl () from /usr/lib/libc.so.12
#6  0x000075686cf06cf9 in je_jemalloc_prefork () from /usr/lib/libc.so.12
#7  0x000075686d6f3000 in ?? ()
#8  0x0000000000000001 in ?? ()
#9  0x0000000000400b89 in spawn () at t.c:20
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

This matches pretty well with one of the backtraces I've seen with routinator from NLnetLabs. This was observed with NetBSD 9.0_RC1. However, after upgrading to recent 9.0_STABLE, in particular getting this change from src/doc/CHANGES-9.1 into the system:

lib/libc/gen/pthread_atfork.c                   1.13,1.14
libexec/ld.elf_so/rtld.c                        1.204,1.205
libexec/ld.elf_so/rtld.h                        1.139,1.140
libexec/ld.elf_so/symbols.map                   1.3,1.4

        Introduce intermediate locking for fork, so that the dynamic
        linker is in a consistent state.
        [chs, ticket #907]

the hangs I had observed earlier have so far not re-surfaced with this particular program.

So... Both this and the C-based reproducer earlier appears to indicate that this may be a problem in NetBSD and not in rust. However, I think the C-based reproducer above has a different root cause. On my 9.0/amd64 test system, it didn't result in a hang with N=4, but did with N=400, resulting in one zombie, and a number of threads either in "wait" or in "parked" states.

tmiasko · 2020-09-13T16:42:44Z

I updated NetBSD from 9.0 to latest daily snapshot. The previous problems didn't reproduce so far, but I encountered another problem and reduced it to a program in C. Attaching / detaching debugger continues the execution, so that seems to be some form of missed notification.

C source

#include <stdio.h>
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define N 4

static pthread_t threads[N];

static void spawn() {
  pid_t p;
  p = fork();
  assert(p != -1);

  if (p == 0) {
    _exit(0);
  } else {
    pid_t r;
    int wstatus;
    r = waitpid(p, &wstatus, 0);
    assert(r != -1);
    assert(r == p);
  }
}

static void *run(void *arg) {
  int r;
  pthread_attr_t attr;
  pthread_t current = pthread_self();
  r = pthread_getattr_np(current, &attr);
  assert(r == 0);
  r = pthread_attr_destroy(&attr);
  assert(r == 0);

  for (int i=0; i != 10; ++i) {
    spawn();
  }
  return NULL;
}

int main() {
  int r;
  for (int i=0; i != N; ++i) {
    r = pthread_create(threads +i, NULL, run, NULL);
    assert(r == 0);
  }
  for (int i=0; i != N; ++i) {
    r = pthread_join(threads[i], NULL);
    assert(r == 0);
  }
  return 0;
}

Backtrace

Thread 2 (LWP 8923 of process 8923):
#0  0x00007bedce64556a in _lwp_wait () from /usr/lib/libc.so.12
#1  0x00007bedcee0c754 in pthread_join () from /usr/lib/libpthread.so.1
#2  0x0000000000400e8c in main () at ./w.c:51

Thread 1 (LWP 7395 of process 8923):
#0  0x00007bedce6a88ba in ___lwp_park60 () from /usr/lib/libc.so.12
#1  0x00007bedcee09791 in ?? () from /usr/lib/libpthread.so.1
#2  0x00007bedce6d6e1a in je_malloc_mutex_lock_slow () from /usr/lib/libc.so.12
#3  0x00007bedce70e601 in je_arena_choose_hard () from /usr/lib/libc.so.12
#4  0x00007bedce6b594f in je_tsd_tcache_data_init () from /usr/lib/libc.so.12
#5  0x00007bedce6b5b99 in je_tsd_tcache_enabled_data_init () from /usr/lib/libc.so.12
#6  0x00007bedce6b1c19 in je_tsd_fetch_slow () from /usr/lib/libc.so.12
#7  0x00007bedce70fa9c in calloc () from /usr/lib/libc.so.12
#8  0x00007bedcee0af73 in ?? () from /usr/lib/libpthread.so.1
#9  0x00007bedcee0b066 in pthread_attr_get_np () from /usr/lib/libpthread.so.1
#10 0x00007bedcee0b8eb in pthread_getattr_np () from /usr/lib/libpthread.so.1
#11 0x0000000000400d92 in run (arg=0x0) at ./w.c:33
#12 0x00007bedcee0bee0 in ?? () from /usr/lib/libpthread.so.1
#13 0x00007bedce6924e0 in ?? () from /usr/lib/libc.so.12
#14 0x0000000000200000 in ?? ()
#15 0x0000000000000000 in ?? ()

EDIT: This one can be reproduced without fork.

EDIT: Similar problems discussed on current-users mailing list:

jyn514 · 2020-09-14T23:47:20Z

Both this and the C-based reproducer earlier appears to indicate that this may be a problem in NetBSD and not in rust
I encountered another problem and reduced it to a program in C.

Should this be closed, then?

joshtriplett · 2020-09-15T03:42:54Z

man 7 signal-safety does not list execvp, so it is not async signal safe

To the best of my knowledge, execvp is only considered "not async signal safe" because all functions that read the environment aren't safe against concurrent modifications: if you change PATH concurrently from another thread, you have no guarantees whether it'll use the old PATH, the new, or some combination thereof. execvp is listed as MT-Safe env, meaning "it's multi-thread safe as long as you don't expect environment variables to do any synchronization for you".

he32 · 2020-09-18T22:00:28Z

@tmiasko I had someone run your C program on NetBSD-current (not just NetBSD-9 stable), and there apparently the problem with stuck threads / processes was not reproducible, so there is hope that this bug won't be there when NetBSD 10 is released.

tmiasko · 2020-09-19T17:31:24Z

I can reproduce the last issue on daily snapshot from 2020-09-18, usually after just a few seconds of executing program in the loop. If you do reproduce it at some point, I would appreciate if you could report this upstream, since I generally don't use NetBSD.

Simplified reproducer, since it turns out neither fork nor pthread_getattr_np is required, except for the fact that latter is allocating memory:

#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define N 4

static pthread_t threads[N];

static void *run(void *arg) {
        return malloc(1024);
}

int main() {
        for (int i = 0; i != N; ++i) assert(pthread_create(&threads[i], NULL, run, NULL) == 0);
        for (int i = 0; i != N; ++i) assert(pthread_join(threads[i], NULL) == 0);
}

jyn514 · 2020-09-19T17:45:27Z

I'm going to close this, please report it upstream as a NetBSD issue. It's neither related to rust nor fork().

he32 · 2020-09-19T20:50:54Z

@tmiasko I reported http://gnats.netbsd.org/55670 -- this is apparently caused by concurrency bugs in our "jemalloc" implementation in the netbsd-9 code base. Reportedly this is fixed in NetBSD-current, and will be in NetBSD 10.0 when that comes out.

LeSeulArtichaut added T-libs Relevant to the library team, which will review and decide on the PR/issue. C-bug Category: This is a bug. labels Sep 11, 2020

RalfJung changed the title ~~Rust should not rely on undefined behaviour~~ Deadlocks on NetBSD around fork() Sep 13, 2020

RalfJung added the O-netbsd Operating system: NetBSD label Sep 13, 2020

RalfJung mentioned this issue Sep 13, 2020

std::process::Command doesn't follow Unix signal safety in forked child #64718

Open

jyn514 closed this as completed Sep 19, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deadlocks on NetBSD around fork() #76600

Deadlocks on NetBSD around fork() #76600

he32 commented Sep 11, 2020

jonas-schievink commented Sep 11, 2020

tmiasko commented Sep 11, 2020 •

edited

Loading

tmiasko commented Sep 13, 2020

RalfJung commented Sep 13, 2020

RalfJung commented Sep 13, 2020

jonas-schievink commented Sep 13, 2020

tmiasko commented Sep 13, 2020

he32 commented Sep 13, 2020 •

edited

Loading

tmiasko commented Sep 13, 2020 •

edited

Loading

jyn514 commented Sep 14, 2020

joshtriplett commented Sep 15, 2020

he32 commented Sep 18, 2020

tmiasko commented Sep 19, 2020

jyn514 commented Sep 19, 2020

he32 commented Sep 19, 2020

Deadlocks on NetBSD around fork() #76600

Deadlocks on NetBSD around fork() #76600

Comments

he32 commented Sep 11, 2020

jonas-schievink commented Sep 11, 2020

tmiasko commented Sep 11, 2020 • edited Loading

tmiasko commented Sep 13, 2020

RalfJung commented Sep 13, 2020

RalfJung commented Sep 13, 2020

jonas-schievink commented Sep 13, 2020

tmiasko commented Sep 13, 2020

he32 commented Sep 13, 2020 • edited Loading

tmiasko commented Sep 13, 2020 • edited Loading

jyn514 commented Sep 14, 2020

joshtriplett commented Sep 15, 2020

he32 commented Sep 18, 2020

tmiasko commented Sep 19, 2020

jyn514 commented Sep 19, 2020

he32 commented Sep 19, 2020

tmiasko commented Sep 11, 2020 •

edited

Loading

he32 commented Sep 13, 2020 •

edited

Loading

tmiasko commented Sep 13, 2020 •

edited

Loading