multiprocessing's default posix start method of 'fork' is broken: change to 'forkserver' || 'spawn' #84559
Comments
By default, multiprocessing uses fork() without exec() on POSIX. For a variety of reasons this can lead to inconsistent state in subprocesses: module-level globals are copied, which can mess up logging; threads don't survive fork(); etc. The end results vary, but quite often they are silent lockups. In real-world usage, this results in users getting mysterious hangs they do not have the knowledge to debug. The fix for these people is to use "spawn" by default, which is the default on Windows. Just a small sample:
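For anyone hitting these hangs today, opting in to "spawn" at a single call site is possible without touching the global default. A minimal sketch, assuming a hypothetical `report` worker function; `multiprocessing.get_context` and the context's `Process` class are standard library APIs:

```python
import multiprocessing
import os


def report(label):
    # Runs in the child. Under "spawn" the child is a fresh interpreter
    # that imports this module again rather than a copy of the parent.
    print(f"{label}: running in pid {os.getpid()}")


def main():
    # A context object scopes the start method to this call site and
    # leaves the process-wide default untouched.
    ctx = multiprocessing.get_context("spawn")
    child = ctx.Process(target=report, args=("spawned child",))
    child.start()
    child.join()
    print("child exit code:", child.exitcode)


if __name__ == "__main__":
    # The guard matters with "spawn": the child re-imports this module
    # and must not start children of its own.
    main()
```

Calling `multiprocessing.set_start_method("spawn")` once inside the `__main__` guard should have the same effect process-wide.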
I suggest changing the default on POSIX to match Windows.
Looks like as of 3.8 this only impacts Linux/non-macOS-POSIX, so I'll amend the above to say this will also make it consistent with macOS.
Just got an email from someone for whom switching to "spawn" fixed a problem. Earlier this week someone tweeted about this fixing things. This keeps hitting people in the real world.
Another person with the same issue: https://twitter.com/volcan01010/status/1324764531139248128
I just ran into and fixed (thanks to itamarst's blog post) a problem likely related to this. My multiprocessing workers perform work and send a logging message back with success/fail info. I had a few intermittent deadlocks that became a recurring problem when I sped up the step that skips tasks which had previously completed (I think this shortened the time between forking and attempting to send messages, causing the third process to deadlock). After that change it deadlocked *every time*. Switching to "spawn" at the top of the main function has fixed it.
The problem with changing the default is that this will break any application that depends on passing non-picklable data to the child process (in addition to the potentially unexpected performance impact). The docs already contain a significant elaboration on the matter, but feel free to submit a PR that would make the various caveats more explicit:
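To make the pickling concern concrete: with "spawn" or "forkserver", the target and arguments of a `Process` are pickled and sent to the child, whereas with "fork" they are simply inherited. A rough pre-flight check, assuming a hypothetical helper name `can_be_sent_to_child`, might look like this:

```python
import math
import pickle
import threading


def can_be_sent_to_child(obj):
    """Hypothetical helper: True if obj survives pickling, which
    "spawn" and "forkserver" need for Process targets and arguments."""
    try:
        pickle.dumps(obj)
    except Exception:
        # pickle raises different exception types depending on the object.
        return False
    return True


# An importable function is pickled by reference, so it is fine.
print(can_be_sent_to_child(math.sqrt))
# A lambda has no importable name, so the standard pickler rejects it.
print(can_be_sent_to_child(lambda x: x + 1))
# Locks cannot be pickled at all.
print(can_be_sent_to_child(threading.Lock()))
```

Code that passes lambdas, locks, or open connections to workers runs under "fork" and fails under the other start methods, which is the breakage described above.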
This change was made on macOS at some point, so why not Linux? "spawn" is already the default on macOS and Windows.
The macOS change was required because "fork" simply ceased to work.
Given people's general experience, I would not say that "fork" works on Linux either. More like "99% of the time it works, 1% of the time it randomly breaks in mysterious ways".
Agreed, but again, changing will break some applications. We could switch to forkserver, but we should have a transition period where a FutureWarning will be displayed if people didn't explicitly set a start method.
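Application code can run a similar check itself. The sketch below is not how CPython implements any transition warning; it assumes a hypothetical helper `warn_if_start_method_unset` and relies on `multiprocessing.get_start_method(allow_none=True)`, which returns `None` while no start method has been fixed for the process:

```python
import multiprocessing
import warnings


def warn_if_start_method_unset():
    """Hypothetical helper: warn when the platform default is in use.

    Returns True if a warning was issued.
    """
    # allow_none=True avoids fixing the default as a side effect of asking.
    if multiprocessing.get_start_method(allow_none=True) is None:
        warnings.warn(
            "no multiprocessing start method was set explicitly; "
            "the platform default may differ between Python versions",
            FutureWarning,
            stacklevel=2,
        )
        return True
    return False
```

Creating pools or queues through the top-level `multiprocessing` functions also fixes the default, so a check like this is only meaningful early in program start-up.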
After updating PyPy3 to use Python 3.9's stdlib, we hit very bad hangs because of this: literally compiling a single file with "parallel" compileall could hang. In the end, we had to revert the change in how Python 3.9 starts workers because otherwise multiprocessing would be impossible to use: https://foss.heptapod.net/pypy/pypy/-/commit/c594b6c48a48386e8ac1f3f52d4b82f9c3e34784 This is a very bad default, and what's even worse is that it often causes deadlocks that are hard to reproduce or debug. Furthermore, since "fork" is the default, people unintentionally rely on its support for passing non-picklable objects and end up with non-portable code. The code often becomes complex and hard to change before they discover the problem. Before we figured out how to work around the deadlocks in PyPy3, we experimented with switching the default to "spawn". Unfortunately, we hit multiple projects that didn't work with this method, precisely because of pickling problems. Furthermore, their authors were surprised to learn that their code wouldn't work on macOS (in the end, many people perceive Python as a language for writing portable software). Finally, back in 2018 I made one of my projects do parallel work using multiprocessing. It gave its users a great speedup, but for some it caused deadlocks that I could neither reproduce nor debug. In the end, I had to revert it. Now that I've learned about this problem, I wonder whether the "fork" method was the cause.
Provide a way for the calling code to specify which "multiprocessing context" to use to spawn subprocesses. See https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods I'm using this to allow us to mock out multiprocessing with multithreading in doctests. This will also let you more easily test differences between "spawn" and "fork" modes. I'm defaulting to using "spawn" because I think "fork" mode was the cause of some mysterious hanging in tests. General consensus seems to be "spawn" is less buggy: python/cpython#84559 I've felt like tests are consistently faster with it. Also uses the `multiprocessing.Manager` as a context manager so it gets cleaned up correctly. This might have been the cause of other hanging in local cluster execution.
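The pattern described here, letting the caller choose the context and swapping in threads for tests, can be sketched roughly as follows. `parallel_map` and its `pool_factory` parameter are hypothetical names, not the project's actual API; `multiprocessing.dummy.Pool` is the standard library's thread-backed pool with the same interface:

```python
import multiprocessing
import multiprocessing.dummy  # thread-backed stand-in with the same Pool API


def parallel_map(func, items, pool_factory=None, workers=2):
    """Hypothetical helper: map func over items using a caller-chosen pool."""
    if pool_factory is None:
        # Default to "spawn", as the change described above does.
        pool_factory = multiprocessing.get_context("spawn").Pool
    # Using the pool as a context manager ensures it is cleaned up.
    with pool_factory(workers) as pool:
        return pool.map(func, items)


# In tests or doctests, threads avoid process start-up entirely.
print(parallel_map(abs, [-1, -2, -3], pool_factory=multiprocessing.dummy.Pool))
```

Passing `multiprocessing.get_context("fork").Pool` instead (on platforms that support it) makes it easy to compare the two start methods from the same call site.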
Another example: Nelson Elhage reports that "as of recently(?) pytorch silently deadlocks (even without GPUs involved at all) using method=fork so that's been fun to debug". Examples he provided:
After updating a couple of libraries in a project we are working on, the code would hang without much explanation. After much debugging, I think one of the reasons for our issues is the forking default (this issue). Our business logic does not use multiprocessing, but the underlying execution engine does (in our case Luigi). It turns out that the gRPC client (which was buried deep in one of our dependencies) can hang in some cases when forked (grpc/grpc#18075). This was the case for us, and it was very tricky to debug.
spawn
general plan:
|
The default of `fork` is known to be problematic. Python itself is changing the default to `spawn`. The new default is expected to be in place for Python 3.14. Python references for the change to the default: * python/cpython#84559 * python/cpython#100618 We also have several places where this option had to be set to `spawn` to make tests work. The AMD code even checks and overrides the value if it's not set to `spawn`. Simplify things for everyone and just default to `spawn`, but leave the option in place just in case, at least for now. Signed-off-by: Russell Bryant <[email protected]>
…ver` (GH-101556) Change the default multiprocessing start method away from fork to forkserver or spawn on the remaining platforms where it was fork. See the issue for context. This makes the default far more thread safe (other than for people spawning threads at import time... - don't do that!). Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com> Co-authored-by: Hugo van Kemenade <[email protected]>
This is in and done for 3.14 per our plan.
Thank you so much to everyone who worked on this!
…to missing hmac-sha256 Default to the spawn start method in that scenario.
We have observed that sometimes a multiprocessing worker fails to properly terminate, getting stuck somewhere in the python multiprocessing internals after the whole of `process_with_threads` has completed. This results in the entire test suite hanging at 99% completion, as the process join never completes. This appears to be due to starting the `responses_processor` thread before starting the worker processes - the default multiprocessing start method on POSIX is `fork` which directly forks the python interpreter without execing. This is generally unsafe in a multithreaded environment as the child process may fork while another thread of the parent has locked arbitrary mutexes or similar, meaning they are already-locked in the child without any thread to ever unlock them, leading to deadlocks if the child ever tries to lock them itself. In fact, the default is changing to `forkserver` in Python 3.14 precisely because of subtle issues like this (see python/cpython#84559). Rather than making that same change here now, move the thread creation after the process creation to remain compatible with both `fork` and `forkserver`. There is no need to start the thread that early anyway; the worst that could happen is a few responses piling up in the meantime. This appears to fix the hang, as it has not reproduced with this patch in several days of continuous runs (where previously it reproduced within a few minutes). It is possible that the macOS-specific logic at the top of the file that "[forces] forking behavior at the expense of safety" should be revisited too, since the docs suggest that system libraries could create threads without our knowledge, but this is deferred to future work as no specific problems have been observed yet, and the docs suggest that problems here would lead to crashes rather than hangs. Co-authored-by: Andrea Segalini <[email protected]>
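The ordering fix described in this commit message can be illustrated with a small sketch. It is not the project's code: `worker`, `run`, and the `mp` parameter are hypothetical, with `mp` standing for anything that provides `Process` and `Queue` (a multiprocessing context, or `multiprocessing.dummy` for a thread-only dry run):

```python
import threading
import multiprocessing.dummy  # thread-backed Process and Queue for dry runs


def worker(task_queue, response_queue):
    # Consume tasks until the None sentinel arrives.
    for item in iter(task_queue.get, None):
        response_queue.put(item * 2)


def run(mp, tasks, n_workers=2):
    task_q, resp_q = mp.Queue(), mp.Queue()
    results = []

    def collect():
        # Responses processor: one response is expected per task.
        for _ in tasks:
            results.append(resp_q.get())

    # 1. Start the workers first, before this function creates any thread,
    #    so a fork cannot copy a lock held by the collector thread.
    workers = [mp.Process(target=worker, args=(task_q, resp_q))
               for _ in range(n_workers)]
    for proc in workers:
        proc.start()

    # 2. Only now start the helper thread.
    collector = threading.Thread(target=collect)
    collector.start()

    for task in tasks:
        task_q.put(task)
    for _ in workers:
        task_q.put(None)  # one stop sentinel per worker

    collector.join()
    for proc in workers:
        proc.join()
    return sorted(results)


print(run(multiprocessing.dummy, [1, 2, 3]))
```

With real processes, the same function would be called with a context such as `multiprocessing.get_context("forkserver")` from inside an `if __name__ == "__main__":` guard.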
Considering we recently have yet another case where changing
If docs improvements are needed, especially for users, I could work on that tomorrow. cc @gpshead
Documentation wise, what we need to do is improve the "What's New in 3.14" entry for multiprocessing. It doesn't currently go into details on the user-visible code behavior differences of the change. So people not already intimately familiar with the semantic consequences of the differences in multiprocessing start method behaviors ("there are dozens of us!") are currently left unaware. Out of curiosity, did we get a pile of these bug reports with the release of Python 3.8? The start method was changed in 3.8 for macOS without even a deprecation period (because the platform gave us no choice). https://docs.python.org/3/whatsnew/3.8.html#multiprocessing
broken link, no idea what page you meant. We don't tend to list implementation details in a downloads page; many people do not get their Pythons (including alpha/beta/rcs) from such a place anyway. Downloads of versions link directly to What's New, which is our canonical doc of important highlights for any given release.
feel free to work up a docs change and loop me in on the PR. :) I think the big one to highlight is a human-level explanation of what will be accessible in the child process code. Ideally without requiring the reader to understand pickle either. 😅
Ah sorry, I meant the following for instance: https://www.python.org/downloads/release/python-3131/. We have highlights of what changed with this version and not everyone looks at the docs when downloading the version (those highlights are the RM's responsibility I think?). I'll work on a docs PR now and tag you when I'm done.
I don't know :( I wasn't involved with Python at that time!
Fixes warnings related to concurrent use of threading and `os.fork`, which has never been supported. ``` DeprecationWarning: This process (pid=395083) is multi-threaded, use of fork() may lead to deadlocks in the child. ``` Related: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods Related: https://docs.python.org/3/library/os.html#os.fork Related: python/cpython#84559
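One way to avoid forking a multi-threaded parent is to request a start method that does not fork it directly. A minimal sketch, assuming hypothetical helper names; `multiprocessing.get_all_start_methods` and `multiprocessing.get_context` are standard library calls:

```python
import multiprocessing


def pick_thread_safe_start_method():
    """Hypothetical helper: prefer "forkserver", fall back to "spawn"."""
    available = multiprocessing.get_all_start_methods()
    # "forkserver" is not offered on every platform (for example Windows).
    return "forkserver" if "forkserver" in available else "spawn"


def thread_safe_context():
    # Neither method forks the calling (possibly multi-threaded) process
    # directly, so the child never inherits locks held by its threads.
    return multiprocessing.get_context(pick_thread_safe_start_method())


print(pick_thread_safe_start_method())
```

Both methods pickle what is sent to the child, so worker targets and arguments need to be picklable.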
Linked PRs
forkserver #101556
multiprocessing start method changes #128173