gh-107219: Fix concurrent.futures terminate_broken() #109244

vstinner · 2023-09-10T23:28:10Z

Fix a race condition in concurrent.futures. When a process in the process pool was terminated abruptly (while the future was running or pending), close the connection write end. If the call queue is blocked on sending bytes to a worker process, closing the connection write end interrupts the send, so the queue can be closed.

Changes:

* _ExecutorManagerThread.terminate_broken() now closes call_queue._writer.
* multiprocessing PipeConnection.close() now interrupts WaitForMultipleObjects() in _send_bytes() by cancelling the overlapped operation.

Issue: test_concurrent_futures.test_deadlock: test_crash_big_data() hangs randomly on Windows #107219

vstinner · 2023-09-10T23:37:01Z

@serhiy-storchaka @methane @ambv @gpshead @pitrou: Would you mind having a look?

I would like to merge this fix as soon as possible, since bug #107219 is badly disrupting the Python development workflow. The CI failure rate is very high because of this test_concurrent_futures.test_deadlock hang.

For now, I prefer to use WSA_OPERATION_ABORTED = 995 in Lib/multiprocessing/connection.py to ease backports. Later, I will try to add this constant somewhere :-) My first attempt to add it to the errno module didn't work (I didn't insist, since I was focused on the fix).

vstinner · 2023-09-10T23:41:50Z

With this change, I can no longer reproduce the bug.

On my Windows VM, which has 2 CPUs, I can easily reproduce the hang in around 30 seconds on the Python main branch:

Terminal 1: python -m test test_concurrent_futures.test_deadlock -m test.test_concurrent_futures.test_deadlock.ProcessPoolSpawnExecutorDeadlockTest.test_crash_big_data --forever -v --timeout=10
Terminal 2: python -m test -j2 -r

I stressed the test with:

Terminal 1, terminal 2 and terminal 3 (3 processes):
- python -m test test_concurrent_futures.test_deadlock -m test.test_concurrent_futures.test_deadlock.ProcessPoolSpawnExecutorDeadlockTest.test_crash_big_data --forever -v --timeout=10
Terminal 4: python -m test test_concurrent_futures.test_deadlock -m test.test_concurrent_futures.test_deadlock.ProcessPoolSpawnExecutorDeadlockTest.test_crash_big_data --forever -v --timeout=30 -j2
Terminal 5: python -m test -j1 -r -u all

After 8 minutes of this stress test, I could no longer reproduce the bug with this change.

Bonus: the test no longer hangs when I interrupt it with CTRL+C.

vstinner · 2023-09-10T23:53:39Z

Windows (x64) (pull_request) Successful

Oh! For the first time in like 2 weeks, test_concurrent_futures.test_deadlock did not hang in the GHA Windows x64 job!

Note: There are only these two unrelated failures:

2 re-run tests:
    test.test_asyncio.test_windows_events
    test.test_concurrent_futures.test_as_completed

These 2 tests passed when re-run in verbose mode (Result: FAILURE then SUCCESS).

vstinner · 2023-09-11T01:28:21Z

Lib/multiprocessing/connection.py

+            ov = self._send_ov
+            if ov is not None:
+                # Interrupt WaitForMultipleObjects() in _send_bytes()
+                ov.cancel()


asyncio uses similar code in ProactorEventLoop:

cpython/Lib/asyncio/windows_events.py

Lines 67 to 81 in 1ec4537

    def _cancel_overlapped(self):
        if self._ov is None:
            return
        try:
            self._ov.cancel()
        except OSError as exc:
            context = {
                'message': 'Cancelling an overlapped future failed',
                'exception': exc,
                'future': self,
            }
            if self._source_traceback:
                context['source_traceback'] = self._source_traceback
            self._loop.call_exception_handler(context)
        self._ov = None

asyncio wraps this in more elaborate code to handle more cases. For example, in asyncio, cancel() is part of the public API.

Here, the cancellation is a standard action in the Windows Overlapped API. The cancellation is synchronous, so it's easy!

Fortunately, we are not in the very complicated RegisterWaitWithQueue() case! That case requires an asynchronous cancellation, which is really complicated to handle: the completion of the cancellation has to be awaited!? See this horror story: https://vstinner.github.io/asyncio-proactor-cancellation-from-hell.html
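To make the close()/cancel() interaction concrete, here is a small, platform-independent toy model. It is only a sketch: ToyOverlapped and ToyPipeConnection are hypothetical stand-ins built on threading.Event, not the real _winapi.Overlapped or multiprocessing PipeConnection, and the blocking wait merely imitates WaitForMultipleObjects().

```python
import errno
import threading

ERROR_OPERATION_ABORTED = 995  # Windows error code for a cancelled I/O operation


class ToyOverlapped:
    """Stand-in for an overlapped write; NOT the real _winapi.Overlapped."""

    def __init__(self):
        self.event = threading.Event()  # set on completion or cancellation
        self.err = 0

    def cancel(self):
        # Synchronous cancellation: record the error, then wake the waiter.
        self.err = ERROR_OPERATION_ABORTED
        self.event.set()


class ToyPipeConnection:
    """Hypothetical model of the send/close interaction described above."""

    def __init__(self):
        self._send_ov = None
        self.started = threading.Event()

    def send_bytes(self, data):
        ov = ToyOverlapped()
        self._send_ov = ov
        self.started.set()
        try:
            # Imitates WaitForMultipleObjects(): the peer never reads, so
            # only close() can wake this wait up.
            ov.event.wait()
        finally:
            self._send_ov = None
        if ov.err == ERROR_OPERATION_ABORTED:
            # close() was called by another thread while we were waiting.
            raise OSError(errno.EPIPE, "handle is closed")
        return len(data)

    def close(self):
        ov = self._send_ov
        if ov is not None:
            ov.cancel()  # interrupt the blocked send


def demo():
    conn = ToyPipeConnection()
    outcome = []

    def sender():
        try:
            conn.send_bytes(b"x" * 1024)
        except BrokenPipeError as exc:
            outcome.append(exc.errno)

    thread = threading.Thread(target=sender)
    thread.start()
    conn.started.wait(timeout=5)  # make sure the send is pending
    conn.close()                  # called from another thread
    thread.join(timeout=5)
    return outcome, thread.is_alive()
```

Calling demo() shows the sender thread being released by close() and seeing an EPIPE error instead of blocking forever, which is the behaviour the change aims for in the real code.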

vstinner · 2023-09-11T01:33:10Z

Lib/multiprocessing/connection.py

+                # close() was called by another thread while
+                # WaitForMultipleObjects() was waiting for the overlapped
+                # operation.
+                raise OSError(errno.EPIPE, "handle is closed")


I chose to raise a BrokenPipeError exception here, since Queue._feed() has a special code path that silently ignores EPIPE errors:

cpython/Lib/multiprocessing/queues.py

Lines 255 to 257 in 1ec4537

    except Exception as e:
        if ignore_epipe and getattr(e, 'errno', 0) == errno.EPIPE:
            return

And concurrent.futures uses this code path for its "call queue", which is the queue causing trouble here:

cpython/Lib/concurrent/futures/process.py

Lines 724 to 732 in 1ec4537

    self._call_queue = _SafeQueue(
        max_size=queue_size, ctx=self._mp_context,
        pending_work_items=self._pending_work_items,
        shutdown_lock=self._shutdown_lock,
        thread_wakeup=self._executor_manager_thread_wakeup)
    # Killed worker processes can produce spurious "broken pipe"
    # tracebacks in the queue's own worker thread. But we detect killed
    # processes anyway, so silence the tracebacks.
    self._call_queue._ignore_epipe = True

gpshead · 2023-09-12

Sounds like we got lucky that callers were already handling one thing we could raise! :)

vstinner · 2023-09-12

At the beginning, I started by adding a new exception, but I chose to reuse the existing code instead. IMO BrokenPipeError makes perfect sense for a PipeConnection.
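As a quick illustration of why raising OSError(errno.EPIPE, ...) fits the existing code path: Python maps that errno to the BrokenPipeError subclass, and an errno check like the one quoted from Queue._feed() then matches it. The helper name should_ignore below is hypothetical and only mirrors that check; it is not part of multiprocessing.

```python
import errno


def should_ignore(exc, ignore_epipe=True):
    """Mirror of the errno check quoted above (hypothetical helper name)."""
    return ignore_epipe and getattr(exc, "errno", 0) == errno.EPIPE


exc = OSError(errno.EPIPE, "handle is closed")

# OSError selects the BrokenPipeError subclass when errno is EPIPE.
print(type(exc).__name__)            # BrokenPipeError
print(should_ignore(exc))            # True
print(should_ignore(ValueError()))   # False: no errno attribute
```

Because the check relies on the errno attribute rather than the exception type, any OSError carrying EPIPE is silenced, while unrelated exceptions still propagate.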

serhiy-storchaka

LGTM.

But I have one suggestion and one question/suggestion.

serhiy-storchaka · 2023-09-11T07:04:58Z

Lib/multiprocessing/connection.py

                nwritten, err = ov.GetOverlappedResult(True)
+            if err == WSA_OPERATION_ABORTED:


What other value can it be? There is an assert err == 0 below, so I guess that any error was unexpected.

Could we simply check that err is not zero here?

vstinner · 2023-09-11

I chose to write a minimal change: change as little code as possible. I introduced one new error and added a check for this error, and that's all. I don't know the code well enough to answer your question. I'm not a multiprocessing or Windows API expert at all :-(

serhiy-storchaka · 2023-09-11T08:11:04Z

Lib/multiprocessing/connection.py

@@ -41,6 +42,7 @@
 BUFSIZE = 8192
 # A very generous timeout when it comes to local connections...
 CONNECTION_TIMEOUT = 20.
+WSA_OPERATION_ABORTED = 995


It is the same as _winapi.ERROR_OPERATION_ABORTED.

vstinner · 2023-09-12

Now I'm confused. I don't recall which doc I was looking at. WriteFile() is documented to return ERROR_OPERATION_ABORTED when it's canceled: https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile

miss-islington · 2023-09-11T08:11:35Z

Thanks @vstinner for the PR 🌮🎉.. I'm working now to backport this PR to: 3.11, 3.12.
🐍🍒⛏🤖

bedevere-bot · 2023-09-11T08:11:36Z

There's a new commit after the PR has been approved.

@serhiy-storchaka: please review the changes made to this pull request.

bedevere-bot · 2023-09-11T08:11:47Z

GH-109254 is a backport of this pull request to the 3.12 branch.

bedevere-bot · 2023-09-11T08:11:56Z

GH-109255 is a backport of this pull request to the 3.11 branch.

vstinner · 2023-09-11T08:13:50Z

PR merged, thanks for the review @serhiy-storchaka.

I wanted to merge this fix ASAP since it was blocking other PRs from being merged.

serhiy-storchaka

According to the sources of GetOverlappedResult() in _winapi.c, the only possible values of err are ERROR_SUCCESS (0), ERROR_MORE_DATA, ERROR_OPERATION_ABORTED and ERROR_IO_INCOMPLETE.

serhiy-storchaka · 2023-09-11T08:16:35Z

Great work, @vstinner!

vstinner · 2023-09-11T21:05:30Z

> According to the sources of GetOverlappedResult() in _winapi.c, the only value of err can be ERROR_SUCCESS (0), ERROR_MORE_DATA, ERROR_OPERATION_ABORTED, ERROR_IO_INCOMPLETE.

Well, if you're confident, you can modify the assert err == 0 in the code.

By the way, having nwritten, err = ov.GetOverlappedResult(True) in the finally: block sounds wrong to me. What if _winapi.WaitForMultipleObjects() raises an exception? Why is it important to call ov.GetOverlappedResult(True) in this case? But well, since I don't know the code, I prefer to not touch it!
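For readers following the discussion of which err values can occur, the sketch below shows one way the result of a send could be interpreted explicitly instead of relying on assert. It is an illustration under assumptions, not the code that was merged: check_send_result is a hypothetical helper, and the Windows error codes are hard-coded as plain integers so that the example also runs on other platforms (on Windows, _winapi exposes ERROR_OPERATION_ABORTED, as noted in the review).

```python
import errno

# Windows error codes, hard-coded so the sketch also runs off Windows.
ERROR_SUCCESS = 0
ERROR_MORE_DATA = 234
ERROR_OPERATION_ABORTED = 995
ERROR_IO_INCOMPLETE = 996


def check_send_result(nwritten, err, expected):
    """Hypothetical helper: interpret an (nwritten, err) pair after a send."""
    if err == ERROR_OPERATION_ABORTED:
        # The overlapped write was cancelled, e.g. by close() in another
        # thread: report it as a broken pipe so callers can ignore it.
        raise OSError(errno.EPIPE, "handle is closed")
    if err != ERROR_SUCCESS:
        # ERROR_MORE_DATA or ERROR_IO_INCOMPLETE are not expected for a send.
        raise RuntimeError(f"unexpected overlapped error code {err}")
    if nwritten != expected:
        raise RuntimeError(f"short write: {nwritten} of {expected} bytes")
    return nwritten
```

Unlike an assert, these checks stay active when Python runs with -O, at the cost of a slightly longer code path; whether that trade-off is wanted in connection.py is a separate question from this fix.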

> Great work, @vstinner!

Thanks.

vstinner added the "needs backport to 3.11" (only security fixes) and "needs backport to 3.12" (bug and security fixes) labels Sep 10, 2023

bedevere-bot added the awaiting core review label Sep 10, 2023

bedevere-bot mentioned this pull request Sep 10, 2023

test_concurrent_futures.test_deadlock: test_crash_big_data() hangs randomly on Windows #107219

Closed

vstinner commented Sep 11, 2023

serhiy-storchaka self-requested a review September 11, 2023 06:47

serhiy-storchaka approved these changes Sep 11, 2023

bedevere-bot added awaiting merge and removed awaiting core review labels Sep 11, 2023

vstinner added 2 commits September 11, 2023 09:47

Remove PipeConnection.__init__()

069fbfa

Address Serhiy's review.

vstinner force-pushed the cf_termine_broken branch from 9987dc7 to 069fbfa Compare September 11, 2023 07:47

vstinner enabled auto-merge (squash) September 11, 2023 07:48

serhiy-storchaka reviewed Sep 11, 2023

serhiy-storchaka reviewed Sep 11, 2023

vstinner merged commit a9b1f84 into python:main Sep 11, 2023

vstinner deleted the cf_termine_broken branch September 11, 2023 08:11

bedevere-bot removed the awaiting merge label Sep 11, 2023

bedevere-bot added the awaiting core review label Sep 11, 2023

bedevere-bot requested a review from serhiy-storchaka September 11, 2023 08:11

bedevere-bot removed the "needs backport to 3.12" (bug and security fixes) label Sep 11, 2023

bedevere-bot removed the "needs backport to 3.11" (only security fixes) label Sep 11, 2023

serhiy-storchaka reviewed Sep 11, 2023

vstinner mentioned this pull request Sep 11, 2023

gh-109162: libregrtest: move code around #109253

Merged

vstinner mentioned this pull request Sep 22, 2023

test_concurrent_futures.test_shutdown: test_interpreter_shutdown() fails randomly (race condition) #109047

Closed
