Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-107219: Fix concurrent.futures terminate_broken() #109244

Merged
merged 2 commits into from
Sep 11, 2023

Conversation

vstinner
Copy link
Member

@vstinner vstinner commented Sep 10, 2023

Fix a race condition in concurrent.futures. When a process in the process pool was terminated abruptly (while the future was running or pending), close the connection write end. If the call queue is blocked on sending bytes to a worker process, closing the connection write end interrupts the send, so the queue can be closed.

Changes:

  • _ExecutorManagerThread.terminate_broken() now closes call_queue._writer.
  • multiprocessing PipeConnection.close() now interrupts WaitForMultipleObjects() in _send_bytes() by cancelling the overlapped operation.

@vstinner
Copy link
Member Author

@serhiy-storchaka @methane @ambv @gpshead @pitrou: Would you mind to have a look?

I would like to merge this fix as soon as possible since the bug #107219 is affecting very badly the Python workflow. The CI failure rate is very high because of this test_concurrent_futures.test_deadlock hang.

For now, I prefer to use WSA_OPERATION_ABORTED = 995 in Lib/multiprocessing/connection.py to ease backports. Later, I will try to add this constant somewhere :-) My first attempt to add it to the errno module didn't work (I didn't insist, I was working on the fix).

@vstinner
Copy link
Member Author

With this change, I can no longer reproduce bug.

On my Windows VM which has 2 CPUs, I can easily reproduce the hang in around 30 seconds on the Python main branch:

  • Terminal 1: python -m test test_concurrent_futures.test_deadlock -m test.test_concurrent_futures.test_deadlock.ProcessPoolSpawnExecutorDeadlockTest.test_crash_big_data --forever -v --timeout=10
  • Terminal 2: python -m test -j2 -r

I stressed the test with:

  • Terminal 1, terminal 2 and terminal 3 (3 processes):

    • python -m test test_concurrent_futures.test_deadlock -m test.test_concurrent_futures.test_deadlock.ProcessPoolSpawnExecutorDeadlockTest.test_crash_big_data --forever -v --timeout=10
  • Terminal 4: python -m test test_concurrent_futures.test_deadlock -m test.test_concurrent_futures.test_deadlock.ProcessPoolSpawnExecutorDeadlockTest.test_crash_big_data --forever -v --timeout=30 -j2

  • Terminal 5: python -m test -j1 -r -u all

In 8 minutes, I failed to reproduce the bug anymore with this change.

Bonus: Moreover, I can no longer hang the test when I interrupt it with CTRL+C.

@vstinner
Copy link
Member Author

Windows (x64) (pull_request) Successful

Oh! For the first time in like 2 weeks, test_concurrent_futures.test_deadlock did not hang in the GHA Windows x64 job!

Note: There are only these two unrelated failures:

2 re-run tests:
    test.test_asyncio.test_windows_events
    test.test_concurrent_futures.test_as_completed

These 2 tests passed when re-run in verbose mode (Result: FAILURE then SUCCESS).

ov = self._send_ov
if ov is not None:
# Interrupt WaitForMultipleObjects() in _send_bytes()
ov.cancel()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asyncio uses a similar code in ProactorEventLoop:

def _cancel_overlapped(self):
if self._ov is None:
return
try:
self._ov.cancel()
except OSError as exc:
context = {
'message': 'Cancelling an overlapped future failed',
'exception': exc,
'future': self,
}
if self._source_traceback:
context['source_traceback'] = self._source_traceback
self._loop.call_exception_handler(context)
self._ov = None

asyncio uses more advanced code around to handle more cases. For example, in asyncio, the cancel() API is part of the public API.

Here the cancellation is a standard action in the Windows Overlapped API. The cancellation is synchronous, it's easy!

Hopefully, we are not in the very complicated RegisterWaitWithQueue() case! This case requires an asynchronous cancellation which is really complicated to handle: the completion of the cancellation should be awaited!? See this horror story: https://vstinner.github.io/asyncio-proactor-cancellation-from-hell.html

# close() was called by another thread while
# WaitForMultipleObjects() was waiting for the overlapped
# operation.
raise OSError(errno.EPIPE, "handle is closed")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose to raise a BrokenPipeError exception here, since Queue._feed() has a special code path for that to ignore EPIPE errors silently:

except Exception as e:
if ignore_epipe and getattr(e, 'errno', 0) == errno.EPIPE:
return

And concurrent.futures uses this code path for its "call queue" which is causing troubles here:

self._call_queue = _SafeQueue(
max_size=queue_size, ctx=self._mp_context,
pending_work_items=self._pending_work_items,
shutdown_lock=self._shutdown_lock,
thread_wakeup=self._executor_manager_thread_wakeup)
# Killed worker processes can produce spurious "broken pipe"
# tracebacks in the queue's own worker thread. But we detect killed
# processes anyway, so silence the tracebacks.
self._call_queue._ignore_epipe = True

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds like we got lucky that callers were handling one thing we could raise! :)

Copy link
Member Author

@vstinner vstinner Sep 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the beginning, I started by adding a new exception. But I chose to reuse the existing code instead. IMO BrokenPipeError perfectly makes sense for a PipeConnection.

@serhiy-storchaka serhiy-storchaka self-requested a review September 11, 2023 06:47
Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

But I have one suggestion and one question/suggestion.

nwritten, err = ov.GetOverlappedResult(True)
if err == WSA_OPERATION_ABORTED:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What other value can it be? There is assert err == 0 below, so I guess that any error was unexpected.

Could we simply check that err is not zero here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose to write a minimalist change: change at least code as possible. I introduce one new error, I added a check for this error, and that's all. I don't know the code enough to answer to your question. I'm not a multiprocessing or Windows API expert at all :-(

Lib/multiprocessing/connection.py Outdated Show resolved Hide resolved
Fix a race condition in concurrent.futures. When a process in the
process pool was terminated abruptly (while the future was running or
pending), close the connection write end. If the call queue is
blocked on sending bytes to a worker process, closing the connection
write end interrupts the send, so the queue can be closed.

Changes:

* _ExecutorManagerThread.terminate_broken() now closes
  call_queue._writer.
* multiprocessing PipeConnection.close() now interrupts
  WaitForMultipleObjects() in _send_bytes() by cancelling the
  overlapped operation.
Address Serhiy's review.
@vstinner vstinner enabled auto-merge (squash) September 11, 2023 07:48
@@ -41,6 +42,7 @@
BUFSIZE = 8192
# A very generous timeout when it comes to local connections...
CONNECTION_TIMEOUT = 20.
WSA_OPERATION_ABORTED = 995
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the same as _winapi.ERROR_OPERATION_ABORTED.

Copy link
Member Author

@vstinner vstinner Sep 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I'm confused. I don't recall which doc I was looking to. WriteFile() is documented to return ERROR_OPERATION_ABORTED when it's canceled: https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile

@vstinner vstinner merged commit a9b1f84 into python:main Sep 11, 2023
@vstinner vstinner deleted the cf_termine_broken branch September 11, 2023 08:11
@miss-islington
Copy link
Contributor

Thanks @vstinner for the PR 🌮🎉.. I'm working now to backport this PR to: 3.11, 3.12.
🐍🍒⛏🤖

@bedevere-bot
Copy link

There's a new commit after the PR has been approved.

@serhiy-storchaka: please review the changes made to this pull request.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Sep 11, 2023
…109244)

Fix a race condition in concurrent.futures. When a process in the
process pool was terminated abruptly (while the future was running or
pending), close the connection write end. If the call queue is
blocked on sending bytes to a worker process, closing the connection
write end interrupts the send, so the queue can be closed.

Changes:

* _ExecutorManagerThread.terminate_broken() now closes
  call_queue._writer.
* multiprocessing PipeConnection.close() now interrupts
  WaitForMultipleObjects() in _send_bytes() by cancelling the
  overlapped operation.
(cherry picked from commit a9b1f84)

Co-authored-by: Victor Stinner <[email protected]>
@bedevere-bot
Copy link

GH-109254 is a backport of this pull request to the 3.12 branch.

@bedevere-bot bedevere-bot removed the needs backport to 3.12 bug and security fixes label Sep 11, 2023
@bedevere-bot
Copy link

GH-109255 is a backport of this pull request to the 3.11 branch.

@bedevere-bot bedevere-bot removed the needs backport to 3.11 only security fixes label Sep 11, 2023
@vstinner
Copy link
Member Author

PR merged, thanks for the review @serhiy-storchaka.

I wanted to merge this fix ASAP since it prevented to merge others PRs.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Sep 11, 2023
…109244)

Fix a race condition in concurrent.futures. When a process in the
process pool was terminated abruptly (while the future was running or
pending), close the connection write end. If the call queue is
blocked on sending bytes to a worker process, closing the connection
write end interrupts the send, so the queue can be closed.

Changes:

* _ExecutorManagerThread.terminate_broken() now closes
  call_queue._writer.
* multiprocessing PipeConnection.close() now interrupts
  WaitForMultipleObjects() in _send_bytes() by cancelling the
  overlapped operation.
(cherry picked from commit a9b1f84)

Co-authored-by: Victor Stinner <[email protected]>
Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the sources of GetOverlappedResult() in _winapi.c, the only value of err can be ERROR_SUCCESS (0), ERROR_MORE_DATA, ERROR_OPERATION_ABORTED, ERROR_IO_INCOMPLETE.

@serhiy-storchaka
Copy link
Member

Great work, @vstinner!

vstinner added a commit that referenced this pull request Sep 11, 2023
… (#109255)

gh-107219: Fix concurrent.futures terminate_broken() (GH-109244)

Fix a race condition in concurrent.futures. When a process in the
process pool was terminated abruptly (while the future was running or
pending), close the connection write end. If the call queue is
blocked on sending bytes to a worker process, closing the connection
write end interrupts the send, so the queue can be closed.

Changes:

* _ExecutorManagerThread.terminate_broken() now closes
  call_queue._writer.
* multiprocessing PipeConnection.close() now interrupts
  WaitForMultipleObjects() in _send_bytes() by cancelling the
  overlapped operation.
(cherry picked from commit a9b1f84)

Co-authored-by: Victor Stinner <[email protected]>
@vstinner
Copy link
Member Author

According to the sources of GetOverlappedResult() in _winapi.c, the only value of err can be ERROR_SUCCESS (0), ERROR_MORE_DATA, ERROR_OPERATION_ABORTED, ERROR_IO_INCOMPLETE.

Well, if you're confident, you can modify the assert err == 0 in the code.

By the way, having nwritten, err = ov.GetOverlappedResult(True) in the finally: block sounds wrong to me. What if _winapi.WaitForMultipleObjects() raises an exception? Why is it important to call ov.GetOverlappedResult(True) in this case? But well, since I don't know the code, I prefer to not touch it!

Great work, @vstinner!

Thanks.

Yhg1s pushed a commit that referenced this pull request Oct 2, 2023
… (#109254)

gh-107219: Fix concurrent.futures terminate_broken() (GH-109244)

Fix a race condition in concurrent.futures. When a process in the
process pool was terminated abruptly (while the future was running or
pending), close the connection write end. If the call queue is
blocked on sending bytes to a worker process, closing the connection
write end interrupts the send, so the queue can be closed.

Changes:

* _ExecutorManagerThread.terminate_broken() now closes
  call_queue._writer.
* multiprocessing PipeConnection.close() now interrupts
  WaitForMultipleObjects() in _send_bytes() by cancelling the
  overlapped operation.
(cherry picked from commit a9b1f84)

Co-authored-by: Victor Stinner <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants