Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python GIL: Drop token race condition #598

Closed
phil-opp opened this issue Jul 19, 2024 · 2 comments · Fixed by #601
Closed

Python GIL: Drop token race condition #598

phil-opp opened this issue Jul 19, 2024 · 2 comments · Fixed by #601
Labels
bug Something isn't working

Comments

@phil-opp
Copy link
Collaborator

PR #568 drastically reduced some drop token timeouts to work around a "drop token race condition with the GIL". To reproduce the issue, the following example was posted: #568 (comment)

The goal of this issue is to find the root cause of this bug so that we can increase the mentioned timeouts again. The problem with the reduced timeouts is that slower receivers (or receivers with larger queue sizes) might not be able to read the the last outputs of nodes after they exit.

@github-actions github-actions bot added the bug Something isn't working label Jul 19, 2024
@phil-opp
Copy link
Collaborator Author

@haixuanTao I reverted the changes of #568 on my machine and tried the example shown in #568 (comment) and I can reproduce the grace period kills. However, once I increase the grace period to 15 seconds, all nodes stop properly after a few seconds when I do a ctrl-c. Could you try whether this works for you too?

@haixuanTao
Copy link
Collaborator

I see.

So a quick way to probably reproduce the problem, is to try to delete the node before the

git clone git@github.com:dora-rs/dora-benchmark.git --branch simple-drop-token-example
cd dora-rs/py-latency
dora up
dora start dataflow_node.yml
# Check the logs of having a warnings
vim node_2.py
# Go to end of file and try Example 2 and notice no warnings.

This should work with #568 change reverted.

phil-opp added a commit that referenced this issue Jul 21, 2024
When dropping the `DoraNode`, it waits for the remaining drop tokens. This only works if all the dora events were already dropped before. With the Python GC, this is not guaranteed as some events might still be live on the heap (the user might even use them later). In such cases, we waited until we ran into a timeout, which resulted in very long exit times (see #598).

This commit fixes this issue by adding a reference-counted copy of the `DoraNode` and `EventStream` to every event given to Python. This way, we can ensure that the underlying `DoraNode` is only dropped after the last event reference has been freed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants