Python GIL: Drop token race condition #598
@haixuanTao I reverted the changes from #568 on my machine, tried the example shown in #568 (comment), and I can reproduce the grace-period kills. However, once I increase the grace period to 15 seconds, all nodes stop properly within a few seconds when I press Ctrl-C. Could you try whether this works for you too?
I see. So a quick way to probably reproduce the problem is to try to delete the node before the …

```sh
git clone git@github.com:dora-rs/dora-benchmark.git --branch simple-drop-token-example
cd dora-benchmark/py-latency   # the clone above creates a `dora-benchmark` directory
dora up
dora start dataflow_node.yml
# Check the logs for the warnings.
vim node_2.py
# Go to the end of the file and try Example 2; notice that there are no warnings.
# This should work with the #568 change reverted.
```
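The "Example 2" from `node_2.py` is not quoted in the thread. As a rough sketch only, assuming the dora Python API (`dora.Node`, `node.next()`, dict-style events), the pattern that seems to trigger the issue is an event object outliving the node:

```python
# Hypothetical sketch (not the actual node_2.py): keep the last received
# event alive while the node object itself is deleted. Before the
# reference-counting fix, dropping the node had to wait for the drop
# tokens of events that Python's GC had not freed yet, so the dataflow
# only exited after the drop-token timeout / grace period.
from dora import Node

node = Node()

last_event = None
while True:
    event = node.next()        # blocks until the next event arrives
    if event is None:          # stream closed, dataflow is stopping
        break
    if event["type"] == "INPUT":
        last_event = event     # hold a live reference to the event

del node   # drop the node while `last_event` is still referenced
# `last_event` (and its drop token) is only released once the GC frees it,
# which is what previously caused the long exit times described in #598.
```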
When dropping the `DoraNode`, it waits for the remaining drop tokens. This only works if all dora events were already dropped beforehand. With the Python GC, this is not guaranteed, as some events might still be alive on the heap (the user might even use them later). In such cases, we waited until we ran into a timeout, which resulted in very long exit times (see #598). This commit fixes the issue by adding a reference-counted copy of the `DoraNode` and `EventStream` to every event given to Python. This way, we can ensure that the underlying `DoraNode` is only dropped after the last event reference has been freed.
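The actual change lives in dora's Rust/pyo3 bindings (each event stores a reference-counted copy of the node), but the idea can be sketched in plain Python. All names below are illustrative, not part of the real API:

```python
# Conceptual sketch of the reference-counting fix described above.
class NodeHandle:
    """Owns the connection; finalized only when the last reference is gone."""

    def __del__(self):
        # In the real fix, this is the point where the remaining drop tokens
        # are awaited before the node disconnects from the dataflow.
        print("last reference dropped -> wait for drop tokens, close node")


class Event:
    """An event handed to user code; it keeps the node handle alive."""

    def __init__(self, handle, data):
        self._handle = handle  # reference-counted copy of the node handle
        self.data = data


class Node:
    def __init__(self):
        self._handle = NodeHandle()

    def next(self, data):
        # Simplified: the real API pulls the next event from the daemon.
        # Every event given to user code carries a reference to the handle.
        return Event(self._handle, data)


node = Node()
event = node.next("some input")
del node    # the handle survives: `event` still references it
del event   # only now does NodeHandle.__del__ run (CPython refcounting)
```

With this layout, deleting the `Node` early (as in the reproduction above) no longer forces a wait: the drop tokens are handled when the last event reference disappears.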
PR #568 drastically reduced some drop token timeouts to work around a "drop token race condition with the GIL". To reproduce the issue, the following example was posted: #568 (comment)
The goal of this issue is to find the root cause of this bug so that we can increase the mentioned timeouts again. The problem with the reduced timeouts is that slower receivers (or receivers with larger queue sizes) might not be able to read the last outputs of nodes after they exit.