Python GIL: Drop token race condition #598
@haixuanTao I reverted the changes from #568 on my machine, tried the example shown in #568 (comment), and I can reproduce the grace-period kills. However, once I increase the grace period to 15 seconds, all nodes stop properly within a few seconds when I press Ctrl-C. Could you try whether this works for you too?
I see. So a quick way to probably reproduce the problem is to try to delete the node before the …

```sh
git clone git@github.com:dora-rs/dora-benchmark.git --branch simple-drop-token-example
cd dora-benchmark/py-latency   # the clone above creates a `dora-benchmark` directory
dora up
dora start dataflow_node.yml
# Check the logs for the warnings.
vim node_2.py
# Go to the end of the file and try Example 2; notice that there are no warnings.
# This should work with the #568 change reverted.
```
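The "Example 2" from `node_2.py` is not quoted in the thread. As a rough sketch only, assuming the dora Python API (`dora.Node`, `node.next()`, dict-style events), the pattern that seems to trigger the issue is an event object outliving the node:

```python
# Hypothetical sketch (not the actual node_2.py): keep the last received
# event alive while the node object itself is deleted. Before the
# reference-counting fix, dropping the node had to wait for the drop
# tokens of events that Python's GC had not freed yet, so the dataflow
# only exited after the drop-token timeout / grace period.
from dora import Node

node = Node()

last_event = None
while True:
    event = node.next()        # blocks until the next event arrives
    if event is None:          # stream closed, dataflow is stopping
        break
    if event["type"] == "INPUT":
        last_event = event     # hold a live reference to the event

del node   # drop the node while `last_event` is still referenced
# `last_event` (and its drop token) is only released once the GC frees it,
# which is what previously caused the long exit times described in #598.
```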
When dropping the `DoraNode`, it waits for the remaining drop tokens. This only works if all dora events were already dropped beforehand. With the Python GC, this is not guaranteed, as some events might still be alive on the heap (the user might even use them later). In such cases, we waited until we ran into a timeout, which resulted in very long exit times (see #598). This commit fixes the issue by adding a reference-counted copy of the `DoraNode` and `EventStream` to every event given to Python. This way, we can ensure that the underlying `DoraNode` is only dropped after the last event reference has been freed.
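The actual change lives in dora's Rust/pyo3 bindings (each event stores a reference-counted copy of the node), but the idea can be sketched in plain Python. All names below are illustrative, not part of the real API:

```python
# Conceptual sketch of the reference-counting fix described above.
class NodeHandle:
    """Owns the connection; finalized only when the last reference is gone."""

    def __del__(self):
        # In the real fix, this is the point where the remaining drop tokens
        # are awaited before the node disconnects from the dataflow.
        print("last reference dropped -> wait for drop tokens, close node")


class Event:
    """An event handed to user code; it keeps the node handle alive."""

    def __init__(self, handle, data):
        self._handle = handle  # reference-counted copy of the node handle
        self.data = data


class Node:
    def __init__(self):
        self._handle = NodeHandle()

    def next(self, data):
        # Simplified: the real API pulls the next event from the daemon.
        # Every event given to user code carries a reference to the handle.
        return Event(self._handle, data)


node = Node()
event = node.next("some input")
del node    # the handle survives: `event` still references it
del event   # only now does NodeHandle.__del__ run (CPython refcounting)
```

With this layout, deleting the `Node` early (as in the reproduction above) no longer forces a wait: the drop tokens are handled when the last event reference disappears.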
PR #568 drastically reduced some drop token timeouts to work around a "drop token race condition with the GIL". To reproduce the issue, the following example was posted: #568 (comment)
The goal of this issue is to find the root cause of this bug so that we can increase the mentioned timeouts again. The problem with the reduced timeouts is that slower receivers (or receivers with larger queue sizes) might not be able to read the last outputs of nodes after they exit.