
Moving queue size and making node flume queue bigger #724

Merged
20 commits merged into main from move-queue-management-into-node on Dec 6, 2024

Conversation

haixuanTao (Collaborator)

Moving the queue size into the node fixes long-latency issues for Python nodes and makes it possible to set the right queue_size.

I also changed the flume queue length within nodes to make this possible.

This seems to fix some of the graceful-stop issues in Python, but further investigation is required.
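
For context, here is a minimal sketch of what a node-local bounded queue looks like with the flume crate; the `queue_size` value, event type, and surrounding names are placeholders for illustration, not the actual dora code:

```rust
use std::time::Duration;

fn main() {
    // e.g. taken from the node's dataflow configuration
    let queue_size = 100;
    let (tx, rx) = flume::bounded::<String>(queue_size);

    // Producer side: incoming events are pushed into the node-local queue.
    std::thread::spawn(move || {
        for i in 0..10 {
            // `send` blocks once `queue_size` events are pending, so the
            // node itself enforces the configured back-pressure.
            tx.send(format!("event {i}")).expect("receiver dropped");
        }
    });

    // Consumer side: the node drains events at its own pace.
    while let Ok(event) = rx.recv_timeout(Duration::from_millis(100)) {
        println!("processing {event}");
    }
}
```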

haixuanTao (Collaborator, Author)

After double-checking, it seems that we're back to the problem that there is no deadlock only if event["value"] within Python and all its references get deleted first.

This means that the current DelayedCleanup solution within Rust does not work.

I think we should replace our current DelayedCleanup solution with a Python-based solution.

I have opened #726 to temporarily ignore this issue.

I have double-checked that this does not create a memory leak, as DropTokens are properly reported back to the origin nodes, which clean up the shared memory.
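
The idea behind the proposed replacement (each event keeping its node alive until the event itself is dropped) can be sketched with plain `Arc` reference counting in Rust; the actual proposal is a Python-based equivalent, and all names below are hypothetical:

```rust
use std::sync::Arc;

// Hypothetical node handle that owns shared memory / drop-token bookkeeping.
struct NodeInner;

impl Drop for NodeInner {
    fn drop(&mut self) {
        // Cleanup (shared-memory unmapping, drop-token reporting) runs here,
        // and only after every event referencing the node has been dropped.
        println!("node cleaned up last");
    }
}

// Each event holds a strong reference to the node, so the node cannot be
// cleaned up while any event (or its `value`) is still alive.
struct Event {
    _node: Arc<NodeInner>,
    value: Vec<u8>,
}

fn main() {
    let node = Arc::new(NodeInner);
    let event = Event { _node: node.clone(), value: vec![1, 2, 3] };
    drop(node);                      // the node handle goes away first...
    println!("{:?}", event.value);
    drop(event);                     // ...but cleanup only happens here
}
```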

haixuanTao requested a review from phil-opp on December 1, 2024 20:13
Review comments on apis/rust/node/src/event_stream/mod.rs (outdated, resolved)
haixuanTao added a commit that referenced this pull request Dec 5, 2024
This is a follow-up PR to #724. I have found out that the issue is that the reference counting of the dora node does not delay the cleanup, and the node is still cleaned up before the pyarrow array, creating a deadlock.

This PR reduces the timeout before cleaning up drop tokens to make sure that the process does not hang endlessly, creating a cascading effect of nodes not ending.

In the future, I hope that we can create a **Python-based** reference counting between the Python node and its generated events, so that each event holds a reference to the node and gets cleaned up before the node gets cleaned up.
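
As a rough illustration of bounding that cleanup wait (the function, channel type, and the 2-second duration are assumptions for the sketch, not the actual dora code):

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch: wait for outstanding drop tokens, but give up after a
// bounded timeout so the node never hangs forever on shutdown.
fn wait_for_drop_tokens(rx: &flume::Receiver<u64>, mut outstanding: usize) {
    let deadline = Instant::now() + Duration::from_secs(2); // reduced timeout
    while outstanding > 0 {
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            eprintln!("timed out waiting for {outstanding} drop tokens; shutting down anyway");
            break;
        }
        match rx.recv_timeout(remaining) {
            Ok(_token) => outstanding -= 1,
            Err(_) => {
                eprintln!("drop-token channel closed or timed out");
                break;
            }
        }
    }
}
```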
haixuanTao added a commit that referenced this pull request Dec 5, 2024
This PR fixes the issue that dora nodes are not fair between inputs when the frequencies of the inputs differ and the per-input processing time is high. What ends up happening is that one input might be called overwhelmingly often, as its frequency is higher or lower, depending on the queue_size.

This PR fixes the issue by adding a scheduler that always checks that the next input served is the one that has been waiting the longest in the queue, ensuring fairness between inputs.

This PR is a follow-up to #724, which rewrites the queue within the nodes instead of the daemon.
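
A minimal sketch of that scheduling idea follows; every name here is hypothetical and only illustrates "serve the input whose oldest pending event has waited the longest", not the actual dora implementation:

```rust
use std::collections::{HashMap, VecDeque};
use std::time::Instant;

// Each input id keeps its own queue of (arrival time, event); the scheduler
// always serves the input whose oldest pending event has waited the longest.
struct Scheduler<E> {
    queues: HashMap<String, VecDeque<(Instant, E)>>,
}

impl<E> Scheduler<E> {
    fn new() -> Self {
        Self { queues: HashMap::new() }
    }

    fn push(&mut self, input_id: &str, event: E) {
        self.queues
            .entry(input_id.to_string())
            .or_default()
            .push_back((Instant::now(), event));
    }

    /// Pop the event that has been waiting the longest across all inputs.
    fn next(&mut self) -> Option<(String, E)> {
        let input_id = self
            .queues
            .iter()
            .filter_map(|(id, q)| q.front().map(|(t, _)| (*t, id.clone())))
            .min_by_key(|(t, _)| *t)
            .map(|(_, id)| id)?;
        let (_, event) = self.queues.get_mut(&input_id)?.pop_front()?;
        Some((input_id, event))
    }
}
```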
haixuanTao force-pushed the move-queue-management-into-node branch from 4f3c6e4 to 73bd73c on December 5, 2024 13:42
haixuanTao merged commit 8ad81eb into main on Dec 6, 2024
72 of 73 checks passed
haixuanTao deleted the move-queue-management-into-node branch on December 6, 2024 14:32