
Moving queue size and making node flume queue bigger #724

Merged
20 commits merged into main from move-queue-management-into-node on Dec 6, 2024

Conversation

haixuanTao (Collaborator)

Moving the queue size into the node fixes long-latency issues for Python nodes and makes it possible to set the right queue_size.

I also changed the flume queue length within nodes to make this possible.

This seems to fix some of the graceful-stop issues in Python, but further investigation is required.
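
For context, here is a minimal sketch of what a node-local bounded queue looks like with the flume crate; the `queue_size` value, event type, and surrounding names are placeholders for illustration, not the actual dora code:

```rust
use std::time::Duration;

fn main() {
    // e.g. taken from the node's dataflow configuration
    let queue_size = 100;
    let (tx, rx) = flume::bounded::<String>(queue_size);

    // Producer side: incoming events are pushed into the node-local queue.
    std::thread::spawn(move || {
        for i in 0..10 {
            // `send` blocks once `queue_size` events are pending, so the
            // node itself enforces the configured back-pressure.
            tx.send(format!("event {i}")).expect("receiver dropped");
        }
    });

    // Consumer side: the node drains events at its own pace.
    while let Ok(event) = rx.recv_timeout(Duration::from_millis(100)) {
        println!("processing {event}");
    }
}
```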

haixuanTao (Collaborator, Author)

After double-checking, it seems that we're back to the problem that there is no deadlock only if event["value"] within Python and all its references get deleted first.

This means that the current DelayedCleanup solution within Rust does not work.

I think we should replace our current DelayedCleanup solution with a Python-based solution.

I have opened #726 to temporarily ignore this issue.

I have double-checked that this does not create a memory leak, as DropTokens are properly reported back to the origin nodes, which clean up the shared memory.
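
The idea behind the proposed replacement (each event keeping its node alive until the event itself is dropped) can be sketched with plain `Arc` reference counting in Rust; the actual proposal is a Python-based equivalent, and all names below are hypothetical:

```rust
use std::sync::Arc;

// Hypothetical node handle that owns shared memory / drop-token bookkeeping.
struct NodeInner;

impl Drop for NodeInner {
    fn drop(&mut self) {
        // Cleanup (shared-memory unmapping, drop-token reporting) runs here,
        // and only after every event referencing the node has been dropped.
        println!("node cleaned up last");
    }
}

// Each event holds a strong reference to the node, so the node cannot be
// cleaned up while any event (or its `value`) is still alive.
struct Event {
    _node: Arc<NodeInner>,
    value: Vec<u8>,
}

fn main() {
    let node = Arc::new(NodeInner);
    let event = Event { _node: node.clone(), value: vec![1, 2, 3] };
    drop(node);                      // the node handle goes away first...
    println!("{:?}", event.value);
    drop(event);                     // ...but cleanup only happens here
}
```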

haixuanTao requested a review from phil-opp on December 1, 2024 20:13
Review comments on apis/rust/node/src/event_stream/mod.rs (outdated, resolved)
haixuanTao added a commit that referenced this pull request Dec 5, 2024
This is a follow-up PR to #724. I have found out that the issue is that the reference counting of the dora node does not delay the cleanup, and the node is still cleaned up before the pyarrow array, creating a deadlock.

This PR reduces the timeout before cleaning up drop tokens to make sure that the process does not hang endlessly, creating a cascading effect of nodes not ending.

In the future, I hope that we can create a **Python-based** reference counting between the Python node and its generated events, so that each event holds a reference to the node and gets cleaned up before the node gets cleaned up.
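
As a rough illustration of bounding that cleanup wait (the function, channel type, and the 2-second duration are assumptions for the sketch, not the actual dora code):

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch: wait for outstanding drop tokens, but give up after a
// bounded timeout so the node never hangs forever on shutdown.
fn wait_for_drop_tokens(rx: &flume::Receiver<u64>, mut outstanding: usize) {
    let deadline = Instant::now() + Duration::from_secs(2); // reduced timeout
    while outstanding > 0 {
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            eprintln!("timed out waiting for {outstanding} drop tokens; shutting down anyway");
            break;
        }
        match rx.recv_timeout(remaining) {
            Ok(_token) => outstanding -= 1,
            Err(_) => {
                eprintln!("drop-token channel closed or timed out");
                break;
            }
        }
    }
}
```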
haixuanTao added a commit that referenced this pull request Dec 5, 2024
This PR fixes the issue that dora nodes are not fair between inputs when the frequencies of the inputs differ and the per-input processing time is high. What ends up happening is that one input might be called overwhelmingly often, as its frequency is higher or lower, depending on the queue_size.

This PR fixes the issue by adding a scheduler that always checks that the next input served is the one that has been waiting the longest in the queue, ensuring fairness between inputs.

This PR is a follow-up to #724, which rewrites the queue within the nodes instead of the daemon.
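
A minimal sketch of that scheduling idea follows; every name here is hypothetical and only illustrates "serve the input whose oldest pending event has waited the longest", not the actual dora implementation:

```rust
use std::collections::{HashMap, VecDeque};
use std::time::Instant;

// Each input id keeps its own queue of (arrival time, event); the scheduler
// always serves the input whose oldest pending event has waited the longest.
struct Scheduler<E> {
    queues: HashMap<String, VecDeque<(Instant, E)>>,
}

impl<E> Scheduler<E> {
    fn new() -> Self {
        Self { queues: HashMap::new() }
    }

    fn push(&mut self, input_id: &str, event: E) {
        self.queues
            .entry(input_id.to_string())
            .or_default()
            .push_back((Instant::now(), event));
    }

    /// Pop the event that has been waiting the longest across all inputs.
    fn next(&mut self) -> Option<(String, E)> {
        let input_id = self
            .queues
            .iter()
            .filter_map(|(id, q)| q.front().map(|(t, _)| (*t, id.clone())))
            .min_by_key(|(t, _)| *t)
            .map(|(_, id)| id)?;
        let (_, event) = self.queues.get_mut(&input_id)?.pop_front()?;
        Some((input_id, event))
    }
}
```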
haixuanTao force-pushed the move-queue-management-into-node branch from 4f3c6e4 to 73bd73c on December 5, 2024 13:42
haixuanTao merged commit 8ad81eb into main on Dec 6, 2024
72 of 73 checks passed
haixuanTao deleted the move-queue-management-into-node branch on December 6, 2024 14:32