Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redesign: Create a dora-daemon as a communication broker #162

Merged
merged 230 commits into from
Mar 6, 2023

Conversation

phil-opp
Copy link
Collaborator

@phil-opp phil-opp commented Dec 27, 2022

Create a dora-daemon executable to manage communication per machine. The daemon allocates shared memory for messages, which makes it possible to send outputs to local receivers without copying. For remote receivers, the message is copied and sent over the network using some communication layer implementation.

The new process to send outputs looks like this:

image

Since the daemon handles all advanced communication, the nodes no longer need to link the communication layer. This should make building and distribution much simpler since we don't need any C++ dependencies with complex build scripts anymore (e.g. iceoryx).

If the shared memory implementation is fast and flexible enough, we might even be able to unify custom nodes and operators again.

ToDo:

  • Adjust the node APIs for the new design
    • Rust
    • Python
    • C
    • C++
  • Update the examples
  • Update CLI command to work with the new design
    • dora start
    • dora stop
    • dora up
    • dora destroy
  • Adjust the dora-runtime (or remove it?)
  • Integration tests
    • for stopping nodes (tested through the dataflow examples)
    • simulate network disconnects

Performance Improvements:

  • Create performance benchmark (throughput, latency)
  • Use RPC framework instead of custom TCP socket for coordinator <-> daemon <-> node communication
  • Reuse shared memory allocations instead of deallocating them after use
  • Construct metadata directly in node API (instead of daemon)
  • Pass metadata through shared memory
  • Use shared memory for control channels (instead of TCP sockets)
    • Use Event type of raw-sync crate for synchronization

Avoids synchronization issues, e.g. if one node sends outputs before the receiver was launched and registered.
The previous connection might be closed already. It's difficult to check for that without sending any actual data, so we just permit re-registering of daemon connections. Previous connections are closed.
@phil-opp
Copy link
Collaborator Author

phil-opp commented Mar 1, 2023

@haixuanTao I found and fixed a deadlock that happened when the dora node was dropped before the event stream in the Rust node API. Since the other APIs build on top of the Rust API, it affected them too.

phil-opp added 9 commits March 1, 2023 15:19
Ensures that we don't exit while some listener task is still sending out a reply.
Fixes a deadlock issue when the `incoming` channel is closed first. The problem was that the `send_out_buf` was set to `Fuse::terminated` without breaking the loop, so on the next loop iteration both futures were already terminated.
@phil-opp
Copy link
Collaborator Author

phil-opp commented Mar 6, 2023

I discussed this PR with @haixuanTao. We believe that all the critical bugs are fixed now. So we decided to go ahead and merge this PR, once the CI is green. We will handle the other open tasks and the possible performance improvements in follow-up PRs.

@phil-opp phil-opp enabled auto-merge March 6, 2023 11:28
@phil-opp phil-opp disabled auto-merge March 6, 2023 11:50
@phil-opp
Copy link
Collaborator Author

phil-opp commented Mar 6, 2023

Looks like we have some stop issues on Windows now. I'll look into it.

phil-opp added 6 commits March 6, 2023 17:37
We might not receive any `random` input if the startup of the operator is delayed and the source node is already finished.
… subscribe

These two messages can be essential for correctness. For example, a node might not finish properly when an `InputClosed` event is lost. So we need to always send them, even if the target node was not subscribed yet when the event occurred.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants