
Add support for distributed deployments with multiple daemons #256

Merged · 33 commits merged into main from multiple-daemons on Apr 28, 2023

Conversation

phil-opp (Collaborator)

  • Adds support for a new deploy.machine key to the dataflow YAML format (see the sketch after this list).
    • Can be specified for individual nodes or for the whole dataflow.
    • Dataflows can be split across multiple machines/daemons.
    • If not set, the machine ID defaults to the empty string.
  • The dora daemon supports a new --machine-id argument, defaulting to the empty string. Multiple daemons can be connected to a coordinator, as long as they have different machine IDs.
  • The dora coordinator supports a new --port argument to change the port that the coordinator listens on. Defaults to 53290.
  • The dora daemon supports a new --coordinator-addr argument to set the IP and port number of the coordinator. Defaults to 127.0.0.1:53290.
  • Nodes will synchronize their start across machines through new coordinator messages. Similarly, the coordinator will forward outputs and InputClosed events between machines.
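
For illustration, a dataflow split across two machines could look roughly like the sketch below. The node IDs, paths, and machine names are made up, and the exact placement of the deploy key may differ from the final schema:

```yaml
# Illustrative sketch only -- not necessarily the final schema.
deploy:
  machine: machine-a        # default machine for all nodes of this dataflow

nodes:
  - id: camera
    custom:
      source: ./target/release/camera-node
      outputs:
        - image
    # no deploy key -> runs on the dataflow default (machine-a)

  - id: object-detection
    deploy:
      machine: machine-b    # override: run this node on a second daemon
    custom:
      source: ./target/release/detection-node
      inputs:
        image: camera/image
```

Each machine would then run its own dora daemon with a matching --machine-id, all pointing at the same coordinator through --coordinator-addr (which listens on port 53290 unless --port is changed).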

phil-opp requested a review from haixuanTao on April 18, 2023, 17:01
haixuanTao (Collaborator) commented Apr 19, 2023

I think there are a lot of good ideas in this PR, so thanks Philipp!

I think that we may be lacking a file-system management piece of software to manage files between machines, but also changes. I was looking at the implementation of cargo, how it manages cargo.lock, and at git-rs, and I wonder if there is anything we can salvage, or whether we could even use cargo to manage our user-defined nodes.

In the same way that React and Angular reuse npm to manage their user-implemented modules.

But this is a big feature that needs its own PR, and I think we can assume in this PR that the user has made the appropriate changes within their filesystem.

I will maybe test it further later today.

phil-opp (Collaborator, Author)

Yeah, we still need some kind of deploy functionality to get the nodes and operators from the CLI machine to the target machines. We already support URL sources for nodes and operators.

> I think that we may be lacking a file-system management piece of software to manage files between machines, but also changes.

Agreed. There are currently only two options to distribute the node/operator binaries across machines:

  • Copy them manually, or through some custom script.
  • Use URL sources for nodes and operators, e.g. a GitHub release.

I think the second option is already quite convenient for "finished" nodes and operators, but it's cumbersome during development. We should try to make the edit->compile->deploy cycle easier for distributed dataflows too.
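
For reference, a URL source for a node could look roughly like this (the node ID and release URL are made up):

```yaml
nodes:
  - id: object-detection
    custom:
      # Hypothetical GitHub release URL; the binary is downloaded
      # instead of being copied to the machine by hand.
      source: https://github.com/example/detection-node/releases/download/v0.1.0/detection-node
      inputs:
        image: camera/image
```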

Maybe we could make the CLI send the executables via TCP as part of the spawn command? The receiving daemon could store them to the file system and then run them. This would make things much easier for nodes/operators written in Rust, as you could just develop as usual on the CLI machine and the dora start command would take care of everything else. For Python, we probably need some way to copy additional files, and set up a correct environment too...

Integrate `dora-runtime` into `dora-daemon`
Removes the separate `dora-runtime` binary. The runtime can now be started by passing `--run-dora-runtime` to `dora-daemon`. This change makes setup and deployment easier since it removes one executable that needs to be copied across machines.
haixuanTao (Collaborator) left a comment


So, I retested this branch together with the implementation of the log branch, and it is OK for me. An example of the YAML description with local and remote nodes would be appreciated before merging.

heyong4725 (Collaborator)

"Local" and "remote" seem a little confusing to me with regard to distributed deployment, i.e., with respect to which machine is a node local and/or remote? I think we should have an end-to-end dataflow graph describing the distributed deployment at the control plane (CLI / coordinator layer), specifying the pub/sub communication middleware provider (e.g., zenoh, DDS, or SOME/IP) with its corresponding configuration.

phil-opp (Collaborator, Author) commented Apr 26, 2023

As discussed in today's meeting:

  • coordinator should not be involved in data forwarding -> daemons should talk to each other directly instead (via TCP or zenoh)
  • prefix newly introduced YAML keys with e.g. _unstable (see the sketch below) -> consider this PR an unstable feature -> postpone open design questions about the YAML format and the deployment of node executables
  • pass the contents of the dataflow YAML to daemons, instead of the path (discussed with Xavier)
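
With the `_unstable` prefix, the per-node key could end up looking something like the following sketch (the exact key name was not fixed in the meeting, and the machine ID and node layout are illustrative):

```yaml
nodes:
  - id: object-detection
    # key name is only an example of the `_unstable` prefix decided above
    _unstable_deploy:
      machine: machine-b
    custom:
      source: ./target/release/detection-node
      inputs:
        image: camera/image
```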

We still need to pass the path through a new `working_dir` field as we haven't figured out deployment of compiled nodes/operators yet.
The path should also be valid on the receiving node, which might run in a different directory.
…g coordinator

The coordinator is our control plane and should not be involved in data plane operations. This way, the dataflow can continue even if the coordinator fails.
phil-opp (Collaborator, Author)

I implemented the points discussed in the latest meeting and merged the latest changes from main. I think this PR is ready to be merged now. Please let me know if you have any other concerns.

haixuanTao merged commit 5c699cc into main on Apr 28, 2023
haixuanTao deleted the multiple-daemons branch on April 28, 2023, 08:16