
Add support for distributed deployments with multiple daemons #256

Merged · 33 commits merged into main from multiple-daemons on Apr 28, 2023

Conversation

phil-opp (Collaborator)

  • Adds support for a new deploy.machine key to the dataflow YAML format (see the sketch after this list).
    • Can be specified for individual nodes or for the whole dataflow.
    • Dataflows can be split across multiple machines/daemons.
    • If not set, the machine ID defaults to the empty string.
  • The dora daemon supports a new --machine-id argument, defaulting to the empty string. Multiple daemons can be connected to a coordinator, as long as they have different machine IDs.
  • The dora coordinator supports a new --port argument to change the port that the coordinator listens on. Defaults to 53290.
  • The dora daemon supports a new --coordinator-addr argument to set the IP and port number of the coordinator. Defaults to 127.0.0.1:53290.
  • Nodes will synchronize their start across machines through new coordinator messages. Similarly, the coordinator will forward outputs and InputClosed events between machines.
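
For illustration, a dataflow split across two machines could look roughly like the sketch below. The node IDs, paths, and machine names are made up, and the exact placement of the deploy key may differ from the final schema:

```yaml
# Illustrative sketch only -- not necessarily the final schema.
deploy:
  machine: machine-a        # default machine for all nodes of this dataflow

nodes:
  - id: camera
    custom:
      source: ./target/release/camera-node
      outputs:
        - image
    # no deploy key -> runs on the dataflow default (machine-a)

  - id: object-detection
    deploy:
      machine: machine-b    # override: run this node on a second daemon
    custom:
      source: ./target/release/detection-node
      inputs:
        image: camera/image
```

Each machine would then run its own dora daemon with a matching --machine-id, all pointing at the same coordinator through --coordinator-addr (which listens on port 53290 unless --port is changed).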

phil-opp requested a review from haixuanTao on April 18, 2023, 17:01
haixuanTao (Collaborator) commented Apr 19, 2023

I think there are a lot of good ideas in this PR, so thanks Philipp!

I think that we may be lacking a file-system management piece of software to manage files between machines, but also changes. I was looking at the implementation of cargo, how it manages cargo.lock, and at git-rs, and I wonder if there is anything we can salvage, or whether we could even use cargo to manage our user-defined nodes.

In the same way that React and Angular reuse npm to manage their user-implemented modules.

But this is a big feature that needs its own PR, and I think we can assume in this PR that the user has made the appropriate changes within their filesystem.

I will maybe test it further later today.

phil-opp (Collaborator, Author)

Yeah, we still need some kind of deploy functionality to get the nodes and operators from the CLI machine to the target machines. We already support URL sources for nodes and operators.

> I think that we may be lacking a file-system management piece of software to manage files between machines, but also changes.

Agreed. There are currently only two options to distribute the node/operator binaries across machines:

  • Copy them manually, or through some custom script.
  • Use URL sources for nodes and operators, e.g. a GitHub release.

I think the second option is already quite convenient for "finished" nodes and operators, but it's cumbersome during development. We should try to make the edit->compile->deploy cycle easier for distributed dataflows too.
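
For reference, a URL source for a node could look roughly like this (the node ID and release URL are made up):

```yaml
nodes:
  - id: object-detection
    custom:
      # Hypothetical GitHub release URL; the binary is downloaded
      # instead of being copied to the machine by hand.
      source: https://github.com/example/detection-node/releases/download/v0.1.0/detection-node
      inputs:
        image: camera/image
```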

Maybe we could make the CLI send the executables via TCP as part of the spawn command? The receiving daemon could store them to the file system and then run them. This would make things much easier for nodes/operators written in Rust, as you could just develop as usual on the CLI machine and the dora start command would take care of everything else. For Python, we probably need some way to copy additional files, and set up a correct environment too...

Integrate `dora-runtime` into `dora-daemon`
Removes the separate `dora-runtime` binary. The runtime can now be started by passing `--run-dora-runtime` to `dora-daemon`. This change makes setup and deployment easier since it removes one executable that needs to be copied across machines.
haixuanTao (Collaborator) left a comment


So, I retested this branch together with the implementation of the log branch, and it is OK for me. An example of the YAML description with local and remote nodes would be appreciated before merging.

heyong4725 (Collaborator)

"Local" and "remote" seem a little confusing to me with regard to distributed deployment, i.e., with respect to which machine is a node local and/or remote? I think we should have an end-to-end dataflow graph describing the distributed deployment at the control plane (CLI / coordinator layer), specifying the pub/sub communication middleware provider (e.g., zenoh, DDS, or SOME/IP) with its corresponding configuration.

phil-opp (Collaborator, Author) commented Apr 26, 2023

As discussed in today's meeting:

  • coordinator should not be involved in data forwarding -> daemons should talk to each other directly instead (via TCP or zenoh)
  • prefix newly introduced YAML keys with e.g. _unstable (see the sketch below) -> consider this PR an unstable feature -> postpone open design questions about the YAML format and the deployment of node executables
  • pass the contents of the dataflow YAML to daemons, instead of the path (discussed with Xavier)
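
With the `_unstable` prefix, the per-node key could end up looking something like the following sketch (the exact key name was not fixed in the meeting, and the machine ID and node layout are illustrative):

```yaml
nodes:
  - id: object-detection
    # key name is only an example of the `_unstable` prefix decided above
    _unstable_deploy:
      machine: machine-b
    custom:
      source: ./target/release/detection-node
      inputs:
        image: camera/image
```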

We still need to pass the path through a new `working_dir` field as we haven't figured out deployment of compiled nodes/operators yet.
The path should also be valid on the receiving node, which might run in a different directory.
…g coordinator

The coordinator is our control plane and should not be involved in data plane operations. This way, the dataflow can continue even if the coordinator fails.
phil-opp (Collaborator, Author)

I implemented the points discussed in the latest meeting and merged the latest changes from main. I think this PR is ready to be merged now. Please let me know if you have any other concerns.

haixuanTao merged commit 5c699cc into main on Apr 28, 2023
haixuanTao deleted the multiple-daemons branch on April 28, 2023, 08:16