
Tracking issue for distributed dora dataflow #459

Open
2 of 4 tasks
haixuanTao opened this issue Apr 7, 2024 · 17 comments
Labels
bug (Something isn't working) · cli (CLI) · coordinator · daemon

Comments

@haixuanTao
Collaborator

haixuanTao commented Apr 7, 2024

Describe the bug
Dora remote connection has been long awaited. However, we need to fix a couple of things to make it work:

Tracking issue

  • Make hard-coded localhost address configurable in the daemon and the coordinator
  • Make working_dir configurable per machine in multi-daemon setups.

Update

Additional Tracking issue

  • Make CLI coordinator connection configurable.

Currently the CLI can only connect to the local, preconfigured coordinator. We should make this configurable, e.g. (see the sketch after this list):

dora start dataflow.yaml --coordinator-addr IP:PORT

https://github.com/dora-rs/dora/blob/ea47a556a77f6c64f29ab07939d747edc8573671/binaries/cli/src/main.rs#L469-L472

  • Make working_dir configurable per machine in multi-daemon setups.
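
A rough sketch of how such a --coordinator-addr flag could be wired in, assuming a clap-based subcommand; the struct, field names, and fallback helper below are illustrative assumptions, not dora's actual CLI code:

// Hypothetical sketch only: wiring a --coordinator-addr flag into a clap-based
// `start` subcommand. Names and structure are assumptions for illustration.
use std::net::{Ipv4Addr, SocketAddr};
use std::path::PathBuf;

use clap::Parser;

#[derive(Parser)]
struct Start {
    /// Path to the dataflow YAML file.
    dataflow: PathBuf,
    /// Coordinator to connect to; falls back to the local coordinator if omitted.
    #[clap(long)]
    coordinator_addr: Option<SocketAddr>,
}

fn coordinator_addr_or_default(args: &Start, default_port: u16) -> SocketAddr {
    args.coordinator_addr
        .unwrap_or_else(|| (Ipv4Addr::LOCALHOST, default_port).into())
}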
@github-actions github-actions bot added the bug (Something isn't working), coordinator, and daemon labels on Apr 7, 2024
@phil-opp
Collaborator

phil-opp commented Apr 8, 2024

[ ] Make hard-coded localhost address configurable in the daemon and the coordinator

Which hardcoded addresses do you mean? This should already be configurable by passing a --coordinator-addr argument when spawning the daemon. You should also pass a --machine-id in this case.

@haixuanTao
Collaborator Author

Yeah, sorry, I could have been more precise:

  • pub async fn spawn_listener_loop(
        machine_id: String,
        events_tx: flume::Sender<Timestamped<InterDaemonEvent>>,
    ) -> eyre::Result<SocketAddr> {
        let localhost = Ipv4Addr::new(127, 0, 0, 1);
        let socket = match TcpListener::bind((localhost, 0)).await {
            Ok(socket) => socket,
            Err(err) => {
                return Err(eyre::Report::new(err).wrap_err("failed to create local TCP listener"))
            }
        };
        let socket_addr = socket
            .local_addr()
            .wrap_err("failed to get local addr of socket")?;
        tokio::spawn(async move {
            listener_loop(socket, events_tx).await;
            tracing::debug!("inter-daemon listener loop finished for machine `{machine_id}`");
        });
        Ok(socket_addr)
    }

  • pub async fn create_listener(port: u16) -> eyre::Result<TcpListener> {
        let localhost = Ipv4Addr::new(127, 0, 0, 1);
        let socket = match TcpListener::bind((localhost, port)).await {
            Ok(socket) => socket,
            Err(err) => {
                return Err(eyre::Report::new(err).wrap_err("failed to create local TCP listener"))
            }
        };
        Ok(socket)
    }

@phil-opp
Collaborator

phil-opp commented Apr 8, 2024

Ah yes, we should use 0.0.0.0 in the coordinator to listen for incoming connections on all interfaces. For the daemon, the 127.0.0.1 should be fine since only local nodes should initiate connections to the daemon.
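For reference, a minimal sketch of that change, assuming create_listener from the snippet above is the relevant entry point (Ipv4Addr::UNSPECIFIED is 0.0.0.0):

// Sketch: bind the coordinator listener on all interfaces instead of loopback only.
use std::net::Ipv4Addr;

use eyre::WrapErr;
use tokio::net::TcpListener;

pub async fn create_listener(port: u16) -> eyre::Result<TcpListener> {
    let all_interfaces = Ipv4Addr::UNSPECIFIED; // 0.0.0.0 instead of 127.0.0.1
    TcpListener::bind((all_interfaces, port))
        .await
        .wrap_err("failed to create TCP listener on 0.0.0.0")
}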

@haixuanTao
Collaborator Author

I have only looked at this quickly, but isn't this:

pub async fn spawn_listener_loop(
    machine_id: String,
    events_tx: flume::Sender<Timestamped<InterDaemonEvent>>,
) -> eyre::Result<SocketAddr> {
    let localhost = Ipv4Addr::new(127, 0, 0, 1);
    let socket = match TcpListener::bind((localhost, 0)).await {
        Ok(socket) => socket,
        Err(err) => {
            return Err(eyre::Report::new(err).wrap_err("failed to create local TCP listener"))
        }
    };
    let socket_addr = socket
        .local_addr()
        .wrap_err("failed to get local addr of socket")?;
    tokio::spawn(async move {
        listener_loop(socket, events_tx).await;
        tracing::debug!("inter-daemon listener loop finished for machine `{machine_id}`");
    });
    Ok(socket_addr)
}

for inter_daemon connection?

@phil-opp
Collaborator

phil-opp commented Apr 8, 2024

Ah yes, then we should listen on 0.0.0.0 in the daemon as well.

@Michael-J-Ward
Contributor

Make working_dir configurable per machine in multi-daemon setups.

Should dora attempt to create the working_dir if it doesn't exist or fail out with an error?

Ah yes, then we should listen on 0.0.0.0 in the daemon as well.

Should 0.0.0.0 be the default, or should the bind address not be configurable at all? (I don't have experience networking robots, but I'd assume connecting them through a VPN would be common, e.g. cargo run -- --bind <vpn-interface-ip>:<port>.)

@haixuanTao
Collaborator Author

Make working_dir configurable per machine in multi-daemon setups.

Should dora attempt to create the working_dir if it doesn't exist or fail out with an error?

So there are a couple of things to consider: #256 (comment)

As a first iteration, I think we could expect the working_dir to already exist with the right files in it, and error out on missing files. We can handle creating missing folders in a follow-up PR.

For more, see: #256 (comment)
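
For what it's worth, a minimal sketch of that first iteration; the helper name, file list, and error wording are assumptions, not existing dora code:

// Hypothetical sketch of the "expect it to exist, error on missing files" behaviour
// for working_dir. Names and messages are illustrative only.
use std::path::Path;

use eyre::{bail, Result};

fn check_working_dir(working_dir: &Path, required_files: &[&str]) -> Result<()> {
    if !working_dir.is_dir() {
        bail!("working_dir `{}` does not exist on this machine", working_dir.display());
    }
    for file in required_files {
        if !working_dir.join(file).exists() {
            bail!("missing `{file}` in working_dir `{}`", working_dir.display());
        }
    }
    Ok(())
}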

Ah yes, then we should listen on 0.0.0.0 in the daemon as well.

Should 0.0.0.0 be the default, or should the bind address not be configurable at all? (I don't have experience networking robots, but I'd assume connecting them through a VPN would be common, e.g. cargo run -- --bind <vpn-interface-ip>:<port>.)

It does seem like being configurable would be a plus. I think that by revisiting the port argument, it should be fairly straightforward to make the address configurable.
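
To sketch what that could look like (the function signature and the --bind flag it would come from are assumptions, not existing dora code), the daemon could take an optional bind address and fall back to 0.0.0.0:

// Sketch: make the inter-daemon bind address configurable instead of hard-coding
// 127.0.0.1. The `bind_addr` parameter is an assumption for illustration.
use std::net::{IpAddr, Ipv4Addr, SocketAddr};

use eyre::WrapErr;
use tokio::net::TcpListener;

pub async fn bind_inter_daemon_listener(
    bind_addr: Option<IpAddr>,
    port: u16,
) -> eyre::Result<(TcpListener, SocketAddr)> {
    // Default to all interfaces; passing a VPN interface IP restricts who can connect.
    let bind_addr = bind_addr.unwrap_or(IpAddr::V4(Ipv4Addr::UNSPECIFIED));
    let socket = TcpListener::bind((bind_addr, port))
        .await
        .wrap_err("failed to create inter-daemon TCP listener")?;
    let socket_addr = socket
        .local_addr()
        .wrap_err("failed to get local addr of socket")?;
    Ok((socket, socket_addr))
}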

@phil-opp
Collaborator

As far as I know, 0.0.0.0 acts as a kind of wildcard when listening. So it should listen for incoming connections on all available interfaces.

@Michael-J-Ward
Contributor

As far as I know, 0.0.0.0 acts as a kind of wildcard when listening. So it should listen for incoming connections on all available interfaces.

Correct: binding to 0.0.0.0 will listen on all interfaces available to the process.

Binding to <vpn-net-ip> commonly restricts connections to only allow other machines on the VPN.

@phil-opp
Collaborator

Ah, that makes sense, thanks for clarifying! I agree that making this configurable is a good idea then.

Michael-J-Ward added a commit to Michael-J-Ward/dora that referenced this issue Apr 12, 2024
Michael-J-Ward added a commit to Michael-J-Ward/dora that referenced this issue Apr 12, 2024
Michael-J-Ward added a commit to Michael-J-Ward/dora that referenced this issue Apr 16, 2024
Michael-J-Ward added a commit to Michael-J-Ward/dora that referenced this issue Apr 16, 2024
@Michael-J-Ward
Contributor

Many of the CLI commands end up using the connect_to_coordinator function.

Should all of these be made configurable as well?

fn connect_to_coordinator() -> std::io::Result<Box<TcpRequestReplyConnection>> {
    TcpLayer::new().connect(control_socket_addr())
}

@Michael-J-Ward
Contributor

Michael-J-Ward commented Apr 17, 2024

Reupping the above after taking a closer look at the daemon CLI command, which explicitly ignores the coordinator addr when starting a dataflow and advises:

Please use the start command for remote coordinator

But start uses the default connect_to_coordinator from above, not using any passed arguments.

match run_dataflow {
    Some(dataflow_path) => {
        tracing::info!("Starting dataflow `{}`", dataflow_path.display());
        if let Some(coordinator_addr) = coordinator_addr {
            tracing::info!(
                "Not using coordinator addr {} as `run_dataflow` is for local dataflow only. Please use the `start` command for remote coordinator",
                coordinator_addr
            );
        }
        Daemon::run_dataflow(&dataflow_path).await
    }
    None => {
        let addr = coordinator_addr.unwrap_or_else(|| {
            tracing::info!("Starting in local mode");
            let localhost = Ipv4Addr::new(127, 0, 0, 1);
            (localhost, DORA_COORDINATOR_PORT_DEFAULT).into()
        });
        Daemon::run(addr, machine_id.unwrap_or_default()).await
    }
}
})
.context("failed to run dora-daemon")?

I'm asking for clarity because I know dora is going through rapid iteration, so I'm uncertain what the current "expected" user routes are.

@haixuanTao
Collaborator Author

I understand, just want to say thanks for all the effort you put in debugging this.

Truth be told, the existing daemon <-> coordinator <-> daemon setup was merged as an experimental block (#256) to make sure that what we build would essentially support distributed communication, but it never really got stabilized. This is why you may still see scaffolding in the current codebase.

We really want to get the distributed dataflow working but we have been very busy with both LLMs and ROS2-bridges.

The daemon run feature is mostly scaffolding for running dataflows in CI/CD; we don't really want to expose it as is, and it may also be reworked.

The exact CLI pattern for connecting to the coordinator and the daemon in a distributed manner is, IMO, not yet settled, but I'm fully open to drafting a first implementation and then adjusting based on the issues we run into.

How does that sound?

@phil-opp
Collaborator

Yes, thanks a lot for all your work on this!

The correct solution for the CLI would be to add a coordinator_addr argument to the connect_to_coordinator function. We would also need some way to set this argument when running a dora CLI command. I think we can start with a new command-line argument for this and look for more convenient ways in the future. What do you think?
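
A sketch of that suggestion, reusing the types from the snippet above and assuming TcpLayer::connect accepts a SocketAddr and control_socket_addr() returns the local default; the Option-based fallback is illustrative:

// Sketch of the suggested change: let connect_to_coordinator take an explicit
// address instead of always using the preconfigured local one.
use std::net::SocketAddr;

fn connect_to_coordinator(
    coordinator_addr: Option<SocketAddr>,
) -> std::io::Result<Box<TcpRequestReplyConnection>> {
    // Fall back to the preconfigured local coordinator when no address is given,
    // e.g. when a new --coordinator-addr CLI flag is not passed.
    let addr = coordinator_addr.unwrap_or_else(control_socket_addr);
    TcpLayer::new().connect(addr)
}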

@bobd988
Contributor

bobd988 commented May 16, 2024

I received feedback from two serious customers that Zenoh support across networks is a key differentiating feature.

@haixuanTao
Collaborator Author

Then we can prioritise it after #478, #497, and #495.

@phil-opp
Collaborator

phil-opp commented Jun 5, 2024

I opened an issue with the next required steps for distributed dataflows: #535. Discussion and pull requests are welcome!
