Simplifying dora: Removing the background process and making dora a single process #691
Unfortunately we need the coordinator for deploying dora to multiple machines. It is responsible for connecting daemons to each other and for handling CLI commands. How would you do these things without the coordinator?
But how do the daemons find each other? We could use multicast messages on local networks or use an existing communication layer such as zenoh.
The reason we're using machine IDs instead of IPs is that the IPs might change or even be dynamic. If you want to commit your dataflow.yml to git, you need some kind of abstraction to hide the actual IP addresses. Even in a local network with fixed IPs, the actual prefix will probably differ between environments.
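For context, this is roughly what the abstraction looks like in a committed dataflow.yml, reusing the _unstable_deploy / machine keys that appear later in this thread. The machine name is only a logical ID (the value here is illustrative); it resolves to an actual address because some daemon registers with the coordinator under that ID:

nodes:
  - id: rust-sink
    _unstable_deploy:
      machine: robot-1   # logical machine ID, no IP committed to git
    path: ../../target/debug/multiple-daemons-example-sink
    inputs:
      message: runtime-node/rust-operator/status

The same file can then be shared across environments unchanged, as long as a daemon on each target machine registers under the expected ID.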
We're already doing that, aren't we? We can of course lower the default logging level to also print INFO and DEBUG messages instead of only warnings and errors.
So you want a separate, single-process daemon?
I really think that we should let users state explicitly which daemon address they want to connect to, and that we should not abstract the networking layer, as it is out of our scope. There are many things out of our control:
This is extremely tiring and inefficient.
I would largely prefer using environment variables to hide IPs, as is done most of the time for secrets in Docker, Kubernetes, GitHub Actions, package managers, ...
No, currently the only errors that are reported are fatal errors from the daemon, but not from the actual nodes. And sporadically, errors are not reported at all.
I think that with the current setup, distributed deployment is just impossible to get right, and I can only encourage you to try it for yourself: it's going to be an uphill battle of fixing issues instead of working on meaningful features... What I think we should do is:

nodes:
  - id: rust-node
    build: cargo build -p multiple-daemons-example-node
    path: ../../target/debug/multiple-daemons-example-node
    inputs:
      tick: dora/timer/millis/10
    outputs:
      - random
  - id: runtime-node
    _unstable_deploy:
      machine: 10.14.0.2
    operators:
      - id: rust-operator
        build: cargo build -p multiple-daemons-example-operator
        shared-library: ../../target/debug/multiple_daemons_example_operator
        inputs:
          tick: dora/timer/millis/100
          random: rust-node/random
        outputs:
          - status
  - id: rust-sink
    _unstable_deploy:
      machine: 10.14.0.1
    build: cargo build -p multiple-daemons-example-sink
    path: ../../target/debug/multiple-daemons-example-sink
    inputs:
      message: runtime-node/rust-operator/status

# machine 10.14.0.1
dora daemon --address 0.0.0.0
# > Stdout of this machine here

# machine 10.14.0.2
dora daemon --address 0.0.0.0
# > Stdout of this machine here

# machine 10.14.0.3
dora daemon start dataflow.yaml
# > Stdout of this machine here

And no other process, and the inter-daemon connections happen on daemon start. We can then put ...
I totally agree with those ideas.
For multiple dataflows, I would be more in favor of having multiple daemons. I think it would make things simpler. It would also avoid conflicts between multiple dataflows. Imagine building one dataflow and breaking another one running in parallel. The only machine that would run multiple dataflows is a cloud machine, and I think that it would be better if each dataflow had a separate daemon with its own address.
I fully agree with you that the current setup for multi-machine deployments is cumbersome to use and very difficult to get right. It is just a first prototype without any convenience features. You mention multiple things, so let me try to reply to them one by one:
Yes, exactly, the hard part (independently from dora) is the inter-daemon connection. As mentioned above, I genuinely don't think that we should abstract away the network stack, in the near future at least. Note that tunneling and custom routing would have to happen on every inter-daemon connection and also be resilient to disconnections. If we try to use something like ssh, it is extremely hard to keep the connection up all the time. It really sounds like having some uncommitted env file, and/or having to commit IP addresses that need to be somehow protected, is easier to deal with than trying to abstract away the network layer. We tried doing ssh tunneling during the GOSIM hackathon, and this is way too hard for the common roboticist who just wants to connect their model from the cloud or LAN to their robot.
Basically my problem is that we have to run some imperative command:
To be able to connect to the coordinator and run a process. This step means that you need to have some way to either connect to the computer (ssh, ...) or restart this step on launch (systemctl, ...), and modify either machine-id or coordinator_addr by HAND if you want to change the coordinator or the name. This step is really not intuitive to me and I don't see how we can scale this. Having a simple dora daemon --address would be a lot easier. Yes, IP addresses are dynamic, but there are plenty of tools to fix them, and I would rather have people use DNS/NAT to fix this issue than have them ssh into the robot computer.
This genuinely takes an hour to do on 10 robots, where it could have been instant with IP addresses, and it needs to be done on every start. It's also super easy to get wrong, and super annoying to deal with ssh passwords...
More like:

_unstable_deploy:
  machine: $IP_MACHINE_1
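As a concrete sketch of that idea, assuming dora were taught to expand environment variables while parsing the dataflow file (it does not do this today, and the variable name is made up):

nodes:
  - id: rust-sink
    _unstable_deploy:
      machine: $IP_MACHINE_1   # e.g. IP_MACHINE_1=10.14.0.1 exported before dora start
    path: ../../target/debug/multiple-daemons-example-sink
    inputs:
      message: runtime-node/rust-operator/status

The actual values would then live in an uncommitted .env file or in the deployment tooling, the same way secrets are handled in Docker, Kubernetes, or CI.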
I truly believe that dora should be ssh-free for the most part, otherwise the barrier to entry is going to be very high.
But how does this avoid tunneling/custom routing? If the machine has no public IP, how can you reach it?
Thanks for clarifying your use case. I understand that you're using a robot that you want to reboot repeatedly and you want to avoid doing manual work on every reboot, right? I don't think that we have to change the whole design of dora for this. It would probably be enough to have some kind of "remote configuration" feature. For example, something like this:
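Purely as a hypothetical illustration (none of these keys exist in dora today, and the values are made up), such a remote configuration fetched by the daemon at boot could look like:

machine-id: robot-1
coordinator-addr: 10.14.0.5

The robot would then only need to know where to fetch this file from, and everything else could be changed centrally without ssh-ing into it after every reboot.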
This way, you could add the required settings once, without having to reconfigure the robot by hand on every reboot.
I think the part that confused me was that it needs to be "done on every start". That's because you want to completely reboot your robot in between runs, am I understanding this right? Because for normal systems, you could just leave the coordinator and daemons running and reuse them for the next run.
Another possible alternative: use zenoh for the daemon<->coordinator connection and rely on multicast messages for discovery. This would allow the daemon to send some kind of register message to the whole local network, which the coordinator could listen for. Then the coordinator could assign a machine ID to the daemon. This way, you would not need to specify the machine ID manually. For multi-network deployments (e.g. cloud), you would still need to define some zenoh router when starting the daemons.
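A rough sketch of the handshake this would imply, with entirely hypothetical message names and fields (nothing like this exists in dora today):

# the daemon multicasts a register message on the local network
register:
  hostname: robot-1.local     # made-up values
  daemon-addr: 10.14.0.2
# the coordinator listens for it and replies with an assigned identity
assigned:
  machine-id: robot-1

For machines outside the local network, the daemon would instead connect through the explicitly configured zenoh router mentioned above.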
The idea is that we don't want to make an abstraction layer that connects the daemons and exposes something that might not work. If the IP is something like 127.0.0.1, it is explicit that this is not going to work. The thing is that we have to let the user figure out how they are going to route the IPs, and not:
I'm sorry, but there are already so many steps, and we want to add an additional four. So the workflow is going to be:
It is simply impossible for me to see this being consistently reliable, while we could have just:
And this does not even resolve the problem that we're hiding the risk of a daemon not connecting.
This sounds really complicated, while most of the time you can easily find the IP address of the robot you want to connect to. I genuinely don't think that finding an IP address is hard, compared to setting up a whole zenoh cluster.
I mean, we have two well-identified issues:
Both have been open for close to 2 months. I really think that there is a limit to the complexity we can handle, and I don't see in our discussion how we can make it work.
But how would that work in detail? The daemons don't know the IP addresses of each other. The only way I see is that we use the IPs specified in the dataflow.yml.
This sounds like you want to remove both the coordinator and the CLI? And that the daemon should take over their tasks? I fear that such a drastic change of the design would result in a lot of additional work. I think that there are faster and easier ways to solve the mentioned issues.
I'm not sure how these issues are related? I'm aware that we have many, many things on our plate. That's why I think that we don't have the capacity to rearchitect dora completely. Redesigning a daemon communication mechanism that works in a distributed way without a coordinator sounds like a lot of work and like an additional source of complexity. A centralized coordinator that has full control over all the daemons makes the design much less complex, in my opinion.
Let's maybe take a step back. One of our initial design goals was that the dataflows could be controlled from computers that are not part of the dataflow. For example, that you could use the dora CLI on your laptop to control a dataflow that is running on some cloud machines. To be able to support this use case, we need some entity that the CLI can connect to. That was the motivation for creating the dora coordinator.

Assuming a simple network, all the nodes could directly communicate with each other using TCP messages or shared memory. This doesn't require a daemon, but the daemon makes things easier for the nodes. Without it, each node would need to be aware of the whole dataflow topology and maintain its own connections to other nodes. The difficult part is to create all of these network connections, especially if the network topology is more complex. It doesn't really matter which entity creates these connections. So I'm not sure how removing the coordinator would simplify things.

We can of course simplify things if we require simple network topologies. If we assume that the CLI always runs on the same machine as the coordinator and one of the daemons, and that the remote daemons are in the same network and reachable by everyone through the same IP, we can of course remove a lot of complexity. But we also lose functionality and are no longer able to support certain use cases.
I feel like there are a lot of valid points and important things in this issue thread, but it's becoming difficult to follow. I think it would be a good idea to first collect the different problems, pain points, and use cases we want to improve. Ideally, we split them into separate discussions. It's probably a good idea to avoid suggesting specific changes in the initial post. Then we can propose potential solutions as comments and discuss them. I think that we would achieve a more productive discussion this way. Edit: I created some discussions for the problems and usability issues mentioned in this thread:
I also added proposals for solutions to each discussion. Of course, feel free to add alternative proposals!
The coordinator and the daemon running as background processes are extremely difficult to maintain and create a lot of hanging issues, while bringing hard-to-quantify value, as we're almost always running a single dataflow.
They do not play well with systemctl, and therefore dora is nearly impossible to start on boot. Having a background process also makes it nearly impossible to keep up with environment variables, which can change a lot in a distributed setup.
I think the only moment we need a background process is for a remote daemon, which needs to be able to connect to other daemons when spawning a dataflow.
What needs to change
dora daemon --start-dataflow becomes the default dora start behaviour and makes the dora-daemon the default process that is started when running the CLI. This will remove a layer of complexity of having the daemon running in the background, and dora can then be started on boot via systemctl.
What this will bring
This is going to make it a lot easier to embed dora in other applications such as Python and Rust, and it makes the daemon more easily configurable as a single web server.
Changelog
This should not be a breaking change, except for the fact that we will not use dora up anymore.