Simplifying dora: Removing the background processes and making dora a single process #691

haixuanTao opened this issue Oct 20, 2024 · 16 comments

@haixuanTao
Collaborator

haixuanTao commented Oct 20, 2024

This is a learning from our GOSIM Hackathon.

Running the coordinator and the daemon as background processes is extremely difficult to maintain and creates a lot of hanging issues, while bringing hard-to-quantify value since we're almost always running a single dataflow.

  • It is nearly impossible to keep multiple daemons, a coordinator, and a CLI in sync, as any error can potentially throw all of them into undefined behaviour.
  • Keeping multiple processes in sync is nearly impossible with OS service managers like systemctl, and therefore dora is nearly impossible to start on boot.
  • It makes dora extremely complex, as every message needs to be shared across processes.

Having background processes makes it nearly impossible to keep up with environment variables, which can change a lot in a distributed setup.

I think the only time we need a background process is for remote daemons, which need to be able to connect to other daemons when spawning a dataflow.

What needs to change

  • Remove the dora coordinator. The coordinator is still the source of many issues, while most of our use cases are local single dataflows, making it very difficult to justify using it. We have had 15 issues with the coordinator so far (coordinator).
  • Refactor our CLI so that dora daemon --start-dataflow becomes the default dora start behaviour, making the dora-daemon the default process started when running the CLI. This removes a layer of complexity by not having the daemon running in the background (see the command sketch after this list).
  • Make the dora-daemon print to stdout what is happening to the dataflow, as it is simply too obscure at the moment; things like hardware issues and network connections would be a lot easier to track on a single process's stdout. We have refactored our logging mechanism a couple of times, but it is still too difficult for beginners who might not be familiar with terminal commands. This will also make it simpler to integrate with things like systemctl.
  • Make daemons connect to other daemons when a dataflow is spawned, and address them by IP instead of machine ID, as it's more transparent to the user and makes connection issues easier to understand.
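A rough sketch of the CLI change proposed here (dora up and dora start are existing commands; the single-process behaviour shown for dora start is the proposal, not what the CLI does today):

# current workflow: background processes first
dora up                    # spawns coordinator + daemon in the background
dora start dataflow.yaml   # CLI hands the dataflow to the coordinator

# proposed workflow: one process
dora start dataflow.yaml   # runs the daemon in-process, spawns the nodes,
                           # and streams their output until the dataflow exits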

What this will bring

This is going to make it a lot easier to embed dora in other applications such as Python and Rust, as well as make the daemon more easily configurable as a single web server.

Changelog

This should not be a breaking change, except for the fact that we will not use dora up anymore.

@phil-opp
Collaborator

  • Remove the dora coordinator. The coordinator is still the source of many issues, while most of our use cases are local single dataflows, making it very difficult to justify using it. We have had 15 issues with the coordinator so far (coordinator).

Unfortunately we need the coordinator for deploying dora to multiple machines. It is responsible for connecting daemons to each other and for handling CLI commands. How would you do these things without the coordinator?

Make daemons connect to other daemons when a dataflow is spawned

But how are the daemons finding each other? We could use some multicast messages for local networks or use some existing communication layer such as zenoh.

address them by IP instead of machine ID, as it's more transparent to the user and makes connection issues easier to understand

The reason we're using machine IDs instead of IPs is that the IPs might change or even be dynamic. If you want to commit your dataflow.yml to git, you need some kind of abstraction to hide the actual IP addresses. Even in a local network with fixed IPs, the actual prefix will probably differ between environments.

  • Make the dora-daemon print to stdout what is happening to the dataflow, as it is simply too obscure at the moment; things like hardware issues and network connections would be a lot easier to track on a single process's stdout. We have refactored our logging mechanism a couple of times, but it is still too difficult for beginners who might not be familiar with terminal commands. This will also make it simpler to integrate with things like systemctl.

We're already doing that, aren't we? We can of course lower the default logging level to also print INFO and DEBUG messages instead of only warnings and errors.

  • Refactor our CLI so that dora daemon --start-dataflow becomes the default dora start behaviour, making the dora-daemon the default process started when running the CLI. This removes a layer of complexity by not having the daemon running in the background.

So you want a separate, single-process dora command that does not require launching any additional executables? I'm fine with adding such a command in addition, but I don't think that we should change our core architecture. After all, distributed deployment is an explicit goal of dora, so this should still be possible.

@haixuanTao
Collaborator Author

haixuanTao commented Oct 29, 2024

But how are the daemons finding each other? We could use some multicast messages for local networks or use some existing communication layer such as zenoh.

I really think that we should let users be explicit about which daemon address they want to connect to, and that we should not abstract the networking layer, as it is out of our scope. There are many things out of our control:

  • It's hard to do. If we have 2 daemons in different networks (LAN and Internet), there is simply no simple way for dora to make sure those two can communicate. We had the problem at the GOSIM Hackathon that we could not get an external IP, and so we could not connect a cloud machine with our local machine.

  • It's also hard to debug those issues: by abstracting away the networking part, if there is a networking issue on any of the connections CLI <-> Coordinator <-> Daemon <-> Daemon, there is no simple way to debug it, and we might not be able to cover all error cases.

  • It's hard to automate. I dare you to try to connect 10 local machines at the same time. Meaning:

    • SSH to each individual machine
    • run the dora daemon connection line, explicitly giving the individual machine name, knowing that this might not be a typical Linux machine
    • run the dora coordinator (make sure only 1 is running, as multiple will make it impossible to recover from connecting)
    • run the dora CLI
    • and destroy everything and restart if there is any failure.

This is extremely tiring and inefficient.

The reason we're using machine IDs instead of IPs is that the IPs might change or even be dynamic. If you want to commit your dataflow.yml to git, you need some kind of abstraction to hide the actual IP addresses. Even in a local network with fixed IPs, the actual prefix will probably differ between environments.

I would largely prefer using environment variables to hide IPs, as is most of the time done for secrets in Docker, Kubernetes, GitHub Actions, package managers, ...
As you rightly point out, the IP might change, and the prefix might not be the same depending on the environment. Abstracting away the IP address means that we need to make sure that ANY daemon can connect to the daemon in question based only on an ID, which can hide a registered IP that might be local. I can only encourage you to try it between a local daemon and a remote daemon.

We're already doing that, aren't we? We can of course lower the default logging level to also print INFO and DEBUG messages instead of only warnings and errors.

No, currently the only errors that are reported are fatal errors from the daemon, but not from the actual nodes. And sporadically, errors are not reported at all.
But it would be better to just let the stdout of each node go to the dora start stdout. There is too much stdout that is extremely important within the Python and robotics ecosystem (connection errors, model downloads, ..., hardware issues, ...) that we can't expect to catch in dora.

So you want a separate, single-process dora command that does not require launching any additional executables? I'm fine with adding such a command in addition, but I don't think that we should change our core architecture. After all, distributed deployment is an explicit goal of dora, so this should still be possible.

I think that distributed deployment is just impossible with the current setup, and I can only encourage you to try it for yourself; it's just going to be an uphill battle of fixing issues instead of working on meaningful features...

What I think we should do is:

nodes:
  - id: rust-node
    build: cargo build -p multiple-daemons-example-node
    path: ../../target/debug/multiple-daemons-example-node
    inputs:
      tick: dora/timer/millis/10
    outputs:
      - random

  - id: runtime-node
    _unstable_deploy:
      machine: 10.14.0.2
    operators:
      - id: rust-operator
        build: cargo build -p multiple-daemons-example-operator
        shared-library: ../../target/debug/multiple_daemons_example_operator
        inputs:
          tick: dora/timer/millis/100
          random: rust-node/random
        outputs:
          - status

  - id: rust-sink
    _unstable_deploy:
      machine: 10.14.0.1
    build: cargo build -p multiple-daemons-example-sink
    path: ../../target/debug/multiple-daemons-example-sink
    inputs:
      message: runtime-node/rust-operator/status
# machine 10.14.0.1
dora daemon --address 0.0.0.0
# > Stdout of this machine here

# machine 10.14.0.2
dora daemon --address 0.0.0.0
# > Stdout of this machine here

# machine 10.14.0.3
dora daemon start dataflow.yaml 
# > Stdout of this machine here

And no other processes; the inter-daemon connection happens on daemon start.

We can then put dora daemon under systemctl or the OS-specific service manager to have it spawn automatically on boot and restart on failure.
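As a minimal sketch of what that could look like with systemd (the unit name and binary path are placeholders, and the --address flag simply follows the example above; this is not a shipped configuration):

# create a placeholder unit for the proposed single-process daemon
sudo tee /etc/systemd/system/dora-daemon.service >/dev/null <<'EOF'
[Unit]
Description=dora daemon (single process)
After=network-online.target

[Service]
ExecStart=/usr/local/bin/dora daemon --address 0.0.0.0
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# start it now and on every boot
sudo systemctl enable --now dora-daemon.service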

@Hennzau
Collaborator

Hennzau commented Oct 29, 2024

I totally agree with those ideas:

  • Using Zenoh for inter-daemon connection could be great, and I can help!

  • We should definitely make this single-process daemon. However, what's the behavior with multiple dataflows? Your new architecture seems to be "1 daemon for 1 dataflow"; can you explain?

@haixuanTao
Collaborator Author

For multiple dataflows, I would be more in favor of having multiple daemons.

I think it would make things simpler. It would also avoid conflicts between multiple dataflows. Imagine building a dataflow and breaking another one running in parallel.

The only machines that would run multiple dataflows are cloud machines, and I think it would be better if each dataflow had a separate daemon with its own address.

@phil-opp
Collaborator

I fully agree with you that the current setup for multi-machine deployments is cumbersome to use and very difficult to get right. It is just a first prototype without any convenience features.

You mention multiple things, so let me try to reply to them one by one:

  • Regarding "specify IP instead of machine ID in dataflow.yml":
    You propose that we specify the machine by IP address, instead of ID:

    _unstable_deploy:
      machine: 10.14.0.2

    I don't think that it's a good idea to specify the machines like that, because the dataflow.yml file is often committed to git. So if someone else wants to check out your project, they need to manually edit the file to replace all of these IPs with their local IPs. Then their git working directory is dirty, so they have to decide whether they want to commit the IP changes or keep the file around as dirty.

    Also, the IP of the target machine might change. For example, DHCP might assign a new IP to your remote machine after it's restarted. Then you need to update all of your dataflow.yml files to change the old IP to the new. You can also have this situation for cloud machines that are assigned a public IP from a pool.

  • Regarding "connecting cloud and local machines":

    • If we have 2 daemons in different networks (LAN and Internet), there is simply no simple way for dora to make sure those two can communicate. We had the problem at the GOSIM Hackathon that we could not get an external IP, and so we could not connect a cloud machine with our local machine.
      I'm not sure how specifying an IP address in the dataflow would help with that? If the machine has no public IP, how can we connect to it?

    In my understanding, the current approach based on machine IDs should make this easier compared to defining IPs. The idea is that you don't need to know the IP addresses of each daemon (they might not even have a public IP). The only requirement is that the coordinator has a public IP and is reachable by the daemons. Then the coordinator can communicate back to the daemons through that connection. The only remaining challenge is inter-daemon messages in such a situation, which would require some sort of tunneling or custom routing.

    I think using zenoh could help to make this simpler, but for that we also need some kind of identifier for each daemon because it abstracts the IP address away.

  • Regarding "automation":

    • It's hard to automate. I dare you to try to connect 10 local machines at the same time. Meaning:

      • SSH to each individual machine
      • run the dora daemon connection line, explicitly giving the individual machine name, knowing that this might not be a typical Linux machine
      • run the dora coordinator (make sure only 1 is running, as multiple will make it impossible to recover from connecting)
      • run the dora CLI
      • and destroy everything and restart if there is any failure.

    I'm not sure how specifying IP addresses would help with this? Sure, you avoid the machine ID argument, but you still have to record the IP addresses for each machine and assign the nodes to machines.

    My intention was that the process should look like this (see the command sketch at the end of this comment):

    • SSH to the machine where the coordinator should run and start it there.
      • Remember the IP of the coordinator machine
    • SSH to each machine where you want to run a daemon
      • Start the daemon, assigning some unique ID (could be as simple as machine-1, machine-2, etc)
      • Pass the coordinator IP as argument so that the daemon can connect
    • Run the dora CLI to start and stop dataflows
      • If the coordinator is running on a remote machine, we need to specify the coordinator IP as argument
      • In the future, it would be nice to remember the coordinator IP in some way, e.g. through a config file or by having a separate dora connect command
    • If a dataflow fails: Fix your files, then do another dora start
      • You should never need to restart any daemon or coordinator.
      • dora destroy would be only needed if you want to shut down your machines and stop everything that is currently running.
      • (If there are any instances where we need to restart a daemon, we should fix those bugs.)
  • Regarding "using ENV variables to specify IP addresses":

    I would largely prefer using environment variables to hide IPs, as is most of the time done for secrets in Docker, Kubernetes, GitHub Actions, package managers, ...
    Do I understand you correctly that you are thinking of something like this:

        _unstable_deploy:
          machine: IP_MACHINE_1

    In that case, we still need to define some mapping from IP_MACHINE_1 to the actual daemon, right?

  • Regarding "one daemon per dataflow":

    For multiple dataflows, I would be more in favor of having multiple daemons.

    I think it would make things simpler. It would also avoid conflicts between multiple dataflows. Imagine building a dataflow and breaking another one running in parallel.
    I'm not sure how it would make things simpler? With separate daemons per dataflow you would need to do the whole setup routine (ssh to the machines, etc) again and again for every dataflow you want to start, no? Also, you would need to specify which coordinator/daemon you want to connect to for every dora CLI command if there are multiple running in parallel.
    I fully agree that dataflows should not be able to interfere with each other. We already have all the messages namespaced to their dataflow UUID, so there is no way that nodes are receiving messages from different dataflows. There are still robustness issues of course, which can bring the whole daemon down. This is something that we have to work on and improve, but I don't think that changing the whole design of dora brings us to this goal any faster.
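To make the intended workflow above concrete, a rough command transcript could look like this (command and flag names are the ones used elsewhere in this thread; hostnames, IDs, and the way the CLI is pointed at a remote coordinator are placeholders):

# on the coordinator machine (note its IP, e.g. 192.168.0.10)
ssh user@192.168.0.10
dora coordinator

# on each daemon machine, with a unique ID and the coordinator address
ssh user@192.168.0.57
dora daemon --machine-id machine-1 --coordinator-addr 192.168.0.10

# from your laptop: start and stop dataflows via the CLI,
# passing the coordinator address if it is running remotely
dora start dataflow.yaml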

@haixuanTao
Collaborator Author

haixuanTao commented Oct 31, 2024

In my understanding, the current approach based on machine IDs should make this easier compared to defining IPs. The idea is that you don't need to know the IP addresses of each daemon (they might not even have a public IP). The only requirement is that the coordinator has a public IP and is reachable by the daemons. Then the coordinator can communicate back to the daemons through that connection. The only remaining challenge is inter-daemon messages in such a situation, which would require some sort of tunneling or custom routing.

Yes, exactly, the hard part (independently of dora) is the inter-daemon connection.

As mentioned above, I genuinely don't think that we should abstract away the network stack, in the near future at least.

Note that tunneling and custom routing would have to happen on every inter-daemon connection and also be resilient to disconnections.

If we try to use something like SSH, it is extremely hard to keep the connection up all the time with systemctl, as you can have recursive failures.

It really sounds like having some uncommitted env file, and/or having to commit IP addresses that need to be somehow protected, is easier to deal with than trying to abstract away the network layer.

We tried doing SSH tunneling during the GOSIM Hackathon, and it is way too hard for the common roboticist who just wants to connect their model from the cloud or LAN to their robot.

Also, the IP of the target machine might change. For example, DHCP might assign a new IP to your remote machine after it's restarted. Then you need to update all of your dataflow.yml files to change the old IP to the new. You can also have this situation for cloud machines that are assigned a public IP from a pool.

Basically, my problem is that we have to run some imperative command:

dora daemon --coordinator-addr COORDINATOR_ADDR --machine-id abc

to be able to connect to the coordinator and run a process.

This step means that you need some way to either connect to the computer (SSH, ...) or rerun this step on launch (systemctl, ...), and to modify either machine-id or coordinator-addr by HAND if you want to change the coordinator or the name. This step is really not intuitive to me, and I don't see how we can scale this.

Having a simple dora daemon on a robot means that anyone can connect to the robot without having to SSH into it, and it needs zero hand configuration.

Yes, IP addresses are dynamic, but there are plenty of tools to fix them, and I would rather people use DNS/NAT to solve this issue than have them SSH into the robot computer.

My intention was that the process should look like this:

SSH to the machine where the coordinator should run and start it there.
    Remember the IP of the coordinator machine
SSH to each machine where you want to run a daemon
    Start the daemon, assigning some unique ID (could be as simple as machine-1, machine-2, etc)
    Pass the coordinator IP as argument so that the daemon can connect
Run the dora CLI to start and stop dataflows
    If the coordinator is running on a remote machine, we need to specify the coordinator IP as argument
    In the future, it would be nice to remember the coordinator IP in some way, e.g. through a config file or by having a separate dora connect command
If a dataflow fails: Fix your files, then do another dora start
    You should never need to restart any daemon or coordinator.
    dora destroy would be only needed if you want to shut down your machines and stop everything that is currently running.
    (If there are any instances where we need to restart a daemon, we should fix those bugs.)

This genuinely takes an hour to do on 10 robots, where it could have been instant with IP addresses, and it needs to be done on every start. It's also super easy to get wrong, and super annoying to deal with SSH passwords...

_unstable_deploy:
  machine: IP_MACHINE_1

More like:

    _unstable_deploy:
      machine: $IP_MACHINE_1
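For illustration, assuming the proposed $IP_MACHINE_1 substitution were supported (this is a proposal, not an existing dora feature), usage could look like:

# set once per environment, e.g. in an uncommitted .env file or a shell profile
export IP_MACHINE_1=10.14.0.2

# dataflow.yaml stays generic and can be committed to git; the concrete
# address would be resolved from the environment at start time
dora daemon start dataflow.yaml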

@haixuanTao
Collaborator Author

I truly believe that dora should be SSH-free for the most part; otherwise the barrier to entry is going to be very high.

@phil-opp
Collaborator

phil-opp commented Oct 31, 2024

It really sounds like having some uncommitted env file, and/or having to commit IP addresses that need to be somehow protected, is easier to deal with than trying to abstract away the network layer.

But how does this avoid tunneling/custom routing? If the machine has no public IP, how can you reach it?

Having a simple dora daemon on a robot means that anyone can connect to the robot without having to SSH into it, and it needs zero hand configuration.

Thanks for clarifying your use case. I understand that you're using a robot that you want to reboot repeatedly and you want to avoid doing manual work on every reboot, right?

I don't think that we have to change the whole design of dora for this. It would probably be enough to have some kind of "remote configuration" feature. For example, something like this:

  • Add a dora daemon --listen-for-remote-config <port> argument (the arg name is only a placeholder)
  • When started with this argument, the daemon will listen on the specified port for connections from the coordinator
  • We add a coordinator config that could look something like this:
     machines:
         - machine_1: dynamic
         - machine_2: 192.168.0.57:8080
         - machine_3: 192.168.0.64:1234
  • When starting the coordinator, we pass this config file as argument
    • Machines set to dynamic are treated like before (i.e. daemon initiates the connection to the coordinator)
    • For machines set to IP addresses, the coordinator sends an init_with_config message to the specified IP/port.
    • This message contains the machine ID.
    • Upon receiving this config message, the daemon uses the received values as --machine-id and --coordinator-address arguments.

This way, you could add the dora daemon --listen-for-remote-config <port> command to your startup commands on the robot and you would never need to touch it on a reboot. The dataflow.yml file also remains unchanged and independent of the local network setup. The daemon IP addresses are specified in a new coordinator config file that you can apply to multiple dataflows. And the changes to dora are minimal, so we can implement this quickly. What do you think?
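As a sketch, the startup commands under this proposal might look roughly like the following (all flag names are placeholders, as noted above, and none of them exist in dora today):

# on each robot, as a boot-time service:
# wait for the coordinator to push the machine ID and coordinator address
dora daemon --listen-for-remote-config 9000

# on the coordinator machine, passing the machines config from above
dora coordinator --machines machines.yml   # flag name is a placeholder

# from the CLI, unchanged
dora start dataflow.yaml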

@phil-opp
Collaborator

This genuinely takes an hour to do on 10 robots, where it could have been instant with IP addresses, and it needs to be done on every start.

I think the part that confused me was that it needs to be "done on every start". That's because you want to completely reboot your robot in between runs, am I understanding this right? Because for normal systems, you could just leave the coordinator and daemons running and reuse them for the next dora start.

@phil-opp
Collaborator

Another possible alternative:

Use zenoh for daemon<->coordinator connection and rely on multicast messages for discovery. This would allow the daemon to send some kind of register message to the whole local network, which the coordinator could listen for. Then the coordinator could assign a machine ID to the daemon. This way, you would not need to specify the --machine-id and --coordinator-addr arguments either, without requiring the extra --listen-for-remote-config argument.

For multi-network deployments (e.g. cloud), you would still need to define some zenoh router when starting the dora daemon. However, this could be part of an env variable that you set only once when you set up your cluster.
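For example (a hypothetical sketch; the variable name is a placeholder rather than an existing dora or zenoh option), the one-time cluster setup could be as small as:

# set once per machine when the cluster is created;
# tcp/<ip>:7447 is the usual zenoh locator format
export DORA_ZENOH_ROUTER=tcp/203.0.113.10:7447

# afterwards, daemons and coordinator would find each other through the router
dora daemon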

@haixuanTao
Collaborator Author

haixuanTao commented Nov 1, 2024

It really sounds like having some uncommitted env file, and/or having to commit IP addresses that need to be somehow protected, is easier to deal with than trying to abstract away the network layer.

But how does this avoid tunneling/custom routing? If the machine has no public IP, how can you reach it?

The idea is that we don't want to build an abstraction layer that connects daemons and exposes something that might not work.

If the IP is something like 127.0.0.1, it is explicit that this is not going to work.

The thing is that we have to let the user figure out how they are going to route the IPs, rather than: put in some machine ID, connect to a public coordinator, and hope it works, when we actually have no idea whether it can work or not.

I don't think that we have to change the whole design of dora for this. It would probably be enough to have some kind of "remote configuration" feature. For example, something like this:

  • Add a dora daemon --listen-for-remote-config <port> argument (the arg name is only a placeholder)

  • When started with this argument, the daemon will listen on the specified port for connections from the coordinator

  • We add a coordinator config that could look something like this:
     machines:
         - machine_1: dynamic
         - machine_2: 192.168.0.57:8080
         - machine_3: 192.168.0.64:1234

  • When starting the coordinator, we pass this config file as argument

    • Machines set to dynamic are treated like before (i.e. daemon initiates the connection to the coordinator)
    • For machines set to IP addresses, the coordinator sends an init_with_config message to the specified IP/port.
    • This message contains the machine ID.
    • Upon receiving this config message, the daemon uses the received values as --machine-id and --coordinator-address arguments.

This way, you could add the dora daemon --listen-for-remote-config <port> command to your startup commands on the robot and you would never need to touch it on a reboot. The dataflow.yml file also remains unchanged and independent of the local network setup. The daemon IP addresses are specified in a new coordinator config file that you can apply to multiple dataflows. And the changes to dora are minimal, so we can implement this quickly. What do you think?

I'm sorry, but there are already so many steps, and we want to add an additional 4.

So the workflow is going to be:

  • Boot the daemon
  • Boot the coordinator and pass the config file
  • Daemon receives the config file and registers the coordinator
  • Coordinator registers the daemon and machine ID
  • CLI connects to the coordinator and sends the start command
  • Coordinator sends the command to the daemon
  • Daemon connects to other daemons
  • Daemon starts the dataflow

It is simply impossible for me to see this being consistently reliable, while we could have just:

  • the dora daemon connects to other daemons using direct addresses
  • the dora daemon starts the dataflow

And this does not even resolve the problem that we're hiding the risk of daemons not connecting.

@haixuanTao
Collaborator Author

haixuanTao commented Nov 1, 2024

Another possible alternative:

Use zenoh for daemon<->coordinator connection and rely on multicast messages for discovery. This would allow the daemon to send some kind of register message to the whole local network, which the coordinator could listen for. Then the coordinator could assign a machine ID to the daemon. This way, you would not need to specify the --machine-id and --coordinator-addr arguments either, without requiring the extra --listen-for-remote-config argument.

For multi-network deployments (e.g. cloud), you would still need to define some zenoh router when starting the dora daemon. However, this could be part of an env variable that you set only once when you set up your cluster.

This sounds really complicated, while most of the time you can easily find the IP address of the robot you want to connect to. I genuinely don't think that finding an IP address is hard, compared to setting up a whole zenoh cluster.

@haixuanTao
Collaborator Author

This is something that we have to work on and improve, but I don't think that changing the whole design of dora brings us to this goal any faster.

I mean, we have two well-identified issues:

Both have been open for close to 2 months.

I really think that there is a limit to the complexity we can handle, and I don't see from our discussion how we can make it work and improve it, given the development speed we are able to sustain.

@phil-opp
Collaborator

phil-opp commented Nov 1, 2024

It is simply impossible for me to see this being consistently reliable, while we could have just:

  • the dora daemon connects to other daemons using direct addresses

But how would that work in detail? The daemons don't know the IP addresses of each other. The only way I see is that we use the IPs specified in the dataflow.yml file, which is only known when a dataflow is started. So each of the daemons would need to listen on some public IP/port for incoming dataflow YAML files and then do the following:

  • initialize a connection to all specified daemons
    • this takes a bit of time, which will delay the dataflow start
  • query its own public IP address to find out which nodes it should run
  • prepare these nodes
  • wait until all other daemons are ready
  • do a synchronized start
    • which daemon decides when it's time to start?
  • dora daemon start dataflow

This sounds like you want to remove both the coordinator and the CLI? And that the dora daemon start command then performs all the coordinator tasks of coordinating the other daemons and collecting logs?

I fear that such a drastic change of the design would result in a lot of additional work. I think that there are faster and easier ways to solve the mentioned issues.

This is something that we have to work on and improve, but I don't think that changing the whole design of dora brings us to this goal any faster.

I mean, we have two well-identified issues: [...]
Both have been open for close to 2 months.

I really think that there is a limit to the complexity we can handle, and I don't see from our discussion how we can make it work and improve it, given the development speed we are able to sustain.

I'm not sure how these issues are related?

I'm aware that we have many many things on our plate. That's why I think that we don't have the capacity to rearchitect dora completely. Redesigning a daemon communication mechanism that works in a distributed way without a coordinator sounds like a lot of work and like an additional source of complexity. A centralized coordinator that has full control of all the daemons makes the design much less complex in my opinion.

@phil-opp
Collaborator

phil-opp commented Nov 1, 2024

Let's maybe take a step back. One of our initial design goals was that the dataflows could be controlled from computers that are not part of the dataflow. For example, that you could use the dora CLI on your laptop to control a dataflow that is running on some cloud machines. To be able to support this use case, we need some entity that the CLI can connect to. That was the motivation for creating the dora coordinator.

Assuming a simple network, all the nodes could directly communicate with each other using TCP messages or shared memory. This doesn't require a daemon, but the daemon makes things easier for the nodes. Without it, each node would need to be aware of the whole dataflow topology and maintain its own connections to other nodes.


The difficult part is to create all of these network connections, especially if the network topology is more complex. It doesn't really matter which entity creates these connections. So I'm not sure how removing the coordinator would simplify things.

We can of course simplify things if we require simple network topologies. If we assume that the CLI always runs on the same machine as the coordinator and one of the daemons, and that the remote daemons are in the same network and reachable by everyone through the same IP, we can of course remove a lot of complexity. But we also lose functionality and are no longer able to support certain use cases.

@phil-opp
Collaborator

phil-opp commented Nov 1, 2024

I feel like there are a lot of valid points and important things in this issue thread, but it's becoming difficult to follow. I think it would be a good idea to first collect the different problems, pain points, and use cases we want to improve. Ideally, we then split them into separate discussions. It's probably a good idea to avoid suggesting specific changes in the initial post.

Then we can propose potential solutions as comments and discuss them. I think that we would achieve a more productive discussion this way.

Edit: I started creating some discussions for the problems and usability issues mentioned in this thread:

I also added proposals for solutions to each discussion. Of course feel free to add alternative proposals!
