Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable downloading remote dataflow #682

Merged
merged 2 commits into from
Oct 11, 2024

Conversation

haixuanTao
Copy link
Collaborator

This PR makes it possible to start a dataflow from scratch without having to git clone a repository.

Example

dora up
dora build https://gist.githubusercontent.com/haixuanTao/b0399d7b82273c4d2a32d4394bd80942/raw/09f7f2927d51c1431ffd7834bbab558cd7343cbb/camera.yaml
dora start https://gist.githubusercontent.com/haixuanTao/b0399d7b82273c4d2a32d4394bd80942/raw/09f7f2927d51c1431ffd7834bbab558cd7343cbb/camera.yaml

Video

Screencast.from.2024-10-09.12-11-57.webm

Small breaking change

  • This slighlty change the logic of download_file function to use the predefined downloaded file name instead of changing it when downloading.
  • It removes the caching of download_file as there is now no way to know if the filename and path is going to be the same before hand. I think that if users want to have caching, they should link the path before hand instead of using the url. This can be done in conjunction of a build wget or curl command.

@haixuanTao haixuanTao marked this pull request as ready for review October 9, 2024 10:50
@haixuanTao haixuanTao requested a review from phil-opp October 9, 2024 10:53
@haixuanTao haixuanTao merged commit 903cb23 into main Oct 11, 2024
40 checks passed
@haixuanTao haixuanTao deleted the enable-downloading-remote-dataflow branch October 11, 2024 03:48

fn resolve_dataflow(dataflow: String) -> eyre::Result<PathBuf> {
let dataflow = if source_is_url(&dataflow) {
// try to download the shared library
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- // try to download the shared library
+ // try to download the dataflow YAML file

Comment on lines -104 to +106
let target_path = Path::new("build")
.join(node_id.to_string())
.with_extension(EXE_EXTENSION);
download_file(source, &target_path)
let target_dir = Path::new("build");
download_file(source, &target_dir)
.await
.wrap_err("failed to download custom node")?;
target_path.clone()
.wrap_err("failed to download custom node")?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify the reason for this breaking change?

By renaming the downloaded files we avoided collisions. For example, consider:

nodes:
  - id: camera
    path: example.com/camera/node
    inputs:
      tick: dora/timer/millis/20
    outputs:
      - image

  - id: plot
    path: example.com/plot/node
    inputs:
      image: camera/image

With the new logic, the second download overwrites the first, as both nodes are stored as build/node. It probably still works as expected on Linux because the executables are directly started, but I could imagine that there will be some errors on Windows (since Windows disallows removing executable files that are still running). Also, this collision would prevent us from adding some kind of caching again in the future.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand you're reasoning.

But, I really don't think we should rename file on the fly as it makes it very hard to keep track of which file correspond to what.

I think that if people name everything "node.py" and colludes with names, I don't think that changing every name is the right solution.

Copy link
Collaborator Author

@haixuanTao haixuanTao Oct 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would very much prefer people use build for caching, in the likes of something like:

  - id: plot
    build: wget -O plot.py example.com/plot/node
    path: node.py
    inputs:
      image: camera/image

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that if people name everything "node.py" and colludes with names, I don't think that changing every name is the right solution.

The thing is that the two node.py files might come from different sources. For example, they could be part of different git repositories that are not your own. In this case, you have no way to avoid the name conflict.

But, I really don't think we should rename file on the fly as it makes it very hard to keep track of which file correspond to what.

For the above example, you don't have the source for one of the nodes at all (because it's overwritten).

I would very much prefer people use build for caching, in the likes of something like:

Sure, this is a good alternative. But if we support pointing path to remote URLs, we should be able to deal with potential name collisions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, this is a good alternative. But if we support pointing path to remote URLs, we should be able to deal with potential name collisions.

Ok, so I guess, what we could do is fail on pre-existing files. That would make the most sense to me.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other alternative would be renaming, either always or only on conflict. If we don't want to do that, I agree that failing is better than silently overwriting the files of other nodes.

let target_path = adjust_shared_library_path(
&Path::new("build")
.join(node_id.to_string())
.join(operator_id.to_string()),
)?;
let target_path = &Path::new("build");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for this adjust_shared_library_path was that you can provide cross-platform URLs. E.g. example.com/operator would become example.com/liboperator.so on Linux and example.com/operator.dll on Windows. This is a bit hacky, but without this adjust you need different dataflow files for Windows and Linux.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand, but I honestly find it super hard to keep track about all the rules that we had rather than having simple link.

I would rather prefer user implement their own url resolver for cross platform than choosing a url definition that people might need to remember.

this might resolve some issues as you might also want to deal with architecture (x86 or arm) and I don't think we should expect people to build library for all platform.

Copy link
Collaborator Author

@haixuanTao haixuanTao Oct 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let say I want to use the current status quo for something like:

-> libopencv_plot.so
-> I need to push into a server and need to be referenced at name url.com/libopencv_plot.so any other link will not work.
-> I need to remember to put it in my node as: url.com/opencv_plot which is not obvious
-> And if there is a bug from me or someone else it's going to be: Something went wrong with build/libplot.so for example. Which I need to remember links to url.com/opencv_plot which is my libopencv_plot.so

Knowing that none of this is going to be properly documented in the near future...

Copy link
Collaborator Author

@haixuanTao haixuanTao Oct 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being cross platform could be as simple as:

  - id: plot
    build: }
        wget -O plot.zip example.com/plot/$OSTYPE/node/plot.zip
        unzip plot.zip
    path: libplot.so
    inputs:
      image: camera/image

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fully agree that the status quo solution was hacky and not ideal. Of course, things get simpler if we simply ignore the platform differences. However, if we want dora to be cross-platform, we should provide some way to use this feature for cross-platform dataflows.

Being cross platform could be as simple as:

[...]
path: libplot.so
[...]

The issue is that the file would need to be named plot.dll if you want to run it on Windows.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we decided to remove operator support in the near future, so we might not have to deal with this anymore. However, the same things also applies to executables, which need to have an .exe suffix on Windows.

Would you be fine with example.com/path/node expanding to example.com/path/node.exe when run on Windows? Or would you prefer some solution with explicit placeholders, e.g. example.com/path/node{EXE_SUFFIX}? Or do you want to make this a "one-platform-only" feature that does not do any path adjustments at all (so that you need a separate dataflow_Windows.yml with adjusted paths)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in windows you can run an exe without the .exe suffix so we can always call something like target/release/node without the exe extension. So I would prefer defaulting to this.

See: https://stackoverflow.com/questions/44441265/calling-an-exe-without-the-extension-but-with-i

FYI, i hope that most binary that we ship for dora can be packaged in a way that is pip or cargo installable so that everything is contained in PATH by default and easy to find and deploy.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's useful for local paths, but unfortunately not for URLs. We need different binaries on Windows and Linux, so downloading the same file would not help. Also, the downloaded file would probably need an .exe extension to be recognized as an executable file on Windows, no?

However, even if we added a .exe extension automatically, we would still have the issue that there are different CPU architectures etc. So placeholders is probably the way to go, e.g. example.com/path/{target}/node, which would expand to example.com/path/x86_64-unknown-linux-gnu/node on an x86_64 Linux system. Alternatively, we could of course support separate URLs for each target, e.g.:

path:
  x86_64-unknown-linux-gnu: https://example.com/path/x86/linux/node
  x86_64-pc-windows-msvc: https://example.com/path/x86/win/node.exe

(Of course, this is not a high priority right now.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants