Enable downloading remote dataflow #682

haixuanTao · 2024-10-09T10:22:44Z

This PR makes it possible to start a dataflow from scratch without having to git clone a repository.

Example

dora up
dora build https://gist.githubusercontent.com/haixuanTao/b0399d7b82273c4d2a32d4394bd80942/raw/09f7f2927d51c1431ffd7834bbab558cd7343cbb/camera.yaml
dora start https://gist.githubusercontent.com/haixuanTao/b0399d7b82273c4d2a32d4394bd80942/raw/09f7f2927d51c1431ffd7834bbab558cd7343cbb/camera.yaml

Video

Screencast.from.2024-10-09.12-11-57.webm

Small breaking change

This slighlty change the logic of download_file function to use the predefined downloaded file name instead of changing it when downloading.
It removes the caching of download_file as there is now no way to know if the filename and path is going to be the same before hand. I think that if users want to have caching, they should link the path before hand instead of using the url. This can be done in conjunction of a build wget or curl command.

…e within the url to get the path avoiding to create confusion on the name of the file.

phil-opp · 2024-10-11T10:07:41Z

binaries/cli/src/main.rs

+
+fn resolve_dataflow(dataflow: String) -> eyre::Result<PathBuf> {
+    let dataflow = if source_is_url(&dataflow) {
+        // try to download the shared library


- // try to download the shared library + // try to download the dataflow YAML file

phil-opp · 2024-10-11T10:12:58Z

binaries/daemon/src/spawn.rs

-                        let target_path = Path::new("build")
-                            .join(node_id.to_string())
-                            .with_extension(EXE_EXTENSION);
-                        download_file(source, &target_path)
+                        let target_dir = Path::new("build");
+                        download_file(source, &target_dir)
                            .await
-                            .wrap_err("failed to download custom node")?;
-                        target_path.clone()
+                            .wrap_err("failed to download custom node")?


Could you clarify the reason for this breaking change?

By renaming the downloaded files we avoided collisions. For example, consider:

nodes: - id: camera path: example.com/camera/node inputs: tick: dora/timer/millis/20 outputs: - image - id: plot path: example.com/plot/node inputs: image: camera/image

With the new logic, the second download overwrites the first, as both nodes are stored as build/node. It probably still works as expected on Linux because the executables are directly started, but I could imagine that there will be some errors on Windows (since Windows disallows removing executable files that are still running). Also, this collision would prevent us from adding some kind of caching again in the future.

I understand you're reasoning.

But, I really don't think we should rename file on the fly as it makes it very hard to keep track of which file correspond to what.

I think that if people name everything "node.py" and colludes with names, I don't think that changing every name is the right solution.

I would very much prefer people use build for caching, in the likes of something like:

- id: plot build: wget -O plot.py example.com/plot/node path: node.py inputs: image: camera/image

I think that if people name everything "node.py" and colludes with names, I don't think that changing every name is the right solution.

The thing is that the two node.py files might come from different sources. For example, they could be part of different git repositories that are not your own. In this case, you have no way to avoid the name conflict.

But, I really don't think we should rename file on the fly as it makes it very hard to keep track of which file correspond to what.

For the above example, you don't have the source for one of the nodes at all (because it's overwritten).

I would very much prefer people use build for caching, in the likes of something like:

Sure, this is a good alternative. But if we support pointing path to remote URLs, we should be able to deal with potential name collisions.

Sure, this is a good alternative. But if we support pointing path to remote URLs, we should be able to deal with potential name collisions.

Ok, so I guess, what we could do is fail on pre-existing files. That would make the most sense to me.

The other alternative would be renaming, either always or only on conflict. If we don't want to do that, I agree that failing is better than silently overwriting the files of other nodes.

phil-opp · 2024-10-11T10:20:59Z

binaries/runtime/src/operator/shared_lib.rs

-        let target_path = adjust_shared_library_path(
-            &Path::new("build")
-                .join(node_id.to_string())
-                .join(operator_id.to_string()),
-        )?;
+        let target_path = &Path::new("build");


The reason for this adjust_shared_library_path was that you can provide cross-platform URLs. E.g. example.com/operator would become example.com/liboperator.so on Linux and example.com/operator.dll on Windows. This is a bit hacky, but without this adjust you need different dataflow files for Windows and Linux.

I understand, but I honestly find it super hard to keep track about all the rules that we had rather than having simple link.

I would rather prefer user implement their own url resolver for cross platform than choosing a url definition that people might need to remember.

this might resolve some issues as you might also want to deal with architecture (x86 or arm) and I don't think we should expect people to build library for all platform.

Let say I want to use the current status quo for something like:

-> libopencv_plot.so
-> I need to push into a server and need to be referenced at name url.com/libopencv_plot.so any other link will not work.
-> I need to remember to put it in my node as: url.com/opencv_plot which is not obvious
-> And if there is a bug from me or someone else it's going to be: Something went wrong with build/libplot.so for example. Which I need to remember links to url.com/opencv_plot which is my libopencv_plot.so

Knowing that none of this is going to be properly documented in the near future...

Being cross platform could be as simple as:

- id: plot build: } wget -O plot.zip example.com/plot/$OSTYPE/node/plot.zip unzip plot.zip path: libplot.so inputs: image: camera/image

I fully agree that the status quo solution was hacky and not ideal. Of course, things get simpler if we simply ignore the platform differences. However, if we want dora to be cross-platform, we should provide some way to use this feature for cross-platform dataflows.

Being cross platform could be as simple as:

[...] path: libplot.so [...]

The issue is that the file would need to be named plot.dll if you want to run it on Windows.

I guess we decided to remove operator support in the near future, so we might not have to deal with this anymore. However, the same things also applies to executables, which need to have an .exe suffix on Windows.

Would you be fine with example.com/path/node expanding to example.com/path/node.exe when run on Windows? Or would you prefer some solution with explicit placeholders, e.g. example.com/path/node{EXE_SUFFIX}? Or do you want to make this a "one-platform-only" feature that does not do any path adjustments at all (so that you need a separate dataflow_Windows.yml with adjusted paths)?

So in windows you can run an exe without the .exe suffix so we can always call something like target/release/node without the exe extension. So I would prefer defaulting to this.

See: https://stackoverflow.com/questions/44441265/calling-an-exe-without-the-extension-but-with-i

FYI, i hope that most binary that we ship for dora can be packaged in a way that is pip or cargo installable so that everything is contained in PATH by default and easy to find and deploy.

That's useful for local paths, but unfortunately not for URLs. We need different binaries on Windows and Linux, so downloading the same file would not help. Also, the downloaded file would probably need an .exe extension to be recognized as an executable file on Windows, no?

However, even if we added a .exe extension automatically, we would still have the issue that there are different CPU architectures etc. So placeholders is probably the way to go, e.g. example.com/path/{target}/node, which would expand to example.com/path/x86_64-unknown-linux-gnu/node on an x86_64 Linux system. Alternatively, we could of course support separate URLs for each target, e.g.:

path: x86_64-unknown-linux-gnu: https://example.com/path/x86/linux/node x86_64-pc-windows-msvc: https://example.com/path/x86/win/node.exe

(Of course, this is not a high priority right now.)

haixuanTao added 2 commits October 9, 2024 12:10

Make downloading file use content_dosposition header or the filenam…

3775a3c

…e within the url to get the path avoiding to create confusion on the name of the file.

use download_file when calling build or start for a url dataflow

b494a47

haixuanTao marked this pull request as ready for review October 9, 2024 10:50

haixuanTao requested a review from phil-opp October 9, 2024 10:53

haixuanTao merged commit 903cb23 into main Oct 11, 2024
40 checks passed

haixuanTao deleted the enable-downloading-remote-dataflow branch October 11, 2024 03:48

phil-opp reviewed Oct 11, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable downloading remote dataflow #682

Enable downloading remote dataflow #682

haixuanTao commented Oct 9, 2024

phil-opp Oct 11, 2024

phil-opp Oct 11, 2024

haixuanTao Oct 12, 2024

haixuanTao Oct 12, 2024 •

edited

Loading

phil-opp Oct 30, 2024

haixuanTao Oct 31, 2024

phil-opp Oct 31, 2024

phil-opp Oct 11, 2024

haixuanTao Oct 12, 2024

haixuanTao Oct 12, 2024 •

edited

Loading

haixuanTao Oct 12, 2024 •

edited

Loading

phil-opp Oct 30, 2024

phil-opp Oct 30, 2024

haixuanTao Oct 31, 2024

phil-opp Oct 31, 2024

Enable downloading remote dataflow #682

Enable downloading remote dataflow #682

Conversation

haixuanTao commented Oct 9, 2024

Example

Video

Small breaking change

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

haixuanTao Oct 12, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

haixuanTao Oct 12, 2024 • edited Loading

Choose a reason for hiding this comment

haixuanTao Oct 12, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

haixuanTao Oct 12, 2024 •

edited

Loading

haixuanTao Oct 12, 2024 •

edited

Loading

haixuanTao Oct 12, 2024 •

edited

Loading