Print only first node error and report more metadata in dataflow results #552

phil-opp · 2024-06-12T17:07:12Z

Allows us to mark certain node errors as cascading, which will deprioritize them when printed.

Also updates the error printing code to only print the error that happened first.

Example Output:

Node rust-sink exits with a success exit status before initializing dora connection:

❯ dora start dataflow.yml --attach
01901beb-7fd7-7183-8584-8e6284c84031
dataflow failed: Node `dora-record` failed: exited with code: 1

This error occurred because node `rust-sink` exited before connecting to dora.

There are 2 more errors. Check the `out/01901beb-7fd7-7183-8584-8e6284c84031` folder for full details.

Node rust-sink errored before initialization:

❯ dora start dataflow.yml --attach
01902594-d2e4-7b4f-a2d5-0d3e059a52fa
dataflow failed: Node `rust-sink` failed: exited with code: 1

Stderr output:
---------------------------------------------------------------------------------
Error: foo

Location:
    examples/rust-dataflow/sink/src/main.rs:5:5

[...]
---------------------------------------------------------------------------------


There are 3 more errors. Check the `out/01902594-d2e4-7b4f-a2d5-0d3e059a52fa` folder for full details.

TODO:

- Mark node failures as cascading when initialization fails because another node exited too early
- Testing
- Load stderr
- Only skip cascading errors
- Add error cause for grace period kills
- Separate PR: Add --detach arg to dora start and make --attach the default
  - Opened separate PR in "Make dora start attach by default, add --detach to opt-out" #561

haixuanTao · 2024-06-13T12:51:43Z

How is this related to #542 ?

Should we merge 542 first?

And if so should i make change within 542?

phil-opp · 2024-06-13T14:27:32Z

This PR isn't finished yet, but it uses a different approach than #542. The idea is that we report all node errors to the CLI in their original form and include some additional metadata (e.g. timestamp). This way, we can do the printing in the CLI based on all the available information. So the CLI could print a short summary by default and provide options to print more details. Also, we can then format the errors in a nicer way because the CLI is aware of the terminal size and can use colors and text decorations (e.g. bold text).
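
A rough sketch of this idea, not the actual dora types: each node error carries its raw message plus a timestamp, and the CLI decides what to show. The names `NodeError`, `first_error`, and `summarize` are hypothetical and only serve the illustration.

```rust
use std::time::SystemTime;

/// Hypothetical per-node error record, as it might be reported to the CLI.
#[derive(Debug, Clone, PartialEq)]
pub struct NodeError {
    /// ID of the node that failed.
    pub node_id: String,
    /// When the failure was observed.
    pub timestamp: SystemTime,
    /// Raw, unformatted error description.
    pub message: String,
}

/// Returns the error that happened first, or `None` if there were no errors.
pub fn first_error(errors: &[NodeError]) -> Option<&NodeError> {
    errors.iter().min_by_key(|e| e.timestamp)
}

/// Builds a short default summary; the caller decides how to print it.
pub fn summarize(errors: &[NodeError], out_dir: &str) -> Option<String> {
    let first = first_error(errors)?;
    let mut text = format!(
        "dataflow failed: Node `{}` failed: {}",
        first.node_id, first.message
    );
    // `errors` is non-empty here because `first_error` returned `Some`.
    let remaining = errors.len() - 1;
    if remaining > 0 {
        text.push_str(&format!(
            "\n\nThere are {remaining} more errors. Check the `{out_dir}` folder for full details."
        ));
    }
    Some(text)
}
```

Since the formatting lives in one place on the CLI side, a more verbose mode could reuse the same `NodeError` list and simply print every entry.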

The init function returns an error if another node exited before initialization. In this case, we consider the subsequent error as a cascading error and skip it when printing the error message to the user.

Before this commit, there were some cases where the returned `DataflowStatus` was ignored and the reported `cascading_errors` were never applied.

phil-opp · 2024-06-17T10:12:39Z

@haixuanTao I updated the PR description with some example output. Please let me know what you think!

As follow-up changes we can implement your suggestion from #548 (comment) and maybe some nicer terminal formatting (colors, text decoration, proper lines, etc).

haixuanTao

I'm slightly confused, as this only prints out one error. I would expect it to print all of them if they happen at the same time.

Also, I don't like that the default dora start does not return anything in case of an error. Can we make attach the default behaviour? Otherwise, people starting with dora are going to be confused.

libraries/core/src/topics.rs

binaries/daemon/src/pending.rs

haixuanTao · 2024-06-17T14:55:10Z

There also seems to be no way to see the reasons for the other errors, whether they are similar to the first one or not related at all.

019026ac-8bd8-7b67-bec4-62b0a9f3fca1
^Cdataflow failed: Node `reachy-vision-client` failed: exited because of signal SIGKILL

There are 3 more errors. Check the `out/019026ac-8bd8-7b67-bec4-62b0a9f3fca1` folder for full details.

For instance, in this example I probably have an error about the grace duration on the first node, but there is no way for me to know what the other errors were.

In the past we used to have grace duration information, and now we only have SIGKILL.

phil-opp · 2024-06-17T15:12:54Z

> I'm slightly confused, as this only prints out one error. I would expect it to print all of them if they happen at the same time.

Sure, we can print them all. The error filtering and formatting now happens on the CLI side, so we can be as verbose as we like. We probably still want to skip cascading errors (i.e. "node failed because another node exited before init"), right?

> Also, I don't like that the default dora start does not return anything in case of an error. Can we make attach the default behaviour? Otherwise, people starting with dora are going to be confused.

Fine with me! We could add a --detach argument to run it in the background and make dora start equivalent to dora start --attach. I'll prepare a separate PR for that.

> In the past we used to have grace duration information, and now we only have SIGKILL.

The warning is still printed by the daemon, it's just not visible if the daemon is spawned in background (e.g. using dora up) or on a remote machine. The proper solution for this would be to add a new exit cause for this and communicate it back to the CLI. I'll try to implement something like this tomorrow.

phil-opp · 2024-06-17T15:41:14Z

> I'm slightly confused, as this only prints out one error. I would expect it to print all of them if they happen at the same time.

I pushed 94fecd6 to print all node errors that are not consequential errors (e.g. "node X exited too early").
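
As an illustration of that filtering (a sketch with hypothetical names, not the code from the commit): errors are split into the ones worth printing and a count of skipped consequential ones. The fallback for the "everything is cascading" case is my own assumption.

```rust
/// Hypothetical, simplified error cause.
#[derive(Debug, Clone, PartialEq)]
pub enum Cause {
    /// The node failed only because `caused_by_node` exited too early.
    Cascading { caused_by_node: String },
    /// Any other failure.
    Other,
}

/// Hypothetical node error with just enough data for the filtering step.
#[derive(Debug, Clone, PartialEq)]
pub struct NodeError {
    pub node_id: String,
    pub cause: Cause,
}

/// Returns the errors to print and the number of skipped cascading errors.
pub fn errors_to_print(errors: &[NodeError]) -> (Vec<&NodeError>, usize) {
    let (cascading, primary): (Vec<&NodeError>, Vec<&NodeError>) = errors
        .iter()
        .partition(|e| matches!(e.cause, Cause::Cascading { .. }));
    if primary.is_empty() {
        // Assumption: if only cascading errors exist, show them instead of nothing.
        (cascading, 0)
    } else {
        let skipped = cascading.len();
        (primary, skipped)
    }
}
```

The returned count could feed a trailing "There are N consequential errors" line, while the returned errors are printed in full.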

haixuanTao · 2024-06-18T08:22:46Z

I think it is important to report errors when they happen in this PR. Currently, if a node fails but the others keep on running, the dataflow can run forever before an error message is emitted.

The only way to know that there is an error is to use ctrl+c.

haixuanTao · 2024-06-18T09:21:14Z

libraries/core/src/topics.rs

+            NodeErrorCause::Other { stderr } if stderr.is_empty() => {}
+            NodeErrorCause::Other { stderr } => {
+                let line: &str = "---------------------------------------------------------------------------------";
+                write!(f, "\n\nStderr output:\n{line}\n{stderr}\n[...]\n{line}\n")?


The [...] seems to suggest that there is more to the error message than there is.

Could we remove it?

Also:

Can we make:

Node `idefics2` failed: exited with code: 1

Stderr output:

Be one line instead of 3 so that it looks like:

- Node 018ffe7b-39d7-725a-a812-ac9c220a11c8/plot exited with code 1 with error (stderr):

Just as in #542

> The [...] seems to suggest that there is more to the error message than there is.
>
> Could we remove it?

Parts of the error message might be cut off. I thought that it might be less confusing with a [...] marker in that case (but the marker should be at the beginning, not the end). For example, consider the following error message: "some\nmultiline\nerror\nmessage\nwith\nlots\nof\nlines\nand\nstuff\nand\nthings". It appears like this in the terminal:

Stderr output:
---------------------------------------------------------------------------------
[...]
message
with
lots
of
lines
and
stuff
and
things
---------------------------------------------------------------------------------

Without the leading [...] it's not obvious that relevant parts of the error message might be omitted.
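
A minimal sketch of the truncation described here, assuming a fixed line limit: keep only the trailing lines and put the `[...]` marker first whenever something was dropped. The helper name `tail_with_marker` is made up for this example.

```rust
/// Keeps at most `max_lines` trailing lines of `stderr`.
/// A leading `[...]` line marks that earlier lines were dropped.
pub fn tail_with_marker(stderr: &str, max_lines: usize) -> String {
    let lines: Vec<&str> = stderr.lines().collect();
    if lines.len() <= max_lines {
        // Nothing was cut off, so no marker is needed.
        return lines.join("\n");
    }
    let mut kept = vec!["[...]"];
    kept.extend_from_slice(&lines[lines.len() - max_lines..]);
    kept.join("\n")
}
```

With the marker at the top, a reader sees right away that the first visible line is not necessarily the start of the error.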

> [...] Node 018ffe7b-39d7-725a-a812-ac9c220a11c8/plot exited [...]

I don't think that we need to repeat the dataflow UUID again. It's always the same and it just makes the node ID more difficult to read.

> Can we make:
>
> Node `idefics2` failed: exited with code: 1
>
> Stderr output:
>
> Be one line instead of 3

You mean like this:

- Node `rust-node` failed: exited with code 1 with stderr output:
---------------------------------------------------------------------------------
[...]
some multiline error message
---------------------------------------------------------------------------------

I think it's easier to read without the leading - before Node, provided that we keep the lines before and after the stderr output.

I like the lines around stderr because it makes things easier to read when multiple nodes fail. For example, compare:

Dataflow 01903a8d-e26e-74f7-8cff-5edc90c1de0b failed:
- Node `rust-node` failed: exited with code 1 with stderr output:
[...]
Traceback (most recent call last):
  File "main.py", line 9, in <module>
    do_stuff()
  File "main.py", line 5, in do_stuff
    raise Exception("test exception")
Exception: test exception
- Node `rust-sink` failed: exited with code 1 with stderr output:
[...]
Error: Failed to read instrs from ./path/to/instrs.json
Caused by:
    No such file or directory (os error 2)

with:

Dataflow 01903a8f-e07f-7136-aad3-214da35d7158 failed:
Node `rust-node` failed: exited with code 1 with stderr output:
---------------------------------------------------------------------------------
[...]
Traceback (most recent call last):
  File "main.py", line 9, in <module>
    do_stuff()
  File "main.py", line 5, in do_stuff
    raise Exception("test exception")
Exception: test exception
---------------------------------------------------------------------------------
Node `rust-sink` failed: exited with code 1 with stderr output:
---------------------------------------------------------------------------------
[...]
Error: Failed to read instrs from ./path/to/instrs.json
Caused by:
    No such file or directory (os error 2)
---------------------------------------------------------------------------------

To me, the second example seems much more readable.

I like the variant with more spacing even more:

Dataflow 01903a92-8c73-770e-b2ca-840dd8c60d0b failed:

Node `rust-node` failed: exited with code 1

Stderr output:
---------------------------------------------------------------------------------
[...]
Traceback (most recent call last):
  File "main.py", line 9, in <module>
    do_stuff()
  File "main.py", line 5, in do_stuff
    raise Exception("test exception")
Exception: test exception
---------------------------------------------------------------------------------

Node `rust-sink` failed: exited with code 1

Stderr output:
---------------------------------------------------------------------------------
[...]
Error: Failed to read instrs from ./path/to/instrs.json
Caused by:
    No such file or directory (os error 2)
---------------------------------------------------------------------------------

I think that the current version:

019049ab-9473-7c5c-a4b4-d19513e219b5
Dataflow 019049ab-9473-7c5c-a4b4-d19513e219b5 failed:
Node `plot` failed: exited with code 1 with stderr output:
---------------------------------------------------------------------------------
[...]
Traceback (most recent call last):
  File "/home/peter/Documents/work/dora/examples/python-dataflow/plot.py", line 85, in <module>
    assert False, "False"
AssertionError: False
---------------------------------------------------------------------------------
Node `object_detection` failed: exited with code 1 with stderr output:
---------------------------------------------------------------------------------
[...]
Traceback (most recent call last):
  File "/home/peter/Documents/work/dora/examples/python-dataflow/object_detection.py", line 16, in <module>
    assert False, "False"
AssertionError: False
---------------------------------------------------------------------------------

Looks good.

I'm slightly confused, as it seems that we are showing all of stderr. Do we slice stderr somewhere?

If not, I would not put the [...] there if it refers to something like logs.

Yes, there is some slicing at the daemon level. But we don't need to show the [...] if the stderr buffer is not full (because no slicing happened then). I pushed 5bda333 to implement this.
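
To make the "buffer is full" rule concrete, here is a small sketch of a bounded line buffer. It is not the daemon's implementation; `StderrTail` and its methods are hypothetical, and the capacity is left to the caller.

```rust
use std::collections::VecDeque;

/// Hypothetical bounded buffer that keeps only the most recent stderr lines.
pub struct StderrTail {
    lines: VecDeque<String>,
    capacity: usize,
}

impl StderrTail {
    pub fn new(capacity: usize) -> Self {
        Self {
            lines: VecDeque::with_capacity(capacity),
            capacity,
        }
    }

    /// Appends a line, discarding the oldest one once the buffer is at capacity.
    pub fn push(&mut self, line: &str) {
        if self.capacity == 0 {
            return;
        }
        if self.lines.len() == self.capacity {
            self.lines.pop_front();
        }
        self.lines.push_back(line.to_owned());
    }

    /// A full buffer means that earlier lines may have been discarded.
    pub fn is_full(&self) -> bool {
        self.lines.len() >= self.capacity
    }

    /// Renders the buffer, adding the omission marker only when it is full.
    pub fn render(&self) -> String {
        let mut out = String::new();
        if self.is_full() {
            out.push_str("[...]\n");
        }
        for line in &self.lines {
            out.push_str(line);
            out.push('\n');
        }
        out
    }
}
```

Note that "full" is a heuristic: a buffer that holds exactly `capacity` lines with nothing dropped still gets the marker. Tracking a separate "dropped" flag would avoid that, at the cost of one more field.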

binaries/cli/src/attach.rs

binaries/cli/src/main.rs

haixuanTao · 2024-06-18T09:25:54Z

Could we take the time to write some tests for the error messages?

Co-authored-by: Haixuan Xavier Tao <[email protected]>

phil-opp · 2024-06-21T11:49:04Z

> Could we take the time to write some tests for the error messages?

Which part do you want to test?

phil-opp · 2024-06-21T15:44:35Z

> I think it is important to report errors when they happen in this PR. Currently, if a node fails but the others keep on running, the dataflow can run forever before an error message is emitted.

I prepared a PR with a basic log framework in #559. That PR enables daemons to send log messages to the CLI (via the coordinator). The CLI can then print daemon warnings/errors as they happen.

phil-opp · 2024-06-24T09:11:15Z

> In the past we used to have grace duration information, and now we only have SIGKILL.

> The warning is still printed by the daemon, it's just not visible if the daemon is spawned in background (e.g. using dora up) or on a remote machine. The proper solution for this would be to add a new exit cause for this and communicate it back to the CLI. I'll try to implement something like this tomorrow.

I pushed ee39436 to handle grace duration kills properly. edit: I shortened the error message in ece3f72.

The output now is (the ^C indicates a ctrl-c event):

❯ dora start dataflow.yml --attach
01904986-3d33-702e-9ee5-bb54a3a5c3be
^CDataflow 01904986-3d33-702e-9ee5-bb54a3a5c3be failed:

Node `rust-node` failed: node was killed by dora because it didn't react to a stop message in time (SIGKILL)
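
For illustration, a dedicated exit cause could be formatted roughly like this. Only `NodeErrorCause::Other { stderr }` appears in the diff quoted in this thread; the `GraceDuration` variant name, the `NodeFailure` struct, and the rule width are assumptions made for the sketch.

```rust
use std::fmt;

/// Simplified stand-in for the error cause type. The `GraceDuration`
/// variant name is an assumption, not taken from the PR.
#[derive(Debug, Clone, PartialEq)]
pub enum NodeErrorCause {
    /// The node was killed because it did not stop within the grace duration.
    GraceDuration,
    /// Any other failure, with the captured stderr tail.
    Other { stderr: String },
}

/// Hypothetical wrapper that pairs a node ID with its exit information.
#[derive(Debug, Clone, PartialEq)]
pub struct NodeFailure {
    pub node_id: String,
    /// Raw exit description, e.g. "exited with code 1".
    pub exit_status: String,
    pub cause: NodeErrorCause,
}

/// Width of the separator lines; an arbitrary choice for this sketch.
const RULE_WIDTH: usize = 80;

impl fmt::Display for NodeFailure {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Node `{}` failed: ", self.node_id)?;
        match &self.cause {
            // Explain the kill instead of showing only the raw signal.
            NodeErrorCause::GraceDuration => write!(
                f,
                "node was killed by dora because it didn't react to a stop message in time (SIGKILL)"
            ),
            NodeErrorCause::Other { stderr } if stderr.is_empty() => {
                write!(f, "{}", self.exit_status)
            }
            NodeErrorCause::Other { stderr } => {
                let line = "-".repeat(RULE_WIDTH);
                let stderr = stderr.trim_end();
                write!(
                    f,
                    "{} with stderr output:\n{line}\n{stderr}\n{line}",
                    self.exit_status
                )
            }
        }
    }
}
```

Keeping the cause as data rather than a preformatted string means the CLI can still choose colors or a different layout later.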

Print only first node error and report more metadata in dataflow results

3019eba

Allows us to mark certain node errors as cascading, which will deprioritize them when printed. Also updates the error printing code to only print the error that happened first.

phil-opp added 3 commits June 13, 2024 16:45

Mark node failures as cascading on init errors caused by other nodes

71ea44a

The init function returns an error if another node exited before initialization. In this case, we consider the subsequent error as a cascading error and skip it when printing the error message to the user.

Merge branch 'main' into better-errors

9425e4a

Fix: Pass cascading_errors as arg to ensure that it is always applied

4e69495

Before this commit, there were some cases where the returned `DataflowStatus` was ignored and the reported `cascading_errors` were never applied.

haixuanTao mentioned this pull request Jun 14, 2024

Add --quiet flag to daemon and coordinator #548

Merged

phil-opp force-pushed the better-errors branch 2 times, most recently from bbe1cca to 39f115c Compare June 15, 2024 12:42

phil-opp added 2 commits June 15, 2024 15:30

Print errors as formatted string

ba86563

Print error causes and include node that caused error

72bc9cd

phil-opp force-pushed the better-errors branch from 39f115c to 72bc9cd Compare June 15, 2024 13:30

phil-opp added 3 commits June 15, 2024 15:59

Remove old formatting methods

39164b9

Report last 10 stderr lines on node failure

b231df1

Print lines before and after stderr for easier readability

43b128f

phil-opp force-pushed the better-errors branch from 3fcaaac to 43b128f Compare June 17, 2024 09:46

phil-opp marked this pull request as ready for review June 17, 2024 09:47

haixuanTao reviewed Jun 17, 2024

phil-opp added 3 commits June 17, 2024 17:25

Print all non-cascading errors (instead of only first one)

94fecd6

Fix number of consequential errors and improve message

450002e

Shorten node_exited_before_subscribe error message

912338d

Merge branch 'main' into better-errors

5a210b6

haixuanTao reviewed Jun 18, 2024

haixuanTao mentioned this pull request Jun 19, 2024

Auto fetch logs and clean up error trace for better readibility #542

Closed

phil-opp and others added 4 commits June 21, 2024 12:29

Include dataflow UUID in error message

a0e8fff

Co-authored-by: Haixuan Xavier Tao <[email protected]>

Refactor: move dataflow result handling to a separate function

0423155

Fix: Move ellipsis marker to the beginning of the stderr output

252746a

Use `handle_dataflow_result` for attach too

4fb5a96

Slightly tweak error printing

361ea27

phil-opp mentioned this pull request Jun 21, 2024

Add basic log forwarding from daemon to CLI #559

Merged

Add space

24e6c9c

Add error cause for grace duration kills

ee39436

Shorten grace duration error message to one line

ece3f72

phil-opp mentioned this pull request Jun 24, 2024

Make dora start attach by default, add --detach to opt-out #561

Merged

Only add [...] omission marker if stderr buffer is full

5bda333

haixuanTao approved these changes Jun 24, 2024

phil-opp merged commit f283b4d into main Jun 24, 2024
18 checks passed

phil-opp deleted the better-errors branch June 24, 2024 16:12
