Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Print only first node error and report more metadata in dataflow results #552

Merged
merged 22 commits into from
Jun 24, 2024

Conversation

phil-opp
Copy link
Collaborator

@phil-opp phil-opp commented Jun 12, 2024

Allows us to mark certain node errors as cascading, which will deprioritize them when printed.

Also updates the error printing code to only print the error that happened first.

Example Output:

Node rust-sink exits with a success exit status before initializing dora connection:

❯ dora start dataflow.yml --attach
01901beb-7fd7-7183-8584-8e6284c84031
dataflow failed: Node `dora-record` failed: exited with code: 1

This error occurred because node `rust-sink` exited before connecting to dora.

There are 2 more errors. Check the `out/01901beb-7fd7-7183-8584-8e6284c84031` folder for full details.

Node rust-sink errored before initialization:

❯ dora start dataflow.yml --attach
01902594-d2e4-7b4f-a2d5-0d3e059a52fa
dataflow failed: Node `rust-sink` failed: exited with code: 1

Stderr output:
---------------------------------------------------------------------------------
Error: foo

Location:
    examples/rust-dataflow/sink/src/main.rs:5:5

[...]
---------------------------------------------------------------------------------


There are 3 more errors. Check the `out/01902594-d2e4-7b4f-a2d5-0d3e059a52fa` folder for full details.

TODO:

  • mark node failures as cascading when initialization fails because other node exited too early
  • testing
  • load stderr
  • Only skip cascading errors
  • Add error cause for grace period kills
  • Separate PR: Add --detach arg to dora start and make --attach the default

Allows us to mark certain node errors as cascading, which will deprioritize them when printed.

Also updates the error printing code to only print the error that happened first.
@haixuanTao
Copy link
Collaborator

How is this related to #542 ?

Should we merge 542 first?

And if so should i make change within 542?

@phil-opp
Copy link
Collaborator Author

This PR isn't finished yet, but it uses a different approach than #542. The idea is that we report all node errors to the CLI in their original form and include some additional metadata (e.g. timestamp). This way, we can do the printing in the CLI based on all the available information. So the CLI could print a short summary by default and provide options to print more details. Also, we can then format the errors in a nicer way because the CLI is aware of the terminal size and can use colors and text decorations (e.g. bold text).

phil-opp added 3 commits June 13, 2024 16:45
The init function returns an error if another node exited before initialization. In this case, we consider the subsequent error as a cascading error and skip it when printing the error message to the user.
Before this commit, there were some cases where the returned `DataflowStatus` was ignored and the reported `cascading_errors` were never applied.
@phil-opp phil-opp force-pushed the better-errors branch 2 times, most recently from bbe1cca to 39f115c Compare June 15, 2024 12:42
@phil-opp phil-opp marked this pull request as ready for review June 17, 2024 09:47
@phil-opp
Copy link
Collaborator Author

@haixuanTao I updated the PR description with some example output. Please let me know what you think!

As follow-up changes we can implement your suggestion from #548 (comment) and maybe some nicer terminal formatting (colors, text decoration, proper lines, etc).

Copy link
Collaborator

@haixuanTao haixuanTao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm slightly confused as this only print out one error, I would expect to print all of them if they happen at the same time.

Also, I don't like that the default dora start does not return anything in case of error. Can we make the attach the default behaviour because otherwise people starting with dora are going to be confused.

libraries/core/src/topics.rs Outdated Show resolved Hide resolved
binaries/daemon/src/pending.rs Outdated Show resolved Hide resolved
@haixuanTao
Copy link
Collaborator

It also seems to be no way to see other errors reason whether they are similar to the first one or not related at all.

019026ac-8bd8-7b67-bec4-62b0a9f3fca1
^Cdataflow failed: Node `reachy-vision-client` failed: exited because of signal SIGKILL

There are 3 more errors. Check the `out/019026ac-8bd8-7b67-bec4-62b0a9f3fca1` folder for full details.

For example in this example, I have probably an error about the grace duration on the first node, but there is no way for me to know what was the others errors.

In the past we used to have a grace duration information, and now we onty have SIGKILL.

@phil-opp
Copy link
Collaborator Author

I'm slightly confused as this only print out one error, I would expect to print all of them if they happen at the same time.

Sure, we can print them all. The error filtering and formatting now happens on the CLI side, so we can be as verbose as we like. We probably still want to skip cascading errors (i.e. "node failed because another node exited before init"), right?

Also, I don't like that the default dora start does not return anything in case of error. Can we make the attach the default behaviour because otherwise people starting with dora are going to be confused.

Fine with me! We could add a --detach argument to run it in the background and make dora start equivalent to dora start --attach. I'll prepare a separate PR for that.

In the past we used to have a grace duration information, and now we onty have SIGKILL.

The warning is still printed by the daemon, it's just not visible if the daemon is spawned in background (e.g. using dora up) or on a remote machine. The proper solution for this would be to add a new exit cause for this and communicate it back to the CLI. I'll try to implement something like this tomorrow.

@phil-opp
Copy link
Collaborator Author

I'm slightly confused as this only print out one error, I would expect to print all of them if they happen at the same time.

I pushed 94fecd6 to print all node errors that are not consequential errors (e.g. "node X exited too early").

@haixuanTao
Copy link
Collaborator

I think it is important to have errors when they happen in this PR as currently if a node fails but the others keep on running. The dataflow can run forever before an error message is emitted.

The only way to know there is an error is to use ctrl+c

NodeErrorCause::Other { stderr } if stderr.is_empty() => {}
NodeErrorCause::Other { stderr } => {
let line: &str = "---------------------------------------------------------------------------------";
write!(f, "\n\nStderr output:\n{line}\n{stderr}\n[...]\n{line}\n")?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The [...] seems to point out that there is more to the error message that there is.

Could we remove it?

Also:

Can we make:

Node `idefics2` failed: exited with code: 1

Stderr output:

Be one line instead of 3 so that it looks like:

- Node 018ffe7b-39d7-725a-a812-ac9c220a11c8/plot exited with code 1 with error (stderr):

Just as in #542

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The [...] seems to point out that there is more to the error message that there is.

Could we remove it?

Parts of the the error message might be cut off. I thought that it might be less confusing with a [...] marker in that case (but the marker should be at the beggining, not the end). For example, consider the following error message: "some\nmultiline\nerror\nmessage\nwith\nlots\nof\nlines\nand\nstuff\nand\nthings". It appears like this in the terminal:

Stderr output:
---------------------------------------------------------------------------------
[...]
message
with
lots
of
lines
and
stuff
and
things

---------------------------------------------------------------------------------

Without the leading [...] it's not obvious that relevant parts of the error message might be omitted.

[...] Node 018ffe7b-39d7-725a-a812-ac9c220a11c8/plot exited [...]

I don't think that we need to repeat the dataflow UUID again. It's always the same and it just makes the node ID more difficult to read.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make:

Node `idefics2` failed: exited with code: 1

Stderr output:

Be one line instead of 3

You mean like this:

- Node `rust-node` failed: exited with code 1 with stderr output:
---------------------------------------------------------------------------------
[...]
some
multiline
error
message
---------------------------------------------------------------------------------

I think it's easier to read without the leading - before Node, provided that we keep the lines before and after the stderr output.

I like the lines around stderr because it makes things easier to read when multiple nodes fail. For example, compare:

Dataflow 01903a8d-e26e-74f7-8cff-5edc90c1de0b failed:
- Node `rust-node` failed: exited with code 1 with stderr output:
[...]
Traceback (most recent call last):
  File "main.py", line 9, in <module>
    do_stuff()
  File "main.py", line 5, in do_stuff
    raise Exception("test exception")
Exception: test exception

- Node `rust-sink` failed: exited with code 1 with stderr output:
[...]
Error: Failed to read instrs from ./path/to/instrs.json

Caused by:
No such file or directory (os error 2)

with:

Dataflow 01903a8f-e07f-7136-aad3-214da35d7158 failed:
Node `rust-node` failed: exited with code 1 with stderr output:
---------------------------------------------------------------------------------
[...]
Traceback (most recent call last):
  File "main.py", line 9, in <module>
    do_stuff()
  File "main.py", line 5, in do_stuff
    raise Exception("test exception")
Exception: test exception
---------------------------------------------------------------------------------

Node `rust-sink` failed: exited with code 1 with stderr output:
---------------------------------------------------------------------------------
[...]
Error: Failed to read instrs from ./path/to/instrs.json

Caused by:
No such file or directory (os error 2)
---------------------------------------------------------------------------------

To me, the second example seems much more readable.

I like the variant with more spacing even more:

Dataflow 01903a92-8c73-770e-b2ca-840dd8c60d0b failed:

Node `rust-node` failed: exited with code 1

Stderr output:
---------------------------------------------------------------------------------
[...]
Traceback (most recent call last):
  File "main.py", line 9, in <module>
    do_stuff()
  File "main.py", line 5, in do_stuff
    raise Exception("test exception")
Exception: test exception
---------------------------------------------------------------------------------

Node `rust-sink` failed: exited with code 1

Stderr output:
---------------------------------------------------------------------------------
[...]
Error: Failed to read instrs from ./path/to/instrs.json

Caused by:
No such file or directory (os error 2)
---------------------------------------------------------------------------------

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the current version:

019049ab-9473-7c5c-a4b4-d19513e219b5
Dataflow 019049ab-9473-7c5c-a4b4-d19513e219b5 failed:

Node `plot` failed: exited with code 1 with stderr output:
---------------------------------------------------------------------------------
[...]
Traceback (most recent call last):
  File "/home/peter/Documents/work/dora/examples/python-dataflow/plot.py", line 85, in <module>
    assert False, "False"
AssertionError: False
---------------------------------------------------------------------------------

Node `object_detection` failed: exited with code 1 with stderr output:
---------------------------------------------------------------------------------
[...]
Traceback (most recent call last):
  File "/home/peter/Documents/work/dora/examples/python-dataflow/object_detection.py", line 16, in <module>
    assert False, "False"
AssertionError: False
---------------------------------------------------------------------------------

Looks good.

I'm slightly confused as, it seems that we are showing all stderr. Do we have a slicing of stderr somewhere?

If not, i would not put the [...] if that is referencing something like logs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is some slicing at the daemon level. But we don't need to show the [...] if the stderr buffer is not full (because no slicing happened then). I pushed 5bda333 to implement this.

binaries/cli/src/attach.rs Outdated Show resolved Hide resolved
binaries/cli/src/main.rs Outdated Show resolved Hide resolved
@haixuanTao
Copy link
Collaborator

Could we take the times to write some test of the error message?

@phil-opp
Copy link
Collaborator Author

Could we take the times to write some test of the error message?

Which part do you want to test?

@phil-opp
Copy link
Collaborator Author

I think it is important to have errors when they happen in this PR as currently if a node fails but the others keep on running. The dataflow can run forever before an error message is emitted.

I prepared a PR with a basic log framework in #559. This PR enables daemons to send log messages to the CLI (via the coordinator). The CLI can then print daemon warnings/error as they happen.

@phil-opp
Copy link
Collaborator Author

phil-opp commented Jun 24, 2024

In the past we used to have a grace duration information, and now we onty have SIGKILL.

The warning is still printed by the daemon, it's just not visible if the daemon is spawned in background (e.g. using dora up) or on a remote machine. The proper solution for this would be to add a new exit cause for this and communicate it back to the CLI. I'll try to implement something like this tomorrow.

I pushed ee39436 to handle grace duration kills properly. edit: I shortened the error message in ece3f72.

The output now is (the ^C indicates a ctrl-c event):

❯ dora start dataflow.yml --attach
01904986-3d33-702e-9ee5-bb54a3a5c3be
^CDataflow 01904986-3d33-702e-9ee5-bb54a3a5c3be failed:

Node `rust-node` failed: node was killed by dora because it didn't react to a stop message in time (SIGKILL)

@phil-opp phil-opp merged commit f283b4d into main Jun 24, 2024
18 checks passed
@phil-opp phil-opp deleted the better-errors branch June 24, 2024 16:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants