-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Print only first node error and report more metadata in dataflow results #552
Conversation
Allows us to mark certain node errors as cascading, which will deprioritize them when printed. Also updates the error printing code to only print the error that happened first.
How is this related to #542 ? Should we merge 542 first? And if so should i make change within 542? |
This PR isn't finished yet, but it uses a different approach than #542. The idea is that we report all node errors to the CLI in their original form and include some additional metadata (e.g. timestamp). This way, we can do the printing in the CLI based on all the available information. So the CLI could print a short summary by default and provide options to print more details. Also, we can then format the errors in a nicer way because the CLI is aware of the terminal size and can use colors and text decorations (e.g. bold text). |
The init function returns an error if another node exited before initialization. In this case, we consider the subsequent error as a cascading error and skip it when printing the error message to the user.
Before this commit, there were some cases where the returned `DataflowStatus` was ignored and the reported `cascading_errors` were never applied.
bbe1cca
to
39f115c
Compare
@haixuanTao I updated the PR description with some example output. Please let me know what you think! As follow-up changes we can implement your suggestion from #548 (comment) and maybe some nicer terminal formatting (colors, text decoration, proper lines, etc). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm slightly confused as this only print out one error, I would expect to print all of them if they happen at the same time.
Also, I don't like that the default dora start
does not return anything in case of error. Can we make the attach the default behaviour because otherwise people starting with dora are going to be confused.
It also seems to be no way to see other errors reason whether they are similar to the first one or not related at all. 019026ac-8bd8-7b67-bec4-62b0a9f3fca1
^Cdataflow failed: Node `reachy-vision-client` failed: exited because of signal SIGKILL
There are 3 more errors. Check the `out/019026ac-8bd8-7b67-bec4-62b0a9f3fca1` folder for full details. For example in this example, I have probably an error about the grace duration on the first node, but there is no way for me to know what was the others errors. In the past we used to have a grace duration information, and now we onty have |
Sure, we can print them all. The error filtering and formatting now happens on the CLI side, so we can be as verbose as we like. We probably still want to skip cascading errors (i.e. "node failed because another node exited before init"), right?
Fine with me! We could add a
The warning is still printed by the daemon, it's just not visible if the daemon is spawned in background (e.g. using |
I pushed 94fecd6 to print all node errors that are not consequential errors (e.g. "node X exited too early"). |
I think it is important to have errors when they happen in this PR as currently if a node fails but the others keep on running. The dataflow can run forever before an error message is emitted. The only way to know there is an error is to use ctrl+c |
libraries/core/src/topics.rs
Outdated
NodeErrorCause::Other { stderr } if stderr.is_empty() => {} | ||
NodeErrorCause::Other { stderr } => { | ||
let line: &str = "---------------------------------------------------------------------------------"; | ||
write!(f, "\n\nStderr output:\n{line}\n{stderr}\n[...]\n{line}\n")? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The [...] seems to point out that there is more to the error message that there is.
Could we remove it?
Also:
Can we make:
Node `idefics2` failed: exited with code: 1
Stderr output:
Be one line instead of 3 so that it looks like:
- Node 018ffe7b-39d7-725a-a812-ac9c220a11c8/plot exited with code 1 with error (stderr):
Just as in #542
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The [...] seems to point out that there is more to the error message that there is.
Could we remove it?
Parts of the the error message might be cut off. I thought that it might be less confusing with a [...]
marker in that case (but the marker should be at the beggining, not the end). For example, consider the following error message: "some\nmultiline\nerror\nmessage\nwith\nlots\nof\nlines\nand\nstuff\nand\nthings"
. It appears like this in the terminal:
Stderr output:
---------------------------------------------------------------------------------
[...]
message
with
lots
of
lines
and
stuff
and
things
---------------------------------------------------------------------------------
Without the leading [...]
it's not obvious that relevant parts of the error message might be omitted.
[...] Node 018ffe7b-39d7-725a-a812-ac9c220a11c8/plot exited [...]
I don't think that we need to repeat the dataflow UUID again. It's always the same and it just makes the node ID more difficult to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make:
Node `idefics2` failed: exited with code: 1 Stderr output:
Be one line instead of 3
You mean like this:
- Node `rust-node` failed: exited with code 1 with stderr output:
---------------------------------------------------------------------------------
[...]
some
multiline
error
message
---------------------------------------------------------------------------------
I think it's easier to read without the leading -
before Node
, provided that we keep the lines before and after the stderr output.
I like the lines around stderr because it makes things easier to read when multiple nodes fail. For example, compare:
Dataflow 01903a8d-e26e-74f7-8cff-5edc90c1de0b failed:
- Node `rust-node` failed: exited with code 1 with stderr output:
[...]
Traceback (most recent call last):
File "main.py", line 9, in <module>
do_stuff()
File "main.py", line 5, in do_stuff
raise Exception("test exception")
Exception: test exception
- Node `rust-sink` failed: exited with code 1 with stderr output:
[...]
Error: Failed to read instrs from ./path/to/instrs.json
Caused by:
No such file or directory (os error 2)
with:
Dataflow 01903a8f-e07f-7136-aad3-214da35d7158 failed:
Node `rust-node` failed: exited with code 1 with stderr output:
---------------------------------------------------------------------------------
[...]
Traceback (most recent call last):
File "main.py", line 9, in <module>
do_stuff()
File "main.py", line 5, in do_stuff
raise Exception("test exception")
Exception: test exception
---------------------------------------------------------------------------------
Node `rust-sink` failed: exited with code 1 with stderr output:
---------------------------------------------------------------------------------
[...]
Error: Failed to read instrs from ./path/to/instrs.json
Caused by:
No such file or directory (os error 2)
---------------------------------------------------------------------------------
To me, the second example seems much more readable.
I like the variant with more spacing even more:
Dataflow 01903a92-8c73-770e-b2ca-840dd8c60d0b failed:
Node `rust-node` failed: exited with code 1
Stderr output:
---------------------------------------------------------------------------------
[...]
Traceback (most recent call last):
File "main.py", line 9, in <module>
do_stuff()
File "main.py", line 5, in do_stuff
raise Exception("test exception")
Exception: test exception
---------------------------------------------------------------------------------
Node `rust-sink` failed: exited with code 1
Stderr output:
---------------------------------------------------------------------------------
[...]
Error: Failed to read instrs from ./path/to/instrs.json
Caused by:
No such file or directory (os error 2)
---------------------------------------------------------------------------------
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the current version:
019049ab-9473-7c5c-a4b4-d19513e219b5
Dataflow 019049ab-9473-7c5c-a4b4-d19513e219b5 failed:
Node `plot` failed: exited with code 1 with stderr output:
---------------------------------------------------------------------------------
[...]
Traceback (most recent call last):
File "/home/peter/Documents/work/dora/examples/python-dataflow/plot.py", line 85, in <module>
assert False, "False"
AssertionError: False
---------------------------------------------------------------------------------
Node `object_detection` failed: exited with code 1 with stderr output:
---------------------------------------------------------------------------------
[...]
Traceback (most recent call last):
File "/home/peter/Documents/work/dora/examples/python-dataflow/object_detection.py", line 16, in <module>
assert False, "False"
AssertionError: False
---------------------------------------------------------------------------------
Looks good.
I'm slightly confused as, it seems that we are showing all stderr. Do we have a slicing of stderr
somewhere?
If not, i would not put the [...]
if that is referencing something like logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there is some slicing at the daemon level. But we don't need to show the [...]
if the stderr buffer is not full (because no slicing happened then). I pushed 5bda333 to implement this.
Could we take the times to write some test of the error message? |
Which part do you want to test? |
I prepared a PR with a basic log framework in #559. This PR enables daemons to send log messages to the CLI (via the coordinator). The CLI can then print daemon warnings/error as they happen. |
I pushed ee39436 to handle grace duration kills properly. edit: I shortened the error message in ece3f72. The output now is (the
|
Allows us to mark certain node errors as cascading, which will deprioritize them when printed.
Also updates the error printing code to only print the error that happened first.
Example Output:
Node
rust-sink
exits with a success exit status before initializing dora connection:Node
rust-sink
errored before initialization:TODO:
--detach
arg todora start
and make--attach
the defaultdora start
attach by default, add--detach
to opt-out #561