Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Better error handling/reporting #1670

Merged
merged 16 commits into from
Sep 9, 2022

Conversation

har7an
Copy link
Contributor

@har7an har7an commented Aug 17, 2022

As discussed in the dev meeting yesterday, here's a proposal on how to improve the error handling in zellij.
Here's what it currently does:

  • Perform small refactoring of zellij_utils::errors so we can share the contents (at least some) with code targeted at wasm
  • Add a generic error message for unhandled errors to be used in expect() statements
  • Add a bit of docs to zellij_utils::errors that invites users to join in on the fun and improve the error handling with us
  • Demonstrate what the "improved" handling may look like in the zellij_tile subcrate

Let me know what you think!

Maybe we can also link to this very commit from the docs in the errors submodule so users have an example what the "improvement" may look like?

@imsnif
Copy link
Member

imsnif commented Aug 17, 2022

Hey, cool to see this coming along.

Add a generic error message for unhandled errors to be used in expect() statements

I'm not sure I understand the purpose of this? I remember we talked about switching methods with unwraps to return a Result and then swapping the unwraps with question marks, no? Why this intermediate solution?

@har7an
Copy link
Contributor Author

har7an commented Aug 17, 2022

Why this intermediate solution?

Good question. That's because I expect that in some cases, like for modules in zellij-utils, certain functions are called from a whole lot of places. If we force the removal of all .unwrap() directly, we'll likely end up revising the whole code base - in a single PR. That's because we traverse the whole function call tree from the bottom upwards, until we reach some place in the code where we already return Result<T>.

The idea is that moving from unwrap() to expect(...) allows us to perform this change gradually in smaller chunks. So when we end up having a function that consists only of expect(...) we can then make that return a Result<T> and move the expect(...) to the parent, or drop it entirely.

This way we can contain the changes in error handling to a single module, without having to traverse up to all other modules on the way.

So you're not a fan of that idea?

@imsnif
Copy link
Member

imsnif commented Aug 17, 2022

Not particularly, I admit. I'm very on board with doing this incrementally, but I feel this will both create more work and give us a false sense of progress.

I do get what you're saying about not wanting to traverse the whole codebase - but I don't share your pessimism. I think if we find one particular function that is called in many different places, we can do a PR specifically for that function. Remember: we don't need to bubble a Result all the way up in order to make progress. We can change all the unwraps in one function, and then unwrap that particular function where it is called.

@har7an
Copy link
Contributor Author

har7an commented Aug 17, 2022

We can change all the unwraps in one function, and then unwrap that particular function where it is called.

But that's exactly what I said, no?
In the end whoever steps up to do this will decide for themselves where they draw the magic line to stop "bubbling up" the Result. You propose we pick out particular functions and work our way up, I propose we work module-wise.
The expect(errors::TODO_HANDLING) is mostly there to show that someone has at least recognized that some better error handling needs to be done here. As you said yesterday, if I do

if foo.is_some() {
    let bar = foo.unwrap();
}

the unwrap won't panic (although this particular example is of course very weird in the first place). I just wanted to show, as part of this PR, how one could initiate this process. I have now removed all panics from zellij_tile::shim, does this feel better to you?

Edit: Note that it doesn't make the panics go away, they just happen further up the codepath (somewhere in the plugins) now, with an expect(errors::TODO_HANDLING).

@imsnif
Copy link
Member

imsnif commented Aug 17, 2022

But that's exactly what I said, no? In the end whoever steps up to do this will decide for themselves where they draw the magic line to stop "bubbling up" the Result. You propose we pick out particular functions and work our way up, I propose we work module-wise. The expect(errors::TODO_HANDLING) is mostly there to show that someone has at least recognized that some better error handling needs to be done here.

My issue is mostly with the expect. I've been personally burned in this code-base by maintainers leaving something "for someone else to do later" and that someone never coming along (or being me when I absolutely would prefer to do something else :) ). So - do this as you wish, but as a personal request: please only leave unfinished stuff in the code-base if you yourself intend on seeing them through in the near future.

@har7an
Copy link
Contributor Author

har7an commented Aug 18, 2022

While testing with the debug plugins I had the impression that the tab bar has become very slow now, but that may just be the lack of wasm-opt...

To the best of my knowledge all plugins are now free from panics caused by unwrap or expect (This does not include tests). And they all have error handling now.
I added error context as best as I can, but I didn't quite understand the purpose of some of the functions involved, so I made a guess at places.

While provoking an error in the status bar I noticed that the plugins now sort of die gracefully - They just don't render any text at all. It's better than crashing (which is what happened previously), but I'm not sure whether that's ideal. Also the plugins recover from errors now. I tested this by causing an error when displaying the "LOCKED" indication in the second line of the status bar: When I enter "LOCKED" mode the status bar stays blank, when I leave "LOCKED" mode it recovers.

I'm now thinking of a way to display the error to the user. Given that Plugin -> Server communication is somewhat limited, I guess the best option is to print the error to the bit of "canvas" we have and maybe log it, too. Or do you have a better suggestion?

@har7an
Copy link
Contributor Author

har7an commented Aug 18, 2022

Oh and on a related note: Who calls these functions defined by register_plugin! at all? So where do these errors go? Maybe we can handle them there? That would be my favorite solution.

@har7an
Copy link
Contributor Author

har7an commented Aug 18, 2022

I've found a solution now, here's what it looks like in action:

Screenshot from 2022-08-18 13-32-00

Of course it can only print error information on lines that do exist, so if the tab bar dies we will only see the last error in the chain. However, it will also log the error:

DEBUG  |/var/home/ahartmann/repos| 2022-08-18 13:34:59.456 [id: 1     ] ERROR: status bar: Failed to generate second line of output 
DEBUG  |/var/home/ahartmann/repos| 2022-08-18 13:34:59.456 [id: 1     ] because: status bar: Failed to assemble second line 
DEBUG  |/var/home/ahartmann/repos| 2022-08-18 13:34:59.456 [id: 1     ] because: This is a provoked error

And in the case that the plugin does panic (while loading, or by failing to borrow_mut from the RefCell it is stored in, or by failing to serialize the object from stdin) it will print this pretty information to the log:

DEBUG  |/var/home/ahartmann/repos| 2022-08-18 13:22:30.934 [id: 1     ] ERROR: status bar: Failed to generate second line of output 
DEBUG  |/var/home/ahartmann/repos| 2022-08-18 13:22:30.934 [id: 1     ] because: status bar: Failed to assemble second line 
DEBUG  |/var/home/ahartmann/repos| 2022-08-18 13:22:30.934 [id: 1     ] because: This is a provoked error 
DEBUG  |/var/home/ahartmann/repos| 2022-08-18 13:22:30.934 [id: 1     ] thread 'main' panicked at 'Error { 
DEBUG  |/var/home/ahartmann/repos| 2022-08-18 13:22:30.934 [id: 1     ]     context: "status bar: Failed to generate second line of output", 
DEBUG  |/var/home/ahartmann/repos| 2022-08-18 13:22:30.934 [id: 1     ]     source: Error { 
DEBUG  |/var/home/ahartmann/repos| 2022-08-18 13:22:30.934 [id: 1     ]         context: "status bar: Failed to assemble second line", 
DEBUG  |/var/home/ahartmann/repos| 2022-08-18 13:22:30.934 [id: 1     ]         source: "This is a provoked error", 
DEBUG  |/var/home/ahartmann/repos| 2022-08-18 13:22:30.934 [id: 1     ]     }, 
DEBUG  |/var/home/ahartmann/repos| 2022-08-18 13:22:30.934 [id: 1     ] }', /var/home/ahartmann/repos/other/zellij/zellij-tile/src/lib.rs:43:13 
DEBUG  |/var/home/ahartmann/repos| 2022-08-18 13:22:30.934 [id: 1     ] note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace 
ERROR  |zellij_utils::errors::not| 2022-08-18 13:22:30.936 [wasm      ] [zellij-utils/src/errors.rs:330]: Panic occured:
             thread: wasm
             location: At zellij-server/src/wasm_vm.rs:159:26
             message: called `Result::unwrap()` on an `Err` value: RuntimeError { source: Trap(UnreachableCodeReached), wasm_trace: [FrameInfo { module_name: "<module>", func_index: 8819, function_name: Some("__rust_start_panic"), func_start: SourceLoc(2475159), instr: SourceLoc(2475160) }, FrameInfo { module_name: "<module>", func_index: 8811, function_name: Some("rust_panic"), func_start: SourceLoc(2474419), instr: SourceLoc(2474458) }, FrameInfo { module_name: "<module>", func_index: 8810, function_name: Some("_ZN3std9panicking20rust_panic_with_hook17h7811fbe84dbe0b81E"), func_start: SourceLoc(2473197), instr: SourceLoc(2474327) }, FrameInfo { module_name: "<module>", func_index: 8797, function_name: Some("_ZN3std9panicking19begin_panic_handler28_$u7b$$u7b$closure$u7d$$u7d$17hb3057833ba946b83E"), func_start: SourceLoc(2470625), instr: SourceLoc(2470808) }, FrameInfo { module_name: "<module>", func_index: 8796, function_name: Some("_ZN3std10sys_common9backtrace26__rust_end_short_backtrace17ha1f5dea80924017cE"), func_start: SourceLoc(2470599), instr: SourceLoc(2470615) }, FrameInfo { module_name: "<module>", func_index: 8805, function_name: Some("rust_begin_unwind"), func_start: SourceLoc(2472497), instr: SourceLoc(2472578) }, FrameInfo { module_name: "<module>", func_index: 8929, function_name: Some("_ZN4core9panicking9panic_fmt17hb03a34d2c425127eE"), func_start: SourceLoc(2522485), instr: SourceLoc(2522553) }, FrameInfo { module_name: "<module>", func_index: 997, function_name: Some("_ZN11zellij_tile11catch_fatal17hb94271f5a1a0f080E"), func_start: SourceLoc(301632), instr: SourceLoc(302336) }, FrameInfo { module_name: "<module>", func_index: 1986, function_name: Some("_ZN10status_bar6render28_$u7b$$u7b$closure$u7d$$u7d$17he3b6e586989a6be2E"), func_start: SourceLoc(815936), instr: SourceLoc(816185) }, FrameInfo { module_name: "<module>", func_index: 2604, function_name: Some("_ZN3std6thread5local17LocalKey$LT$T$GT$8try_with17hf8213c93496ebb15E"), func_start: SourceLoc(1026012), instr: SourceLoc(1026197) }, FrameInfo { module_name: "<module>", func_index: 2603, function_name: Some("_ZN3std6thread5local17LocalKey$LT$T$GT$4with17ha35436331acaa492E"), func_start: SourceLoc(1025870), instr: SourceLoc(1025927) }, FrameInfo { module_name: "<module>", func_index: 673, function_name: Some("render"), func_start: SourceLoc(195097), instr: SourceLoc(195206) }, FrameInfo { module_name: "<module>", func_index: 9044, function_name: Some("render.command_export"), func_start: SourceLoc(2550478), instr: SourceLoc(2550485) }], native_trace:    0: <unknown>
   1: <unknown>
   2: <unknown>
   3: <unknown>
   4: <unknown>
   5: <unknown>
   6: <unknown>
   7: <unknown>
   8: <unknown>
   9: <unknown>
  10: <unknown>
  11: <unknown>
  12: <unknown>
  13: <unknown>
  14: <unknown>
  15: <unknown>
 } 
ERROR  |zellij_utils::errors::not| 2022-08-18 13:22:30.936 [screen    ] [zellij-utils/src/errors.rs:330]: Panic occured:
             thread: screen
             location: At zellij-server/src/panes/plugin_pane.rs:168:42
             message: called `Result::unwrap()` on an `Err` value: RecvError 
ERROR  |zellij_utils::errors::not| 2022-08-18 13:22:30.937 [async-std/runti] [zellij-utils/src/errors.rs:330]: Panic occured:
             thread: async-std/runtime
             location: At zellij-server/src/terminal_bytes.rs:122:14
             message: called `Result::unwrap()` on an `Err` value: "SendError(..)"

What do you think?

@har7an har7an force-pushed the feature/better-error-reporting branch from 7bd5073 to d095e22 Compare August 18, 2022 13:19
Comment on lines 20 to 51
/// Catch non-fatal errors and print their content to the plugins stdout and stderr. This way, the
/// user sees the error and can even report it, but it will not crash the application. Hence, this
/// works only on the empty type `()`.
pub fn catch(ret: anyhow::Result<()>) {
match ret {
Ok(_) => (),
Err(err) => {
println!("ERROR: {}", err);
eprintln!("ERROR: {}", &err);
err.chain().skip(1).for_each(|cause| {
let msg = format!("because: {cause}");
println!("{}", msg);
eprintln!("{}", &msg);
});
},
}
}

/// Return the inner value if `Ok(_)` and panic the application otherwise. Before panicking, the
/// error is written to the logfile.
pub fn catch_fatal<T>(ret: anyhow::Result<T>) -> T {
match ret {
Ok(val) => val,
Err(err) => {
eprintln!("ERROR: {}", &err);
err.chain().skip(1).for_each(|cause| {
let msg = format!("because: {cause}");
eprintln!("{}", &msg);
});
panic!("{:#?}", err);
},
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tlinford Maybe you have an idea how this bit of code can be written prettier? I wanted to put this into a trait and implement that on std::result::Result but I failed miserably. Makes the code below pretty unreadable, too...

@imsnif
Copy link
Member

imsnif commented Aug 18, 2022

@har7an - what do you think about (also) sending the error to the log? I think with plugins if you write to STDERR (eg. with eprintln) it'll do the trick.

@har7an har7an force-pushed the feature/better-error-reporting branch 3 times, most recently from 81a2ee9 to f9d58dc Compare August 19, 2022 09:01
@har7an
Copy link
Contributor Author

har7an commented Aug 19, 2022

@imsnif You mean the std::error::Error? I could, of course, but in the output above you see that the error doesn't contain any information beyond what is already printed. Only I think that the printed version looks much nicer, so I'd prefer to stick with that if I'm honest.

Shifted the code around a little and managed to make the whole catch/catch_fatal business much nicer to use and look at. I think the code is very usable the way it is now. Actually I like my LoggableError so much I think I'll publish it in a separate crate soon. :)

I think this is a very solid foundation to build the error handling upon now. In theory we could slap a to_log() at every Result type we have in the code now, provided someone comes and implements the trait for the particular error types.

I'll leave the PR hanging around for another day or two to see if anything else that could be improved comes to my mind. I think it's a big step in the right direction, even if only because an error while rendering plugins won't crash the whole application any more and all plugin errors are now logged, even if zellij crashes.

@imsnif
Copy link
Member

imsnif commented Aug 21, 2022

@har7an - briefly going over this, I notice that you also change the plugin API. Note that this is a breaking change for all plugins (the ones we know about and the ones we don't :) ), will require a documentation change on the docs website, and will likely require at least some amount of continuing support to people who use custom plugins and want to upgrade to the latest version.

@imsnif
Copy link
Member

imsnif commented Aug 21, 2022

Also, does this affect non-rust plugins? We've got a budding Zig ecosystem too.

@har7an har7an force-pushed the feature/better-error-reporting branch from e3bb4d5 to 8cae215 Compare September 4, 2022 13:13
@har7an har7an temporarily deployed to cachix September 4, 2022 13:14 Inactive
and move everything that can't run in wasm into a separate submodule.
The module is now structured such that all code incompatible with wasm
targets lives in its own submodule that isn't included when compiling
for wasm targets.
@har7an har7an force-pushed the feature/better-error-reporting branch from 94817d4 to 9ccc72e Compare September 4, 2022 15:21
@har7an har7an temporarily deployed to cachix September 4, 2022 15:21 Inactive
@har7an
Copy link
Contributor Author

har7an commented Sep 4, 2022

Good hint about the plugin API change, thanks!

I changed the plugin API back to what it was before, but I kept the modifications in the plugins and am now handling the errors inside the plugins, too. This makes handling the errors in the plugins completely optional, but still allows us to communicate to the user when things went wrong instead of directly crashing zellij (at least in some cases, and for the core plugins only, of course).

As discussed recently @imsnif, I also went ahead and updated the public API of zellij-utils/src/screen.rs to return Result types in all fallible functions. I kept everything that's infallible as is. During the process I also rewrote some of the internal functions (non-public) to return Result types where they call public functions returning a Result now.

All of the errors are now extended by a context that explains in what context (i.e. usually in what function) the error was encountered. This gives us a pretty chain of errors from the source error (which can be anything), to all intermediary functions that the error went by, up until it is finally logged and unwrapped. In the case of screen, the logging and unwrapping is handled by the call to fatal() in zellij-utils/src/lib.rs where screen_thread_main is called.

I'll update the module docs in utils/errors soon with a few examples on how to apply the error handling so people get an idea of how to use it.

Overall I think it's a great (and convenient) addition to zellij. I'll do a few more tests in the next days to see how it works out in practice. More on this soon...

@har7an har7an temporarily deployed to cachix September 5, 2022 07:27 Inactive
that informs the reader about the endeavour to improve error handling
throughout the zellij code base.
now that the panic calls have been removed.
that allow for easy/concise logging of `anyhow::Result` types and
panicking the application when they are fatal or non-fatal.
that applications can import to conveniently access the error handling
functionality part of the module. Re-exports some parts and macros from
anyhow and the `LoggableError` and `FatalError` traits.
and make all fallible functions from the public API return a `Result`.
Append error contexts in all functions that can come across error types
to allow tracing where an error originates and what lead there.
and make them `fatal`. This will log the errors first, before unwrapping
on the error type and panicking the application.
and unwrap on the `Result` types introduced from the new error handling.
in calls to `fatal`, so we keep track of the location where the panic
really originates. Otherwise, any call to `fatal` will show the code in
`errors` as source, which of course isn't true.
Also change the error formatting and do not call `to_log` for fatal
errors anymore, because the panic is already logged and contains much
more information.
@har7an har7an force-pushed the feature/better-error-reporting branch from aa7bfeb to 1a136f8 Compare September 7, 2022 10:36
@har7an har7an temporarily deployed to cachix September 7, 2022 10:36 Inactive
@har7an
Copy link
Contributor Author

har7an commented Sep 7, 2022

For the record:

I tried implementing sending errors that do not change the servers state to the clients to crash only the offending client instead of the whole application in certain situations. To do that I need something that implements ServerOsApi, which in the case of screen is screen.bus.os_input. Unfortunately, that is a trait object of type Box<dyn ServerOsApi> and no matter what I try and do, I cannot find a way to pass that to other functions that wrap the sending bit. I always get errors from the borrow checker about the object having been moved in a prior iteration, and I can't clone it either. So I'll leave that bit out for now and retry in a separate PR.

Note to self: In the next meeting I'd like to discuss the usage of trait objects vs. generics.

explain some error handling facilities and the motivation behind using
them.
@har7an har7an temporarily deployed to cachix September 7, 2022 14:24 Inactive
functions that are part of the public API.
@har7an har7an temporarily deployed to cachix September 9, 2022 12:56 Inactive
@har7an har7an merged commit 99e2bef into zellij-org:main Sep 9, 2022
har7an added a commit to har7an/zellij that referenced this pull request Sep 21, 2022
that explains how we plan to change error handling in zellij and invites
new contributors to join the fun. Details the currently existent error
handling capabilities and gives a bunch of examples taken from zellij-org#1670.
har7an added a commit to har7an/zellij that referenced this pull request Sep 27, 2022
that explains how we plan to change error handling in zellij and invites
new contributors to join the fun. Details the currently existent error
handling capabilities and gives a bunch of examples taken from zellij-org#1670.
har7an added a commit that referenced this pull request Sep 27, 2022
* docs: Add ERROR_HANDLING

that explains how we plan to change error handling in zellij and invites
new contributors to join the fun. Details the currently existent error
handling capabilities and gives a bunch of examples taken from #1670.

* utils/errors: Shorten docblock

by moving previous content under "Help Wanted" to the new
`docs/ERROR_HANDLING` document and link to the document instead.
@har7an har7an deleted the feature/better-error-reporting branch October 23, 2022 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants