-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhance TaskContext and add task failure root cause #3410
Comments
I agree the wrapping is confusing and I think in general the wrapping of errors is not valuable. We see similar wrapping obfuscation in IOx as well. The datafusion code already tries to unwrap an ArrowError here (when converting from DataFusionError --> ArrowError) I wonder if we can do something similar in Ballista? Or maybe write some "unwrapper" that would strip out the useless wrapping errors before display. |
Thanks Alamb, the context here is not just for displaying. Recently we are working on error handling and error recovering which is useful for distributed execution. And we hope to do error recovering for some special error case and we need to know the root cause of the error. If an error is wrapped too much, it will be very difficult to make decision of which error to recover and others not. |
That is a neat usecase and it makes sense. Arrow and DataFusion don't have very specific errors (unlike some other rust crate where the specific error call site often has its own unique variant). This makes coding easier as there is less error machinery to maintain, but it makes distinguishing between errors potentially more challenging -- All you have is the handful of variants and a message. As the rest of the rust ecosystem works on stabalizing "error context" (aka being able to walk up the chain of error messages) maybe finding the root cause will become simpler. I mostly mention this as maybe it would be a good time to figure out if we can add a |
Agree that it's a more standard way and it would be easier for error recovering purpose. @mingmwang, what do you think? |
@alamb @yahoNanJing This would help and this approach is more rusty. But it still depends on the programer to propagate the errors correctly in the execution path. And there is another case that we still could not catch the error easily.
|
Yes I agree care would be needed in various cases. I am not opposed to adding information to the TaskContext -- perhaps we can brainstorm additional improvements to errors / error handling as a follow on project. I know @tustvold has some ideas in apache/arrow-rs#2711 etc so maybe it would be part of a broader ecosystem improvment |
Calling tokio::spawn without doing anything with the join handle is not a good idea, not only will you potentially silently drop errors, as you elude to, but more problematically you will miss panics. This has been a frequent pain point, see the work by @crepererum to fix this https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+is%3Aclosed+author%3Acrepererum. Longer term I'm hoping to move away from tokio see #2504, in part to do away with these types of problems, but I'm not sure when I'll get back to working on that. In the meantime, a tokio::spawn that doesn't somehow await the JoinHandle is an anti-pattern IMO.
If we are to go down this route, I think the question to ask is "what users are there of the current structured error handling". In particular, could we just use something like anyhow which we already have a transitive dependency on as a result of prost... This has a fairly straightforward API for downcasting errors - https://docs.rs/anyhow/latest/anyhow/trait.Context.html#effect-on-downcasting
@mingmwang Perhaps you could expand on the particular error cases you are looking to handle. The major one that comes to my mind are network failures, which I would have thought would make more sense to handle at the layer responsible for network calls. Perhaps see https://github.com/apache/arrow-rs/blob/master/object_store/src/client/retry.rs#L121 for inspiration??
This seems to be where the ecosystem as a whole is headed, and makes sense to me. See rust-lang/rust#58520 for work to stabilize a mechanism to "walk" the tree |
It's not only about error handling though but also about task cancellation. Esp. in a server application, it's good to stop a query from running when the original request is cancelled. Uncaptured task handles will prevent this from happening, since tokio will still drive the execution. |
Yes, this is what I really want out of errors in DataFusion and Arrow |
Filed apache/arrow-rs#2725 to discuss better error handling in arrow (and thus also datafusion and ballista) |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
(This section helps Arrow developers understand the context and why for this feature, in addition to the what)
In current Datafusion/Ballista plan's execution path, the error result mapping is a chaos, the root cause of the task failure is not clear, especially when the plan tree contains pipeline breakers.
To make the error handling clear and make it easy to reasoning about the real task failure reason, the suggestion is to add a new method to TaskContext to set the task failure root cause into TaskContext.
Below is an example of Ballista shuffle read execution path, the arrow '->' represents the error/result return flow.
BallistaClient.fetch_partition(): BallistaError::General -> Ballista ShuffleReaderExec.fetch_partition(): DataFusionError::Execution -> Ballista ShuffleReaderExec.execute() : SendableRecordBatchStream -> ->RecordBatchStreamAdapter.poll_next(): ArrowResult::ExternalError ->
We can see the root error is mapped to Ballista Error -> DataFusion Error -> Arrow Error. And depends on how the
stream is polled and consumed, the Arrow Error can be mapped to DataFusion Error or Ballista Error again.
There are back and forth error mapping in the code base, this makes the root case reasoning become a challenge.
Want to hear your thoughts.
@andygrove @alamb @thinkharderdev
Describe the solution you'd like
A clear and concise description of what you want to happen.
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.
The text was updated successfully, but these errors were encountered: