Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance TaskContext and add task failure root cause #3410

Open
mingmwang opened this issue Sep 9, 2022 · 11 comments
Open

Enhance TaskContext and add task failure root cause #3410

mingmwang opened this issue Sep 9, 2022 · 11 comments
Labels
enhancement New feature or request

Comments

@mingmwang
Copy link
Contributor

mingmwang commented Sep 9, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
(This section helps Arrow developers understand the context and why for this feature, in addition to the what)

In current Datafusion/Ballista plan's execution path, the error result mapping is a chaos, the root cause of the task failure is not clear, especially when the plan tree contains pipeline breakers.
To make the error handling clear and make it easy to reasoning about the real task failure reason, the suggestion is to add a new method to TaskContext to set the task failure root cause into TaskContext.

Below is an example of Ballista shuffle read execution path, the arrow '->' represents the error/result return flow.

BallistaClient.fetch_partition(): BallistaError::General -> Ballista ShuffleReaderExec.fetch_partition(): DataFusionError::Execution -> Ballista ShuffleReaderExec.execute() : SendableRecordBatchStream -> ->RecordBatchStreamAdapter.poll_next(): ArrowResult::ExternalError ->

We can see the root error is mapped to Ballista Error -> DataFusion Error -> Arrow Error. And depends on how the
stream is polled and consumed, the Arrow Error can be mapped to DataFusion Error or Ballista Error again.
There are back and forth error mapping in the code base, this makes the root case reasoning become a challenge.

Want to hear your thoughts.

@andygrove @alamb @thinkharderdev

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

@mingmwang mingmwang added the enhancement New feature or request label Sep 9, 2022
@mingmwang
Copy link
Contributor Author

@yahoNanJing

@alamb
Copy link
Contributor

alamb commented Sep 9, 2022

I agree the wrapping is confusing and I think in general the wrapping of errors is not valuable. We see similar wrapping obfuscation in IOx as well.

The datafusion code already tries to unwrap an ArrowError here (when converting from DataFusionError --> ArrowError)

https://github.com/apache/arrow-datafusion/blob/master/datafusion/common/src/error.rs#L209-L217

I wonder if we can do something similar in Ballista? Or maybe write some "unwrapper" that would strip out the useless wrapping errors before display.

@yahoNanJing
Copy link
Contributor

Thanks Alamb, the context here is not just for displaying. Recently we are working on error handling and error recovering which is useful for distributed execution. And we hope to do error recovering for some special error case and we need to know the root cause of the error. If an error is wrapped too much, it will be very difficult to make decision of which error to recover and others not.

@alamb
Copy link
Contributor

alamb commented Sep 10, 2022

And we hope to do error recovering for some special error case and we need to know the root cause of the error. If an error is wrapped too much, it will be very difficult to make decision of which error to recover and others not.

That is a neat usecase and it makes sense.

Arrow and DataFusion don't have very specific errors (unlike some other rust crate where the specific error call site often has its own unique variant). This makes coding easier as there is less error machinery to maintain, but it makes distinguishing between errors potentially more challenging -- All you have is the handful of variants and a message.

As the rest of the rust ecosystem works on stabalizing "error context" (aka being able to walk up the chain of error messages) maybe finding the root cause will become simpler.

I mostly mention this as maybe it would be a good time to figure out if we can add a source: of each ArrowError, and DataFusionError as a more general purpose method than adding such info to the TaskContext

@yahoNanJing
Copy link
Contributor

yahoNanJing commented Sep 10, 2022

I mostly mention this as maybe it would be a good time to figure out if we can add a source: of each ArrowError, and DataFusionError as a more general purpose method than adding such info to the TaskContext

Agree that it's a more standard way and it would be easier for error recovering purpose. @mingmwang, what do you think?

@mingmwang
Copy link
Contributor Author

mingmwang commented Sep 13, 2022

@alamb @yahoNanJing This would help and this approach is more rusty. But it still depends on the programer to propagate the errors correctly in the execution path. And there is another case that we still could not catch the error easily.
If we add such info to TaskContext, errors can be propagated easily and there is less burden to the programer.

        tokio::spawn(async move {

        .............................
        });

@alamb
Copy link
Contributor

alamb commented Sep 13, 2022

But it still depends on the programer to propagate the errors correctly in the execution path. And there is another case that we still could not catch the error easily.
If we add such info to TaskContext, errors can be propagated easily and there is less burden to the programer.

Yes I agree care would be needed in various cases. I am not opposed to adding information to the TaskContext -- perhaps we can brainstorm additional improvements to errors / error handling as a follow on project. I know @tustvold has some ideas in apache/arrow-rs#2711 etc so maybe it would be part of a broader ecosystem improvment

@tustvold
Copy link
Contributor

tustvold commented Sep 13, 2022

If we add such info to TaskContext, errors can be propagated easily and there is less burden to the programer.

Calling tokio::spawn without doing anything with the join handle is not a good idea, not only will you potentially silently drop errors, as you elude to, but more problematically you will miss panics. This has been a frequent pain point, see the work by @crepererum to fix this https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+is%3Aclosed+author%3Acrepererum. Longer term I'm hoping to move away from tokio see #2504, in part to do away with these types of problems, but I'm not sure when I'll get back to working on that. In the meantime, a tokio::spawn that doesn't somehow await the JoinHandle is an anti-pattern IMO.

additional improvements to errors / error handling

If we are to go down this route, I think the question to ask is "what users are there of the current structured error handling". In particular, could we just use something like anyhow which we already have a transitive dependency on as a result of prost... This has a fairly straightforward API for downcasting errors - https://docs.rs/anyhow/latest/anyhow/trait.Context.html#effect-on-downcasting

And we hope to do error recovering for some special error case and we need to know the root cause of the error.

@mingmwang Perhaps you could expand on the particular error cases you are looking to handle. The major one that comes to my mind are network failures, which I would have thought would make more sense to handle at the layer responsible for network calls. Perhaps see https://github.com/apache/arrow-rs/blob/master/object_store/src/client/retry.rs#L121 for inspiration??

I mostly mention this as maybe it would be a good time to figure out if we can add a source: of each ArrowError, and DataFusionError as a more general purpose method than adding such info to the TaskContext

This seems to be where the ecosystem as a whole is headed, and makes sense to me. See rust-lang/rust#58520 for work to stabilize a mechanism to "walk" the tree

@crepererum
Copy link
Contributor

Calling tokio::spawn without doing anything with the join handle is not a good idea, not only will you potentially silently drop errors, as you elude to, but more problematically you will miss panics.

It's not only about error handling though but also about task cancellation. Esp. in a server application, it's good to stop a query from running when the original request is cancelled. Uncaptured task handles will prevent this from happening, since tokio will still drive the execution.

@alamb
Copy link
Contributor

alamb commented Sep 13, 2022

See rust-lang/rust#58520 for work to stabilize a mechanism to "walk" the tree

Yes, this is what I really want out of errors in DataFusion and Arrow

@alamb
Copy link
Contributor

alamb commented Sep 14, 2022

Filed apache/arrow-rs#2725 to discuss better error handling in arrow (and thus also datafusion and ballista)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants