Exception in threads kills entire process #7335

mccollum-amzn · 2017-08-04T15:57:33Z

Environment info

Operating System: MacOS

Compiler: Clang

Package used (Python/R/Scala/Julia): Python

MXNet commit hash (git rev-parse HEAD): 3a48185

Error Message:

I am looking at some code that adds new operators to MXNet. In a few edge cases, this code uses the CHECK macros to assert certain properties. When a CHECK fails, it throws an exception using the LOG_FATAL macro. This exception propagates up to ExecuteOprBlock() in the ThreadedEngine class, where it is logged using LOG_ERROR. The exception is printed to the console (twice, actually, due to a simple MXNet bug) and then another exception is thrown out of the thread’s run handler. Per the C++ spec, an exception that escapes a thread’s entry function causes terminate() to be called, which ends the entire process and takes MXNet down with it. This has a few side effects I’d like some feedback on.

First, the caught exception is only ever logged to the console. Anyone using Jupyter will never see any errors unless they have access to the console that launched the kernel. If you are using a hosted notebook solution, where you don’t see the console, the process exits and no information at all is passed back to the user. This is a pretty awful user experience.

Second, the environment itself exits. If you are a few days into training a model when the problem occurs, all your work is lost. You’ll see a stack trace, but you’ll be forced to start everything over again.

Third, this means that MXNet behaves very differently between the NaiveEngine and the regular threaded engine. In Naïve mode, the exception is printed inside the interpreter and your environment is retained. In Threaded mode, the exception is only logged to the console, the interpreter exits, and you lose all your work.

Are these behaviours we want to keep? It seems to me that the proper thing would be for the ThreadedEngine to catch the exception and pass it back to the main thread where it could be treated the same as exceptions in the main thread.

We could use C++11 exception_ptr (http://en.cppreference.com/w/cpp/error/exception_ptr) to pass the exceptions back to the main thread. One problem is that we need to ensure related operations in other threads are also terminated. Something like:

1. Print out related operands to a log file.
2. Pass the exception to the main thread for processing and display.
3. Kill the operators depending on the operator that threw the exception.
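
The core of this proposal (catch on the worker thread, hand the exception to the main thread) can be sketched with standard C++11 facilities. This is only a minimal illustration, not MXNet code: FailingOperator, RunOnWorker and DescribeFailure are hypothetical names, and a real engine would keep a thread pool rather than spawning and joining one thread per operator.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical stand-in for an operator whose CHECK fails on a worker thread.
void FailingOperator() {
  throw std::runtime_error("Check failed: 0 Exception thrown here");
}

// Runs `op` on a worker thread. Any exception is captured instead of being
// allowed to escape the thread function (which would call std::terminate).
// Returns a null exception_ptr if `op` finished normally.
std::exception_ptr RunOnWorker(const std::function<void()>& op) {
  assert(op && "op must be callable");
  std::exception_ptr captured;
  std::thread worker([&captured, &op]() {
    try {
      op();
    } catch (...) {
      captured = std::current_exception();
    }
  });
  worker.join();  // join() synchronizes with the worker, so the read below is safe
  return captured;
}

// Rethrows a captured exception on the calling thread and returns its message;
// returns an empty string when nothing was captured.
std::string DescribeFailure(const std::exception_ptr& captured) {
  if (!captured) return "";
  try {
    std::rethrow_exception(captured);
  } catch (const std::exception& e) {
    return e.what();
  } catch (...) {
    return "unknown exception";
  }
}
```

Calling DescribeFailure(RunOnWorker(FailingOperator)) from the main thread yields the message instead of terminating the process. A frontend could equally call std::rethrow_exception directly and let its normal error path handle the exception.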

Minimum reproducible example

Modify any operator that executes inside a thread to include something like:

CHECK(0) << "Exception thrown here";
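
For readers who want to experiment outside of MXNet, the "failed check throws" behaviour can be imitated with a small stand-in. Everything below is an assumption made for illustration: SKETCH_CHECK, FatalMessage and OperatorWithFailingCheck are hypothetical names, and this is not the real CHECK implementation.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical, simplified stand-in for a CHECK-style macro. It is NOT the
// real implementation; it only mimics "a failed check throws an exception".
class FatalMessage {
 public:
  FatalMessage(const char* file, int line, const char* expr) {
    assert(file != nullptr && expr != nullptr);
    stream_ << file << ":" << line << ": Check failed: " << expr << " ";
  }
  std::ostream& stream() { return stream_; }
  // Destructors are noexcept by default, so throwing needs noexcept(false).
  // The throw happens once the streamed message is complete.
  ~FatalMessage() noexcept(false) { throw std::runtime_error(stream_.str()); }

 private:
  std::ostringstream stream_;
};

#define SKETCH_CHECK(cond) \
  if (!(cond)) FatalMessage(__FILE__, __LINE__, #cond).stream()

// Hypothetical operator body containing the failing check from the repro.
void OperatorWithFailingCheck() {
  SKETCH_CHECK(0) << "Exception thrown here";
}
```

Calling OperatorWithFailingCheck() on the main thread gives an ordinary, catchable std::runtime_error. Calling it as the entry function of a std::thread without a try/catch leads to std::terminate, which is the behaviour this issue describes.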

bhavinthaker · 2017-08-04T18:13:11Z

Comments from Mu Li:

Hi Cliff,

Thank you for your summary. We have thought about fixing this before. One solution is to use C++11 exception_ptr (http://en.cppreference.com/w/cpp/error/exception_ptr) to pass all exceptions to the main thread, so that we can catch all of them in the Python frontend.

This feature is not on our team roadmap right now, so it would be great if you could work on it.

Thanks
Mu

Comments from Junyuan Xie:

One problem with this is that once an operator fails inside the engine, it will never complete, and all subsequent operations will hang.

It’s unclear how one should recover from this.
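
One possible way around the hang, sketched here under stated assumptions and not as a description of what MXNet does: record the failure on the variable the operator was supposed to write, skip every operator that reads a failed variable, and rethrow when the caller finally waits on the result. Var, Execute and WaitToRead are hypothetical names in this sketch, and it is single-threaded to keep the idea visible.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical engine variable: remembers the failure of whichever operator
// was supposed to produce it.
struct Var {
  std::exception_ptr error;
};

// Runs `op` unless one of its inputs already failed. In that case the op is
// skipped and the failure is forwarded to its output, so the op still
// "completes" and nothing downstream waits forever.
// Returns true only if `op` actually ran to completion.
bool Execute(const std::vector<const Var*>& inputs, Var* output,
             const std::function<void()>& op) {
  assert(output != nullptr);
  for (const Var* input : inputs) {
    if (input->error) {
      output->error = input->error;
      return false;
    }
  }
  try {
    op();
    return true;
  } catch (...) {
    output->error = std::current_exception();
    return false;
  }
}

// Blocking-read analogue: surfaces the stored failure on the caller's thread.
void WaitToRead(const Var& var) {
  if (var.error) std::rethrow_exception(var.error);
}
```

With this shape, a failure in one operator flows through the dependency chain as data, and the user sees the original exception at the point where they ask for the result.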

Comments from Junru Shao:

Would it be worth trying to:
1. print out related operands to some log file,
2. pass the exception to the main thread, and
3. kill the operators depending on the operator that throws the exception?

Thanks,
Junru

szha · 2017-11-04T00:26:29Z

This issue is closed due to lack of activity in the last 90 days. Feel free to ping me to reopen if this is still an active issue. Thanks!
Also, please check out our forum (and its Chinese version) for general "how-to" questions.

eric-haibin-lin · 2017-11-07T21:55:24Z

@anirudh2290

anirudh2290 · 2018-01-23T19:21:30Z

Please see: Exception Handling Wiki

szha closed this as completed Nov 4, 2017

eric-haibin-lin added the Call for Contribution and Discussion labels Nov 7, 2017

eric-haibin-lin reopened this Nov 7, 2017

larroy mentioned this issue Nov 27, 2017

Python crashes (core-dump) instead of a graceful error message when GPU context is used on a CPU-only instance (EC2 x1.32xlarge) #8835

Closed

anirudh2290 mentioned this issue Dec 20, 2017

random_uniform causes VM to crash #9131

Closed

This was referenced Jan 27, 2018

nd.argmax cause "Kernel Died" error in Jupyter Notebook #9567

Closed

nd.stack causes abort for differently sized arrays #9380

Closed

Better Exception Handling for Operators #9681

Merged

piiswrong closed this as completed in #9681 Feb 13, 2018
