Exception in threads kills entire process #7335
Comments
Comments from Mu Li: Hi Cliff, thank you for your summary. We thought about fixing this before. One solution is to use C++11 exception_ptr (http://en.cppreference.com/w/cpp/error/exception_ptr) to pass all exceptions to the main thread, so that we can catch all of them at the Python frontend. This feature is not on our team roadmap now; it would be great if you could work on it. Thanks.
Comments from Junyuan Xie: One problem with this is that once an operator fails inside the engine it will never complete, and all subsequent operations will hang. It's unclear how one should recover from this.
Comments from Junru Shao: Will it be a good try to Thanks,
This issue is closed due to lack of activity in the last 90 days. Feel free to ping me to reopen if this is still an active issue. Thanks!
Please see: Exception Handling Wiki
For bugs or installation issues, please provide the following information.
The more information you provide, the more likely people will be able to help you.
Environment info
Operating System: MacOS
Compiler: Clang
Package used (Python/R/Scala/Julia): Python
MXNet commit hash (git rev-parse HEAD): 3a48185
Error Message:
I am looking at some code that adds new operators to MXNet. In a few edge cases, this code uses the CHECK macros to assert certain properties. When a CHECK fails, it throws an exception using the LOG_FATAL macro. This exception makes its way up to ExecuteOprBlock() in the ThreadedEngine class, where it is logged using LOG_ERROR. The exception is printed to the console (twice actually, due to a simple MXNet bug) and then another exception is thrown out of the thread's run handler. Under the C++ standard, an exception that escapes a thread's top-level function causes std::terminate() to be called, so this second exception kills the entire process and MXNet exits. This has a few side effects I'd like some feedback on.
First, the caught exception is only ever logged to the console. Anyone using Jupyter will never see any errors unless they have access to the console that launched the kernel. If you are using a hosted notebook solution, where you don’t see the console, the process will exit and zero information will be provided back to the user. This is a pretty awful user-experience.
Second, the environment itself exits. If you were a few days into training a model when the problem occurs, all your work will be lost. You’ll see a stack-trace, but you’ll be forced to start everything over again.
Third, this means that MXNet behaves very differently between the NaiveEngine and the regular threaded engine. In Naive mode, the exception is printed inside the interpreter and your environment is retained. In Threaded mode, the exception is only logged to the console, the interpreter exits, and you lose all your work.
Are these behaviours we want to keep? It seems to me that the proper thing would be for the ThreadedEngine to catch the exception and pass it back to the main thread where it could be treated the same as exceptions in the main thread.
We could use C++11 exception_ptr (http://en.cppreference.com/w/cpp/error/exception_ptr) to pass the exceptions back to the main thread. One problem is that we need to ensure related operations in other threads are also terminated. Something like:
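A minimal sketch of that idea, written as standalone C++ rather than against the actual ThreadedEngine code. Each worker catches whatever its operator throws and stores it as a std::exception_ptr, and the main thread rethrows it after joining. `FirstError` and `RunOnWorkers` are hypothetical names used only for illustration, and the `Failed()` check is a simplified stand-in for cancelling related operations, not a real dependency-aware mechanism.

```cpp
#include <atomic>
#include <exception>
#include <functional>
#include <mutex>
#include <stdexcept>
#include <thread>
#include <vector>

// Hypothetical helper (not part of MXNet): remembers the first exception
// raised on any worker thread so the main thread can rethrow it later.
class FirstError {
 public:
  // Called from a worker's catch block; only the first failure is kept.
  void Capture(std::exception_ptr e) {
    std::lock_guard<std::mutex> lock(mu_);
    if (!first_) first_ = e;
    failed_.store(true);
  }

  // Lets work that has not started yet skip itself after a failure.
  bool Failed() const { return failed_.load(); }

  // Called on the main thread once all workers have been joined.
  void RethrowIfAny() {
    std::exception_ptr e;
    {
      std::lock_guard<std::mutex> lock(mu_);
      e = first_;
      first_ = nullptr;
      failed_.store(false);
    }
    if (e) std::rethrow_exception(e);
  }

 private:
  std::mutex mu_;
  std::exception_ptr first_;
  std::atomic<bool> failed_{false};
};

// Hypothetical stand-in for the engine: runs every op on its own thread.
// No exception is allowed to escape a thread function (which would call
// std::terminate); the first one is rethrown here, on the calling thread.
void RunOnWorkers(const std::vector<std::function<void()>>& ops) {
  FirstError error;
  std::vector<std::thread> workers;
  for (const auto& op : ops) {
    workers.emplace_back([&op, &error] {
      if (error.Failed()) return;  // best-effort skip after a failure
      try {
        op();
      } catch (...) {
        error.Capture(std::current_exception());
      }
    });
  }
  for (auto& t : workers) t.join();
  error.RethrowIfAny();
}
```

A caller on the main thread can wrap `RunOnWorkers(...)` in an ordinary try/catch, and a failure raised inside a worker then surfaces there instead of terminating the process. This does not address operations that are already blocked waiting on the failed operator's outputs; those would still need to be marked as failed or released by the engine.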
Minimum reproducible example
Modify any operator that executes inside a thread to include something like: