This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
Description
Python crashes (core-dump) instead of gracefully returning an error message when GPU context is used on a CPU-only instance (EC2 x1.32xlarge). The root-cause of the problem may be "unknown CUDA error" when ideally it should return a valid CUDA error that MXNet can trap and display the error message instead of crashing Python.
Environment info (Required)
EC2 instance type: x1.32xlarge
MXNet: Release candidate: v1.0.0 RC0
Build info (Required if built from source)
Release candidate: v1.0.0 RC0
Compiler (gcc/clang/mingw/visual studio): gcc 5.4 on Ubuntu Linux 16.04
Error Message:
--snip--
Stack trace returned 8 entries:
[bt] (0) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f92ed23fdfc]
[bt] (1) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow9SetDeviceINS_3gpuEEEvi+0xd0) [0x7f92efc55410]
[bt] (2) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x75) [0x7f92efc5da35]
[bt] (3) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5+0x63) [0x7f92efc5dce3]
[bt] (4) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f92efc57cba]
[bt] (5) /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f92fd68bc5c]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f92fe8d56ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f92fe60b3dd]
[17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/dmlc-core/include/dmlc/./logging.h:308: [17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: unknown error
Stack trace returned 8 entries:
[bt] (0) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f92ed23fdfc]
[bt] (1) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow9SetDeviceINS_3gpuEEEvi+0xd0) [0x7f92efc55410]
[bt] (2) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x75) [0x7f92efc5da35]
[bt] (3) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5+0x63) [0x7f92efc5dce3]
[bt] (4) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f92efc57cba]
[bt] (5) /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f92fd68bc5c]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f92fe8d56ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f92fe60b3dd]
terminate called after throwing an instance of 'dmlc::Error'
what(): [17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: unknown error
Stack trace returned 8 entries:
[bt] (0) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f92ed23fdfc]
[bt] (1) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow9SetDeviceINS_3gpuEEEvi+0xd0) [0x7f92efc55410]
[bt] (2) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x75) [0x7f92efc5da35]
[bt] (3) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5+0x63) [0x7f92efc5dce3]
[bt] (4) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f92efc57cba]
[bt] (5) /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f92fd68bc5c]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f92fe8d56ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f92fe60b3dd]
terminate called after throwing an instance of 'dmlc::Error'
what(): [17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: unknown error
Stack trace returned 8 entries:
[bt] (0) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f92ed23fdfc]
[bt] (1) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow9SetDeviceINS_3gpuEEEvi+0xd0) [0x7f92efc55410]
[bt] (2) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x75) [0x7f92efc5da35]
[bt] (3) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5+0x63) [0x7f92efc5dce3]
[bt] (4) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f92efc57cba]
[bt] (5) /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f92fd68bc5c]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f92fe8d56ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f92fe60b3dd]
Aborted (core dumped)
--snip--
Minimal reproducible example
Build from source using the configuration below and run the reproduction steps on a CPU-only instance.
$ cd src/make
$ diff config.mk config.mk.ci | egrep ">"
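A minimal trigger from the Python shell might look like the sketch below (hypothetical, assuming the v1.0.0 RC0 Python bindings; the shape and values are arbitrary). Creating the NDArray is lazy, so the failure surfaces only when a read forces the GPU worker thread to run, where the fatal `cudaSetDevice` check fires and aborts the process rather than raising a Python exception:

```python
import mxnet as mx

# Allocation on a GPU context is scheduled asynchronously.
x = mx.nd.ones((2, 2), ctx=mx.gpu(0))

# Forcing a synchronous read dispatches the pending work to the GPU
# worker thread. On a CPU-only host this is where the process
# core-dumps instead of raising mx.base.MXNetError.
x.asnumpy()
```

Note that wrapping the call in `try/except` does not help: the `dmlc::Error` is thrown on the engine's worker thread, not the Python thread, so `std::terminate` is called before any Python-level handler can run. That is the crux of this issue.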
What have you tried to solve it?
Workaround: do not use a GPU context on a CPU-only instance.
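Until MXNet traps the CUDA error itself, user code can guard context selection with an external check. The sketch below (an assumption, not MXNet API; the `gpu_available` helper is hypothetical) probes for a working `nvidia-smi` before ever touching `mx.gpu()`, so the fatal path is never scheduled on a CPU-only box:

```python
import shutil
import subprocess

def gpu_available():
    """Heuristic GPU probe: True only if nvidia-smi exists and runs cleanly."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        # "-L" lists GPUs; a non-zero exit (e.g. no driver) means no usable GPU.
        subprocess.run(["nvidia-smi", "-L"], check=True, capture_output=True)
        return True
    except (subprocess.CalledProcessError, OSError):
        return False
```

Calling code would then pick the context up front, e.g. `ctx = mx.gpu(0) if gpu_available() else mx.cpu()`, keeping the GPU worker thread from ever being started on an instance like x1.32xlarge.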