This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Python crashes (core-dump) instead of a graceful error message when GPU context is used on a CPU-only instance (EC2 x1.32xlarge) #8835

Closed
bhavinthaker opened this issue Nov 27, 2017 · 3 comments · Fixed by #9681

Comments

@bhavinthaker
Contributor

Description

Python crashes (core dump) instead of gracefully returning an error message when a GPU context is used on a CPU-only instance (EC2 x1.32xlarge). The root cause may be that the CUDA runtime returns "unknown error"; ideally it would return a specific CUDA error code that MXNet could trap and report as an error message instead of crashing the Python process.

Environment info (Required)

EC2 instance type: x1.32xlarge
MXNet: Release candidate: v1.0.0 RC0

Build info (Required if built from source)

Release candidate: v1.0.0 RC0

Compiler (gcc/clang/mingw/visual studio): gcc 5.4 on Ubuntu Linux 16.04

Error Message:

--snip--

>>> import mxnet as mx
>>> mx.__version__
'1.0.0'
>>> shape = (10, 10)
>>> a = mx.nd.ones(shape, mx.gpu(0))
[17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/dmlc-core/include/dmlc/./logging.h:308: [17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: unknown error

Stack trace returned 8 entries:
[bt] (0) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f92ed23fdfc]
[bt] (1) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow9SetDeviceINS_3gpuEEEvi+0xd0) [0x7f92efc55410]
[bt] (2) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x75) [0x7f92efc5da35]
[bt] (3) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x63) [0x7f92efc5dce3]
[bt] (4) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f92efc57cba]
[bt] (5) /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f92fd68bc5c]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f92fe8d56ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f92fe60b3dd]

[17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/dmlc-core/include/dmlc/./logging.h:308: [17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: unknown error

Stack trace returned 8 entries:
[bt] (0) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f92ed23fdfc]
[bt] (1) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow9SetDeviceINS_3gpuEEEvi+0xd0) [0x7f92efc55410]
[bt] (2) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x75) [0x7f92efc5da35]
[bt] (3) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x63) [0x7f92efc5dce3]
[bt] (4) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f92efc57cba]
[bt] (5) /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f92fd68bc5c]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f92fe8d56ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f92fe60b3dd]
terminate called after throwing an instance of 'dmlc::Error'
  what(): [17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: unknown error

Stack trace returned 8 entries:
[bt] (0) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f92ed23fdfc]
[bt] (1) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow9SetDeviceINS_3gpuEEEvi+0xd0) [0x7f92efc55410]
[bt] (2) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x75) [0x7f92efc5da35]
[bt] (3) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x63) [0x7f92efc5dce3]
[bt] (4) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f92efc57cba]
[bt] (5) /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f92fd68bc5c]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f92fe8d56ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f92fe60b3dd]

terminate called after throwing an instance of 'dmlc::Error'
what(): [17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: unknown error

Stack trace returned 8 entries:
[bt] (0) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f92ed23fdfc]
[bt] (1) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow9SetDeviceINS_3gpuEEEvi+0xd0) [0x7f92efc55410]
[bt] (2) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x75) [0x7f92efc5da35]
[bt] (3) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x63) [0x7f92efc5dce3]
[bt] (4) /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f92efc57cba]
[bt] (5) /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f92fd68bc5c]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f92fe8d56ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f92fe60b3dd]

Aborted (core dumped)
--snip--

Minimum reproducible example

Build from source using the configuration below and run the reproduction steps on a CPU-only instance.

$ cd src/make
$ diff config.mk config.mk.ci | egrep ">"

> DEBUG = 1
> USE_CUDA = 1
> USE_CUDA_PATH = /usr/local/cuda
> USE_CUDNN = 1
> USE_DIST_KVSTORE = 1
> USE_S3 = 1

>>> import mxnet as mx
>>> mx.__version__
'1.0.0'
>>> shape = (10, 10)
>>> a = mx.nd.ones(shape, mx.gpu(0))
[17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/dmlc-core/include/dmlc/./logging.h:308: [17:17:52] /home/ubuntu/bt/apache-mxnet-src-1.0.0.rc0-incubating/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: unknown error

What have you tried to solve it?

Workaround: Do NOT use GPU context on a CPU-only instance.
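Until the underlying CUDA error is surfaced as a catchable exception (which is what this issue asks for), the defensive pattern is to probe for a usable GPU once at startup and fall back to CPU. A minimal sketch, with the probe injected as a callable so it is illustrative rather than MXNet-specific; with MXNet one would pass something like `lambda: mx.nd.ones((1,), ctx=mx.gpu(0)).wait_to_read()` as the probe. Note that on the affected builds the probe itself can still abort the process, so this only helps on builds where the failure propagates as a normal Python exception:

```python
def pick_context(gpu_probe, gpu_ctx, cpu_ctx):
    """Return gpu_ctx if gpu_probe() succeeds, else cpu_ctx.

    gpu_probe should perform a tiny GPU operation and force it to
    complete (e.g. a small allocation followed by wait_to_read() in
    MXNet), so that any CUDA failure surfaces here rather than later.
    """
    try:
        gpu_probe()
        return gpu_ctx
    except Exception:  # e.g. mxnet.base.MXNetError on fixed builds
        return cpu_ctx

# Stand-in demo: a probe that raises, as it would on a CPU-only host.
def no_gpu_probe():
    raise RuntimeError("CUDA: unknown error")

ctx = pick_context(no_gpu_probe, "gpu(0)", "cpu()")
```

Here `ctx` ends up as the CPU fallback, since the probe raised.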

@larroy
Contributor

larroy commented Nov 27, 2017

Same issue:
#7335

@larroy
Contributor

larroy commented Nov 28, 2017

This doesn't seem to happen on master, but I think that's because it now uses the imperative path, which propagates a normal error.

@leopd
Contributor

leopd commented Mar 11, 2018

I'm still seeing this on the latest AWS DL AMI, v5.0.
