mx.nd.Custom conflicts with memory management #14522
Comments
Hey, this is the MXNet Label Bot.
@arcadiaphy @wkcn @anirudh2290 Interested in looking into this?
Does the bug reproduce in an old version of MXNet?
I mean, did this issue exist before PR #14363?
Reproduced the bug in a CPU environment; it gets stuck too. I used GDB to print the backtrace:
I see.
Both before and after. I believe it's a different issue.
OOM on CPU gets stuck too.
@wkcn I think it's not related to OOM; it looks more like LOG(FATAL) gets stuck.
@arcadiaphy If LOG(FATAL) gets stuck, then there might be more things that can get stuck, which are harder to figure out.
I think it's worth checking whether it still gets stuck when cudaGetErrorString is not called. Since the custom op uses its own threads, it may not handle exceptions very well, but std::terminate should still be called for an unhandled exception. I am surprised that's not happening here.
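As a side note on the propagation point, here is a small Python analogy (the real custom-op callbacks run on the backend's own C++ threads, so this is only an illustration): an exception raised on a spawned thread is never re-raised in the thread that joins it, which is consistent with the failure being swallowed instead of surfacing to the user.

```python
# Python analogy only: an exception raised in a worker thread is printed on
# that thread and then discarded; join() returns normally and the main thread
# keeps going. In C++, an exception escaping a std::thread would instead
# call std::terminate and abort the process.
import threading

def worker():
    raise RuntimeError("simulated failure inside the custom-op worker")

t = threading.Thread(target=worker)
t.start()
t.join()   # no exception is re-raised here
print("main thread continues as if nothing happened")
```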
Hi, I'm investigating this one. With the provided example, I believe the cause of the error is the following, at least in a CPU context (changing the gpu context to cpu in the example above).
@anirudh2290 @wkcn @YutingZhang Finally figured out the reason. Normally, when an exception is thrown in a spawned thread, it should be caught there and rethrown to the waiting thread. But there are still two problems in the exception handling of the custom op.
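For reference, the usual way to surface a spawned thread's exception is to capture it on the worker and re-raise it when the waiting thread joins. A minimal Python sketch of that pattern (an illustration only, not MXNet's actual implementation):

```python
import threading

class PropagatingThread(threading.Thread):
    """Worker thread that captures an exception and re-raises it on join()."""
    def run(self):
        self.exc = None
        try:
            super().run()          # runs the target passed to the constructor
        except Exception as e:
            self.exc = e           # remember the failure instead of losing it

    def join(self, timeout=None):
        super().join(timeout)
        if self.exc is not None:
            raise self.exc         # surface the worker's failure to the caller

def failing_work():
    raise MemoryError("simulated OOM inside the worker")

t = PropagatingThread(target=failing_work)
t.start()
try:
    t.join()
except MemoryError as e:
    print("caught on the waiting thread:", e)
```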
@wkcn @anirudh2290 Like I've mentioned before, adding the on_complete callback in CustomOperator is not a good design, since the call may be skipped when an exception happens. Any suggestions on how to avoid it?
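To make the concern concrete, here is a plain Python sketch (on_complete and the function names below are stand-ins, not the actual CustomOperator code): if the completion callback only runs after the user code returns normally, an exception skips it and anything waiting on the op blocks forever; wrapping the call in try/finally is one way to guarantee the engine is always notified.

```python
# Stand-in sketch: `forward` represents the user's custom-op code and
# `on_complete` represents the engine callback that marks the async op done.

def run_op_unsafe(forward, on_complete):
    forward()          # if this raises, on_complete is never called ...
    on_complete()      # ... and whoever waits on the op hangs forever

def run_op_safe(forward, on_complete):
    try:
        forward()
    finally:
        on_complete()  # always signal completion, even when forward() fails

def failing_forward():
    raise MemoryError("simulated OOM in the custom op")

try:
    run_op_safe(failing_forward, lambda: print("on_complete called"))
except MemoryError:
    print("exception still propagates to the caller")
```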
On CPU I see a crash in sgemm. I don't see the relation to exceptions on CPU.
On GPU this is the segmentation fault that I get. Are we talking about the same problem?
I talked to @anirudh2290 and I understand we are talking about two separate issues here:
@arcadiaphy do you plan to send a fix for the handling of exceptions in custom ops? I can work on the integer overflow problem.
This repro in imperative mode isolates the crash from the exception-handling problem in custom ops:
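The actual snippet is not reproduced above. Purely as an assumed illustration of the kind of call involved (the thread points at sgemm and at integer overflow in the BLAS shape handling), it would be something like the following; the shapes are my guesses and need a machine with a very large amount of memory.

```python
# Assumed sketch, not the original repro: an imperative matrix multiply whose
# element count exceeds the 32-bit integer range (50000 * 50000 = 2.5e9 > 2^31 - 1),
# the kind of shape that can trip up 32-bit index arithmetic in a BLAS call.
import mxnet as mx

n = 50000
a = mx.nd.ones((n, n), dtype='float32')   # roughly 10 GB per matrix
b = mx.nd.ones((n, n), dtype='float32')
c = mx.nd.dot(a, b)
c.wait_to_read()
```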
@larroy Yes, two different issues here. Reproduced the gemm bug too, but I'm focusing on the exception handling issue.
Thanks @arcadiaphy! Nice work! We can do this incrementally. Would you be willing to start with a PR for 1?
@anirudh2290 I have started a PR for 1, but the really tricky one is 2.
Great! Thanks for your analysis!
Nice! I will work on a fix for the tensor shapes in the BLAS engine, unless somebody else has a strong desire to help there.
@mxnet-label-bot add [C++, Backend, Bug, Exception Handling]
When training/running a large neural network with a CustomOp, MXNet can get stuck. My speculation is that if memory management (e.g., releasing/reallocating GPU memory, or raising an "out of memory" error) is needed while running the CustomOp, MXNet can get stuck.
A minimal piece of code to show that a CustomOp can deadlock with memory management (out of memory in this case):
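The original snippet is not reproduced here; the following is a rough sketch of the pattern being described (a custom op plus GPU allocations scaled by the argument to main()), where the op body, names, and buffer sizes are all assumptions rather than the author's code.

```python
# Rough sketch only; the identity op, shapes, and loop below are assumptions.
import mxnet as mx

class Identity(mx.operator.CustomOp):
    def forward(self, is_train, req, in_data, out_data, aux):
        self.assign(out_data[0], req[0], in_data[0])

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        self.assign(in_grad[0], req[0], out_grad[0])

@mx.operator.register("identity")
class IdentityProp(mx.operator.CustomOpProp):
    def list_arguments(self):
        return ['data']

    def list_outputs(self):
        return ['output']

    def infer_shape(self, in_shape):
        return in_shape, [in_shape[0]], []

    def create_operator(self, ctx, shapes, dtypes):
        return Identity()

def main(scale):
    ctx = mx.gpu(0)
    # Push `scale` large buffers through the custom op; with a large enough
    # `scale` this is expected to raise out-of-GPU-memory, but per the issue
    # the process hangs instead.
    outs = []
    for _ in range(scale):
        x = mx.nd.ones((1024, 1024, 256), ctx=ctx)   # ~1 GB per buffer
        outs.append(mx.nd.Custom(x, op_type='identity'))
    for y in outs:
        y.wait_to_read()

main(1)    # works
main(100)  # expected "out of GPU memory", but it just gets stuck
```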
As expected, main(1) should work.
main(100) should give an "out of GPU memory" error. However, it just got stuck.
The real-world problem I met is not just the inability to give an "out of memory" error. It seems MXNet can release/reallocate memory dynamically, but this probably also deadlocks with the CustomOp, so my program, which could fit into GPU memory, can also get stuck.
Tested with a nightly build of mxnet-cu90mkl and Python 3.6.