mx.nd.Custom conflicts with memory management #14522
Comments
Hey, this is the MXNet Label Bot.
@arcadiaphy @wkcn @anirudh2290 Interested in looking into this?
Does the bug reproduce in an old version of MXNet?
I mean, did this issue exist before PR #14363?
Reproduced the bug in a CPU environment; it gets stuck too. I used GDB to print the backtrace:
I see.
Both before and after. I believe it's a different issue.
OOM on CPU gets stuck too.
@wkcn I think it's not related to OOM; it looks more like LOG(FATAL) gets stuck.
@arcadiaphy If LOG(FATAL) gets stuck, then there might be more things that can get stuck, which are harder to figure out.
I think it's worth checking whether it still gets stuck when cudaGetErrorString is not called. Since the custom op uses its own threads, it may not handle exceptions very well, but std::terminate should still be called for an unhandled exception. I am surprised that's not happening here.
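As a side note on the propagation point, here is a small Python analogy (the real custom-op callbacks run on the backend's own C++ threads, so this is only an illustration): an exception raised on a spawned thread is never re-raised in the thread that joins it, which is consistent with the failure being swallowed instead of surfacing to the user.

```python
# Python analogy only: an exception raised in a worker thread is printed on
# that thread and then discarded; join() returns normally and the main thread
# keeps going. In C++, an exception escaping a std::thread would instead
# call std::terminate and abort the process.
import threading

def worker():
    raise RuntimeError("simulated failure inside the custom-op worker")

t = threading.Thread(target=worker)
t.start()
t.join()   # no exception is re-raised here
print("main thread continues as if nothing happened")
```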
Hi, I'm investigating this one. With the provided example, I believe the cause of the error is the following, at least in a CPU context (changing the gpu context to cpu in the example above).
@anirudh2290 @wkcn @YutingZhang Finally figured out the reason. Normally, when an exception is thrown in a spawned thread, it should be caught there and rethrown to the waiting thread. But there are still two problems in the exception handling of the custom op.
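For reference, the usual way to surface a spawned thread's exception is to capture it on the worker and re-raise it when the waiting thread joins. A minimal Python sketch of that pattern (an illustration only, not MXNet's actual implementation):

```python
import threading

class PropagatingThread(threading.Thread):
    """Worker thread that captures an exception and re-raises it on join()."""
    def run(self):
        self.exc = None
        try:
            super().run()          # runs the target passed to the constructor
        except Exception as e:
            self.exc = e           # remember the failure instead of losing it

    def join(self, timeout=None):
        super().join(timeout)
        if self.exc is not None:
            raise self.exc         # surface the worker's failure to the caller

def failing_work():
    raise MemoryError("simulated OOM inside the worker")

t = PropagatingThread(target=failing_work)
t.start()
try:
    t.join()
except MemoryError as e:
    print("caught on the waiting thread:", e)
```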
@wkcn @anirudh2290 Like I've mentioned before, adding the on_complete callback in CustomOperator is not a good design, since the call may be skipped when an exception happens. Any suggestions on how to avoid it?
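To make the concern concrete, here is a plain Python sketch (on_complete and the function names below are stand-ins, not the actual CustomOperator code): if the completion callback only runs after the user code returns normally, an exception skips it and anything waiting on the op blocks forever; wrapping the call in try/finally is one way to guarantee the engine is always notified.

```python
# Stand-in sketch: `forward` represents the user's custom-op code and
# `on_complete` represents the engine callback that marks the async op done.

def run_op_unsafe(forward, on_complete):
    forward()          # if this raises, on_complete is never called ...
    on_complete()      # ... and whoever waits on the op hangs forever

def run_op_safe(forward, on_complete):
    try:
        forward()
    finally:
        on_complete()  # always signal completion, even when forward() fails

def failing_forward():
    raise MemoryError("simulated OOM in the custom op")

try:
    run_op_safe(failing_forward, lambda: print("on_complete called"))
except MemoryError:
    print("exception still propagates to the caller")
```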
On CPU I see a crash in sgemm. I don't see the relation to exceptions on CPU.
On GPU this is the segmentation fault that I get. Are we talking about the same problem?
I talked to @anirudh2290 and I understand we are talking about two separate issues here:
@arcadiaphy do you plan to send a fix for the handling of exceptions in custom ops? I can work on the integer overflow problem.
This repro in imperative mode isolates the crash from the exception-handling problem in custom ops:
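The actual snippet is not reproduced above. Purely as an assumed illustration of the kind of call involved (the thread points at sgemm and at integer overflow in the BLAS shape handling), it would be something like the following; the shapes are my guesses and need a machine with a very large amount of memory.

```python
# Assumed sketch, not the original repro: an imperative matrix multiply whose
# element count exceeds the 32-bit integer range (50000 * 50000 = 2.5e9 > 2^31 - 1),
# the kind of shape that can trip up 32-bit index arithmetic in a BLAS call.
import mxnet as mx

n = 50000
a = mx.nd.ones((n, n), dtype='float32')   # roughly 10 GB per matrix
b = mx.nd.ones((n, n), dtype='float32')
c = mx.nd.dot(a, b)
c.wait_to_read()
```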
@larroy Yes, two different issues here. Reproduced the gemm bug too, but I'm focusing on the exception handling issue.
Thanks @arcadiaphy! Nice work! We can do this incrementally. Would you be willing to start with a PR for 1?
@anirudh2290 I have started a PR for 1, but the really tricky one is 2.
Great! Thanks for your analysis!
Nice! I will work on a fix for the tensor shapes in the BLAS engine, unless somebody else has a strong desire to help there.
@mxnet-label-bot add [C++, Backend, Bug, Exception Handling]
When training/running a large neural network with a CustomOp, MXNet can get stuck. My speculation is that if memory management (e.g., releasing/reallocating GPU memory, or raising an "out of memory" error) is needed while running the CustomOp, MXNet can get stuck.
A minimal piece of code to show that a CustomOp can deadlock with memory management (out of memory in this case):
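The original snippet is not reproduced here; the following is a rough sketch of the pattern being described (a custom op plus GPU allocations scaled by the argument to main()), where the op body, names, and buffer sizes are all assumptions rather than the author's code.

```python
# Rough sketch only; the identity op, shapes, and loop below are assumptions.
import mxnet as mx

class Identity(mx.operator.CustomOp):
    def forward(self, is_train, req, in_data, out_data, aux):
        self.assign(out_data[0], req[0], in_data[0])

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        self.assign(in_grad[0], req[0], out_grad[0])

@mx.operator.register("identity")
class IdentityProp(mx.operator.CustomOpProp):
    def list_arguments(self):
        return ['data']

    def list_outputs(self):
        return ['output']

    def infer_shape(self, in_shape):
        return in_shape, [in_shape[0]], []

    def create_operator(self, ctx, shapes, dtypes):
        return Identity()

def main(scale):
    ctx = mx.gpu(0)
    # Push `scale` large buffers through the custom op; with a large enough
    # `scale` this is expected to raise out-of-GPU-memory, but per the issue
    # the process hangs instead.
    outs = []
    for _ in range(scale):
        x = mx.nd.ones((1024, 1024, 256), ctx=ctx)   # ~1 GB per buffer
        outs.append(mx.nd.Custom(x, op_type='identity'))
    for y in outs:
        y.wait_to_read()

main(1)    # works
main(100)  # expected "out of GPU memory", but it just gets stuck
```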
As expected, main(1) should work.
main(100) should give an "out of GPU memory" error. However, it just got stuck.
The real-world problem I met is not just the inability to give an "out of memory" error. It seems MXNet can release/reallocate memory dynamically, but this probably also deadlocks with the CustomOp, so my program, which could fit into GPU memory, can also get stuck.
Tested with a nightly build of mxnet-cu90mkl and Python 3.6.