Segmentation fault when used with PyTorch #173

Open
elistevens opened this issue Apr 27, 2018 · 0 comments


@elistevens

I intermittently get the following segmentation fault when profiling a program that uses PyTorch. I'm not sure whether PyTorch is relevant, since I haven't tried to minimize the program. The program seems to be exiting when the crash occurs, after running for about 5 minutes.

I invoked vmprof like this on Ubuntu 16.04:

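```python
import os
import sys

import vmprof

# LunaTrainingApp (the PyTorch training program being profiled) is
# defined elsewhere in the project.

if __name__ == '__main__':
    outfd = None
    if os.getenv('VMPROF', None):
        PROFILE_FILE = 'vmprof_training.dat'
        flags = os.O_RDWR | os.O_CREAT | os.O_TRUNC
        if sys.platform == 'win32':
            flags |= os.O_BINARY  # no newline translation on Windows
        outfd = os.open(PROFILE_FILE, flags)
        # Sample the stack every 0.01 s (10 ms).
        vmprof.enable(outfd, period=0.01)
    try:
        LunaTrainingApp().main()
    finally:
        if outfd is not None:
            vmprof.disable()  # stop sampling
            os.close(outfd)
```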
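Since no core file can be kept here, one option is to also turn on the standard-library `faulthandler` module at the top of the `__main__` block, so that a fatal signal such as SIGSEGV writes the Python traceback of every thread to a small log file. This is a minimal sketch, not something vmprof provides: the helper `enable_crash_tracebacks` and the default file name `crash_traceback.log` are hypothetical.

```python
import faulthandler


def enable_crash_tracebacks(log_path="crash_traceback.log"):
    """Dump Python tracebacks of all threads to log_path on a fatal signal.

    faulthandler installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and
    SIGILL and writes directly to the file's descriptor, so the returned file
    object must stay open (and referenced) while the handler is enabled.
    """
    # Hypothetical default path; any writable location works.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Keeping the returned file object in a variable for the lifetime of the program matters: if it were garbage-collected and closed, faulthandler would be left writing to a stale descriptor.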

Unfortunately, the core file is larger than the free space on the system's hard drive (I think because of the inflated virtual memory size of CUDA programs). However, I can reproduce the issue and run gdb commands if that would help. Here's the backtrace:

```
#0  access_mem (as=<optimized out>, addr=140733193519104, val=0x7ffc393f6bb0, write=<optimized out>, arg=<optimized out>) at x86_64/Ginit.c:175
#1  0x00007ffff333b9dd in is_plt_entry (c=0x7ffc393f6d50) at x86_64/Gstep.c:43
#2  _ULx86_64_step (cursor=0x7ffc393f6d50) at x86_64/Gstep.c:126
#3  0x00007ffff355918f in vmp_walk_and_record_stack (frame=0x7ffc2eb556a8, result=result@entry=0x7fff95333020, max_depth=max_depth@entry=1019, signal=<optimized out>, signal@entry=1, pc=pc@entry=0)
    at src/vmp_stack.c:312
#4  0x00007ffff355a703 in get_stack_trace (current=current@entry=0xdca665b0, result=result@entry=0x7fff95333020, max_depth=max_depth@entry=1019, pc=pc@entry=0) at src/vmprof_unix.c:493
#5  0x00007ffff355a78f in _vmprof_sample_stack (p=p@entry=0x7fff95333000, tstate=tstate@entry=0xdca665b0, uc=uc@entry=0x7ffc393f7200) at src/vmprof_unix.c:98
#6  0x00007ffff355a912 in sigprof_handler (sig_nr=<optimized out>, info=<optimized out>, ucontext=<optimized out>) at src/vmprof_unix.c:242
#7  <signal handler called>
#8  0x00007fffb6373e2b in __device_stub__ZN5cudnn6detail24bn_fw_tr_1C11_singlereadIfLi512ELb1ELi1ELi2ELi20EEEv17cudnnTensorStructPKT_S2_PS3_PKfS8_ffPfS9_S9_S9_ffNS_15reduced_divisorEiSA_PNS0_19bnFwPersistentStateEifffiffP13cudnnStatus_tb(cudnnTensorStruct const&, float const*, cudnnTensorStruct const&, float*, float const*, float const*, float, float, float*, float*, float*, float*, float, float, cudnn::reduced_divisor&, int, cudnn::reduced_divisor&, cudnn::detail::bnFwPersistentState*, int, float, float, float, int, float, float, cudnnStatus_t*, bool) ()
   from /home/elis/edit/book/.venv/lib/python3.6/site-packages/torch/lib/libATen.so
#9  0x00007fff00020000 in ?? ()
#10 0x000000005daaaaab in ?? ()
#11 0x00007fff24200000 in ?? ()
#12 0x00007ffc00000001 in ?? ()
#13 0x00007fffb638e67c in cudnnBatchNormalizationForwardTraining () from /home/elis/edit/book/.venv/lib/python3.6/site-packages/torch/lib/libATen.so
#14 0x00007fffaefe1d3d in at::native::cudnn_batch_norm(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, double, double) ()
   from /home/elis/edit/book/.venv/lib/python3.6/site-packages/torch/lib/libATen.so
#15 0x00007fffaf254d04 in at::Type::cudnn_batch_norm(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, double, double) const ()
   from /home/elis/edit/book/.venv/lib/python3.6/site-packages/torch/lib/libATen.so
#16 0x00007fffd3ea3746 in torch::autograd::VariableType::cudnn_batch_norm (this=0x16e68f0, input=..., weight=..., bias=..., running_mean=..., running_var=..., training=true,
    exponential_average_factor=0.10000000000000001, epsilon=1.0000000000000001e-05) at torch/csrc/autograd/generated/VariableType.cpp:18662
#17 0x00007fffaefa36a7 in at::native::batch_norm(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, double, double, bool) ()
   from /home/elis/edit/book/.venv/lib/python3.6/site-packages/torch/lib/libATen.so
#18 0x00007fffaf2547b6 in at::Type::batch_norm(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, double, double, bool) const ()
   from /home/elis/edit/book/.venv/lib/python3.6/site-packages/torch/lib/libATen.so
#19 0x00007fffd3e05e69 in torch::autograd::VariableType::batch_norm (this=0x16e68f0, input=..., weight=..., bias=..., running_mean=..., running_var=..., training=true, momentum=0.10000000000000001,
    eps=1.0000000000000001e-05, cudnn_enabled=true) at torch/csrc/autograd/generated/VariableType.cpp:18205
#20 0x00007fffd3f8533e in at::batch_norm (cudnn_enabled=true, eps=1.0000000000000001e-05, momentum=0.10000000000000001, training=true, running_var=..., running_mean=..., bias=..., weight=..., input=...)
    at /pytorch/torch/lib/tmp_install/include/ATen/Functions.h:2993
#21 torch::autograd::dispatch_batch_norm (cudnn_enabled=true, eps=1.0000000000000001e-05, momentum=0.10000000000000001, training=true, running_var=..., running_mean=..., bias=..., weight=..., input=...)
    at torch/csrc/autograd/generated/python_torch_functions_dispatch.h:941
#22 torch::autograd::THPVariable_batch_norm (self=<optimized out>, args=<optimized out>, kwargs=<optimized out>) at torch/csrc/autograd/generated/python_torch_functions.cpp:1419
#23 0x00000000004c4b0b in _PyCFunction_FastCallKeywords ()
#24 0x000000000054f3c4 in ?? ()
#25 0x0000000000553aaf in _PyEval_EvalFrameDefault ()
#26 0x000000000054efc1 in ?? ()
#27 0x000000000054f24d in ?? ()
#28 0x0000000000553aaf in _PyEval_EvalFrameDefault ()
#29 0x000000000054e4c8 in ?? ()
#30 0x00000000005582c2 in _PyFunction_FastCallDict ()
#31 0x0000000000459c11 in _PyObject_Call_Prepend ()
#32 0x000000000045969e in PyObject_Call ()
#33 0x0000000000552029 in _PyEval_EvalFrameDefault ()
#34 0x000000000054efc1 in ?? ()
#35 0x00000000005581e9 in _PyFunction_FastCallDict ()
#36 0x0000000000459c11 in _PyObject_Call_Prepend ()
#37 0x000000000045969e in PyObject_Call ()
#38 0x00000000004e050b in ?? ()
#39 0x0000000000459893 in _PyObject_FastCallDict ()
#40 0x000000000054f117 in ?? ()
#41 0x0000000000553aaf in _PyEval_EvalFrameDefault ()
#42 0x000000000054e4c8 in ?? ()
#43 0x00000000005582c2 in _PyFunction_FastCallDict ()
#44 0x0000000000459c11 in _PyObject_Call_Prepend ()
#45 0x000000000045969e in PyObject_Call ()
#46 0x0000000000552029 in _PyEval_EvalFrameDefault ()
#47 0x000000000054efc1 in ?? ()
#48 0x00000000005581e9 in _PyFunction_FastCallDict ()
#49 0x0000000000459c11 in _PyObject_Call_Prepend ()
#50 0x000000000045969e in PyObject_Call ()
#51 0x00000000004e050b in ?? ()
#52 0x0000000000459893 in _PyObject_FastCallDict ()
#53 0x000000000054f117 in ?? ()
#54 0x0000000000553aaf in _PyEval_EvalFrameDefault ()
#55 0x000000000054e4c8 in ?? ()
#56 0x00000000005582c2 in _PyFunction_FastCallDict ()
#57 0x0000000000459c11 in _PyObject_Call_Prepend ()
#58 0x000000000045969e in PyObject_Call ()
#59 0x0000000000552029 in _PyEval_EvalFrameDefault ()
#60 0x000000000054efc1 in ?? ()
#61 0x00000000005581e9 in _PyFunction_FastCallDict ()
#62 0x0000000000459c11 in _PyObject_Call_Prepend ()
#63 0x000000000045969e in PyObject_Call ()
#64 0x00000000004e050b in ?? ()
#65 0x0000000000459893 in _PyObject_FastCallDict ()
#66 0x000000000054f117 in ?? ()
#67 0x0000000000553aaf in _PyEval_EvalFrameDefault ()
#68 0x000000000054e4c8 in ?? ()
#69 0x00000000005582c2 in _PyFunction_FastCallDict ()
#70 0x0000000000459c11 in _PyObject_Call_Prepend ()
#71 0x000000000045969e in PyObject_Call ()
#72 0x0000000000552029 in _PyEval_EvalFrameDefault ()
#73 0x000000000054efc1 in ?? ()
#74 0x00000000005581e9 in _PyFunction_FastCallDict ()
#75 0x0000000000459c11 in _PyObject_Call_Prepend ()
#76 0x000000000045969e in PyObject_Call ()
#77 0x00000000004e050b in ?? ()
#78 0x0000000000459893 in _PyObject_FastCallDict ()
#79 0x000000000054f117 in ?? ()
#80 0x0000000000553aaf in _PyEval_EvalFrameDefault ()
#81 0x000000000054e4c8 in ?? ()
#82 0x00000000005582c2 in _PyFunction_FastCallDict ()
#83 0x0000000000459c11 in _PyObject_Call_Prepend ()
#84 0x000000000045969e in PyObject_Call ()
#85 0x0000000000552029 in _PyEval_EvalFrameDefault ()
#86 0x000000000054efc1 in ?? ()
#87 0x0000000000558146 in _PyFunction_FastCallDict ()
#88 0x0000000000459c11 in _PyObject_Call_Prepend ()
#89 0x000000000045969e in PyObject_Call ()
#90 0x00000000004e050b in ?? ()
#91 0x000000000045969e in PyObject_Call ()
#92 0x0000000000552029 in _PyEval_EvalFrameDefault ()
#93 0x000000000054efc1 in ?? ()
#94 0x000000000054ffee in PyEval_EvalCodeEx ()
#95 0x000000000048b86d in ?? ()
#96 0x000000000045969e in PyObject_Call ()
#97 0x0000000000552029 in _PyEval_EvalFrameDefault ()
#98 0x000000000054e4c8 in ?? ()
#99 0x000000000054f4f6 in ?? ()
#100 0x0000000000553aaf in _PyEval_EvalFrameDefault ()
#101 0x000000000054e4c8 in ?? ()
#102 0x000000000054f4f6 in ?? ()
#103 0x0000000000553aaf in _PyEval_EvalFrameDefault ()
#104 0x000000000054e4c8 in ?? ()
#105 0x00000000005582c2 in _PyFunction_FastCallDict ()
#106 0x0000000000459c11 in _PyObject_Call_Prepend ()
#107 0x000000000045969e in PyObject_Call ()
#108 0x000000000058e2c2 in ?? ()
#109 0x00007ffff7bbd7fc in start_thread (arg=0x7ffc393fe700) at pthread_create.c:465
#110 0x00007ffff6d44b5f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
```
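In gdb's `bt` output, the `<signal handler called>` line separates the frames run by the signal handler (here vmprof's `sigprof_handler` calling into libunwind, frames #0 to #6) from the frames of the code that was interrupted (here cuDNN batch-norm code in `libATen.so`, frames #8 onward). When comparing several crashes, a small helper can pull those two halves apart. This is a sketch assuming the plain-text `bt` format shown above; `split_at_signal_handler` is a hypothetical name, not part of gdb or vmprof.

```python
def split_at_signal_handler(backtrace):
    """Split gdb `bt` output at the `<signal handler called>` frame.

    Returns (handler_frames, interrupted_frames): the frames executing inside
    the signal handler, and the frames of the code that was running when the
    signal arrived. Lines that gdb wraps onto the next line (they do not
    start with '#') are folded into the preceding frame. If there is no
    signal-handler marker, all frames are returned as handler_frames.
    """
    frames = []
    for line in backtrace.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            frames.append(stripped)
        elif frames:
            # Continuation such as "at src/..." or "from /path/lib.so".
            frames[-1] += " " + stripped
    for i, frame in enumerate(frames):
        if "<signal handler called>" in frame:
            return frames[:i], frames[i + 1:]
    return frames, []
```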

```
$ .venv/bin/python
Python 3.6.3 (default, Oct 3 2017, 21:45:48)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
```

I also see `vmprof-0.4.12.dist-info/` in my site-packages, so I assume that's the vmprof version I'm using.
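Rather than inferring the version from the `dist-info` directory name, the virtualenv's own interpreter (`.venv/bin/python`) can report the installed vmprof distribution's version. A minimal sketch: the helper name `installed_version` is hypothetical; it uses `importlib.metadata` on Python 3.8+ and falls back to setuptools' `pkg_resources` on older interpreters such as the Python 3.6 shown above.

```python
try:
    # Python 3.8+
    from importlib.metadata import PackageNotFoundError, version
except ImportError:
    # Older interpreters (e.g. Python 3.6) need setuptools' pkg_resources.
    import pkg_resources

    def installed_version(dist_name):
        """Return the installed version of dist_name, or None if absent."""
        try:
            return pkg_resources.get_distribution(dist_name).version
        except pkg_resources.DistributionNotFound:
            return None
else:
    def installed_version(dist_name):
        """Return the installed version of dist_name, or None if absent."""
        try:
            return version(dist_name)
        except PackageNotFoundError:
            return None


print("vmprof", installed_version("vmprof"))
```

Running it with the same interpreter that runs the profiled program matters, since a different Python may see a different site-packages.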
