[CUDA] Conv3DTranspose Error with Illegal Instruction #70838

Open
jwnhy opened this issue Jan 15, 2025 · 1 comment
jwnhy commented Jan 15, 2025

Describe the Bug

The following code triggers an illegal instruction exception.

import paddle
model = paddle.nn.Conv3DTranspose(
    9, 1, kernel_size=[8, 8, 8], stride=[4, 3, 1],
    padding=[8, 0, 8], dilation=[2, 1, 2])
tensor = paddle.rand([9, 9, 9, 9, 9])
model(tensor)
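
As a sanity check on the reproducer, the sketch below computes the spatial output size these arguments should produce. It assumes the output-size formula given in the Paddle `Conv3DTranspose` documentation with no `output_padding` or explicit `output_size`; the helper name `conv_transpose_output_size` is hypothetical and not part of Paddle. Every dimension comes out positive, which suggests (but does not prove) that the arguments are valid and that the failure lies in the selected cuDNN kernel rather than in the shapes.

```python
def conv_transpose_output_size(in_size, kernel, stride, padding, dilation):
    """Output length of one spatial dimension of a transposed convolution.

    Hypothetical helper; assumes no output_padding and no explicit output_size.
    """
    return (in_size - 1) * stride - 2 * padding + dilation * (kernel - 1) + 1


# Values taken from the reproducer above: input [N, C, D, H, W] = [9, 9, 9, 9, 9].
spatial_in = [9, 9, 9]
kernel_size = [8, 8, 8]
stride = [4, 3, 1]
padding = [8, 0, 8]
dilation = [2, 1, 2]

spatial_out = [
    conv_transpose_output_size(i, k, s, p, d)
    for i, k, s, p, d in zip(spatial_in, kernel_size, stride, padding, dilation)
]

# Batch size 9 and out_channels 1 come from the reproducer.
full_shape = [9, 1] + spatial_out
print(full_shape)  # [9, 1, 31, 32, 7]
```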

The error message is as follows:

W0115 10:52:53.554724 2022101 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 9.0, Driver API Version: 12.6, Runtime API Version: 12.3
W0115 10:52:53.555317 2022101 gpu_resources.cc:164] device: 0, cuDNN Version: 9.0.
Traceback (most recent call last):
  File "/home/jwnhy/gpu_fuzz/gen/poc4.py", line 7, in <module>
    model(tensor)
  File "/home/jwnhy/miniconda3/envs/gpu-paddle/lib/python3.12/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jwnhy/miniconda3/envs/gpu-paddle/lib/python3.12/site-packages/paddle/nn/layer/conv.py", line 1219, in forward
    out = F.conv3d_transpose(
          ^^^^^^^^^^^^^^^^^^^
  File "/home/jwnhy/miniconda3/envs/gpu-paddle/lib/python3.12/site-packages/paddle/nn/functional/conv.py", line 1723, in conv3d_transpose
    pre_bias = _C_ops.conv3d_transpose(
               ^^^^^^^^^^^^^^^^^^^^^^^^
OSError: (External) CUDNN error(5000), CUDNN_STATUS_EXECUTION_FAILED.
  [Hint: Please search for the error code(5000) on website (https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnStatus_t) to get Nvidia's official solution and advice about CUDNN Error.] (at ../paddle/phi/kernels/gpudnn/conv_cudnn_v7.h:834)

compute-sanitizer trace:

========= Illegal instruction
=========     at sm90_xmma_dgrad_implicit_gemm_indexed_f32f32_tf32f32_f32_nhwckrsc_nhwc_tilesize256x64x32_warpgroupsize1x1x1_g1_strided_execute_kernel__5x_cudnn+0x6bf0
=========     by thread (0,0,0) in block (8,0,0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2dfec3]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x1cb78b8]
=========                in /home/jwnhy/miniconda3/envs/gpu-paddle/lib/python3.12/site-packages/paddle/base/../../../../libcudnn_engines_precompiled.so.9.3.0
=========     Host Frame: [0x1d1b4cf]
=========                in /home/jwnhy/miniconda3/envs/gpu-paddle/lib/python3.12/site-packages/paddle/base/../../../../libcudnn_engines_precompiled.so.9.3.0
=========     Host Frame: [0x12ed011]
=========                in /home/jwnhy/miniconda3/envs/gpu-paddle/lib/python3.12/site-packages/paddle/base/../../../../libcudnn_engines_precompiled.so.9.3.0
=========     Host Frame: [0x145a89f]
=========                in /home/jwnhy/miniconda3/envs/gpu-paddle/lib/python3.12/site-packages/paddle/base/../../../../libcudnn_engines_precompiled.so.9.3.0
=========     Host Frame: [0x101f70c]
=========                in /home/jwnhy/miniconda3/envs/gpu-paddle/lib/python3.12/site-packages/paddle/base/../../../../libcudnn_engines_precompiled.so.9.3.0
=========     Host Frame: [0x101fe09]
=========                in /home/jwnhy/miniconda3/envs/gpu-paddle/lib/python3.12/site-packages/paddle/base/../../../../libcudnn_engines_precompiled.so.9.3.0
=========     Host Frame: [0x4a33ad]
=========                in /home/jwnhy/miniconda3/envs/gpu-paddle/lib/python3.12/site-packages/paddle/base/../../../../libcudnn_engines_precompiled.so.9.3.0
=========     Host Frame: [0x4a3888]
=========                in /home/jwnhy/miniconda3/envs/gpu-paddle/lib/python3.12/site-packages/paddle/base/../../../../libcudnn_engines_precompiled.so.9.3.0
=========     Host Frame: [0x4b835b]
=========                in /home/jwnhy/miniconda3/envs/gpu-paddle/lib/python3.12/site-packages/paddle/base/../../../../libcudnn_engines_precompiled.so.9.3.0
=========     Host Frame:cudnn::backend::execute(cudnnContext*, cudnn::backend::ExecutionPlan const&, cudnn::backend::VariantPack&) [0x134c78]
=========                in /home/jwnhy/miniconda3/envs/gpu-paddle/lib/python3.12/site-packages/paddle/base/../../../../libcudnn_graph.so.9.3.0
=========     Host Frame: [0x36351]
=========                in /home/jwnhy/miniconda3/envs/gpu-paddle/lib/python3.12/site-packages/paddle/base/../../../../libcudnn_cnn.so.9.3.0
=========     Host Frame: [0x1f9fa]
=========                in /home/jwnhy/miniconda3/envs/gpu-paddle/lib/python3.12/site-packages/paddle/base/../../../../libcudnn_cnn.so.9.3.0
=========     Host Frame:cudnnConvolutionBackwardData [0x4262f]
=========                in /home/jwnhy/miniconda3/envs/gpu-paddle/lib/python3.12/site-packages/paddle/base/../../../../libcudnn_cnn.so.9.3.0
=========     Host Frame:std::_Function_handler<void (void*), phi::ConvRunner<float, (phi::ConvKind)2>::Apply(phi::GPUContext const&, phi::ConvArgsBase<cudnnContext*, cudnnDataType_t> const&, phi::SearchResult<cudnnConvolutionBwdDataAlgo_t> const&, float const*, float const*, float*, int, int, int, int, unsigned long, phi::DnnWorkspaceHandle*, bool)::{lambda(void*)#1}>::_M_invoke(std::_Any_data const&, void*&&) [0x5274427]
=========                in /home/jwnhy/miniconda3/envs/gpu-paddle/lib/python3.12/site-packages/paddle/base/../libs/libphi_core.so
=========     Host Frame:void phi::ConvTransposeRawGPUDNNKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::vector<int, std::allocator<int> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, phi::DenseTensor*) [0x52e77b6]
=========                in /home/jwnhy/miniconda3/envs/gpu-paddle/lib/python3.12/site-packages/paddle/base/../libs/libphi_core.so
=========     Host Frame:paddle::experimental::conv3d_transpose(paddle::Tensor const&, paddle::Tensor const&, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::vector<int, std::allocator<int> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) [0x2d13ab6]
=========                in /home/jwnhy/miniconda3/envs/gpu-paddle/lib/python3.12/site-packages/paddle/base/../libs/libphi_core.so
=========     Host Frame:conv3d_transpose_ad_func(paddle::Tensor const&, paddle::Tensor const&, std::vector<int, std::allocator<int> >, std::vector<int, std::allocator<int> >, std::vector<int, std::allocator<int> >, std::vector<int, std::allocator<int> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::vector<int, std::allocator<int> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [0x7012611]

Environment:

[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.3.4.1
[pip3] nvidia-cuda-cupti-cu12==12.3.101
[pip3] nvidia-cuda-nvrtc-cu12==12.3.107
[pip3] nvidia-cuda-runtime-cu12==12.3.101
[pip3] nvidia-cudnn-cu12==9.0.0.312
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-nccl-cu12==2.19.3
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.4.127
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.3.4.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.3.101                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.3.107                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.3.101                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.0.0.312                pypi_0    pypi
[conda] nvidia-cufft-cu12         11.2.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.5.147               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.6.1.9                 pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.3.1.170               pypi_0    pypi
[conda] nvidia-nccl-cu12          2.19.3                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.6.85                  pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.4.127                 pypi_0    pypi

The device is an H100.
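
One detail in the report may be worth checking during triage: the package list shows `nvidia-cudnn-cu12==9.0.0.312`, while the compute-sanitizer backtrace loads `libcudnn_engines_precompiled.so.9.3.0`. The sketch below only compares those two strings, copied from the report; the helper `major_minor` is hypothetical, and whether the difference is related to the illegal instruction is not established here.

```python
import re

# Both strings are copied verbatim from the report.
pip_line = "[pip3] nvidia-cudnn-cu12==9.0.0.312"
so_path = "libcudnn_engines_precompiled.so.9.3.0"


def major_minor(version: str) -> tuple:
    """Return the (major, minor) components of a dotted version string."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


# Version pinned by pip, taken from the text after '=='.
pip_version = re.search(r"==([\d.]+)$", pip_line).group(1)
# Version embedded in the shared-library file name, after '.so.'.
so_version = re.search(r"\.so\.([\d.]+)$", so_path).group(1)

versions_match = major_minor(pip_version) == major_minor(so_version)
print(pip_version, so_version, versions_match)
```

If the two versions differ, listing the cuDNN libraries actually present in the environment would show which build is loaded at runtime.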

Additional Supplementary Information

No response

liym27 (Contributor) commented Jan 15, 2025

Thanks for the feedback. We will investigate the issue.
