Transitioned dpctl to require DPC++ 2023, removed host related functions, fixed linker crash, enabled SyclKernel.max_sub_group_size property #1028

Merged

@oleksandr-pavlyk oleksandr-pavlyk commented Dec 22, 2022

The crash manifested itself as a "relocation truncated to fit" linker error, and was caused by the large size of the device code produced in a debug build. The suggested solution is to use the link option -fsycl-link-huge-device-code.

See https://github.com/intel/llvm/blob/sycl/sycl/doc/UsersManual.md#link-options

FAILED: dpctl/tensor/_tensor_impl.cpython-39-x86_64-linux-gnu.so
: && /opt/intel/oneapi/compiler/2023.0.0/linux/bin/icpx -fPIC -fsycl  -O3 -Wall -Wextra -Winit-self -Wunused-function -Wuninitialized -Wmissing-declarations -fdiagnostics-color=auto -fstack-protector -fstack-protector-all -fpic -fPIC -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -fno-strict-overflow -fno-delete-null-pointer-checks -fsycl  -g -Wall -Wextra -Winit-self -Wunused-function -Wuninitialized -Wmissing-declarations -fdiagnostics-color=auto -fstack-protector -fstack-protector-all -fpic -fPIC -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -fno-strict-overflow -fno-delete-null-pointer-checks -fsycl  -O0 -ggdb3 -DDEBUG  -fsycl-device-code-split=per_kernel -shared  -o dpctl/tensor/_tensor_impl.cpython-39-x86_64-linux-gnu.so dpctl/tensor/CMakeFiles/_tensor_impl.dir/libtensor/source/tensor_py.cpp.o dpctl/tensor/CMakeFiles/_tensor_impl.dir/libtensor/source/simplify_iteration_space.cpp.o dpctl/tensor/CMakeFiles/_tensor_impl.dir/libtensor/source/copy_and_cast_usm_to_usm.cpp.o dpctl/tensor/CMakeFiles/_tensor_impl.dir/libtensor/source/copy_numpy_ndarray_into_usm_ndarray.cpp.o dpctl/tensor/CMakeFiles/_tensor_impl.dir/libtensor/source/copy_for_reshape.cpp.o dpctl/tensor/CMakeFiles/_tensor_impl.dir/libtensor/source/linear_sequences.cpp.o dpctl/tensor/CMakeFiles/_tensor_impl.dir/libtensor/source/eye_ctor.cpp.o dpctl/tensor/CMakeFiles/_tensor_impl.dir/libtensor/source/full_ctor.cpp.o dpctl/tensor/CMakeFiles/_tensor_impl.dir/libtensor/source/triul_ctor.cpp.o dpctl/tensor/CMakeFiles/_tensor_impl.dir/libtensor/source/device_support_queries.cpp.o  -Wl,-rpath,::::::: && :
/lib/x86_64-linux-gnu/crti.o: in function `_init':
(.init+0xb): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against undefined symbol `__gmon_start__'
/tmp/icpx-a02203/_tensor_impl-411e70.o: in function `sycl.descriptor_reg':
offload.wrapper.object:(.text.startup+0x4): relocation truncated to fit: R_X86_64_PC32 against `.data.rel.ro'
/tmp/icpx-a02203/_tensor_impl-411e70.o: in function `sycl.descriptor_unreg':
offload.wrapper.object:(.text.startup+0x14): relocation truncated to fit: R_X86_64_PC32 against `.data.rel.ro'
/tmp/icpx-a02203/_tensor_impl-411e70.o:(.eh_frame+0x20): relocation truncated to fit: R_X86_64_PC32 against `.text.startup'
/tmp/icpx-a02203/_tensor_impl-411e70.o:(.eh_frame+0x38): relocation truncated to fit: R_X86_64_PC32 against `.text.startup'
/usr/lib/gcc/x86_64-linux-gnu/9/crtbeginS.o: in function `deregister_tm_clones':
crtstuff.c:(.text+0x3): relocation truncated to fit: R_X86_64_PC32 against `.tm_clone_table'
crtstuff.c:(.text+0xa): relocation truncated to fit: R_X86_64_PC32 against symbol `__TMC_END__' defined in .data section in dpctl/tensor/_tensor_impl.cpython-39-x86_64-linux-gnu.so
crtstuff.c:(.text+0x16): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against undefined symbol `_ITM_deregisterTMCloneTable'
/usr/lib/gcc/x86_64-linux-gnu/9/crtbeginS.o: in function `register_tm_clones':
crtstuff.c:(.text+0x33): relocation truncated to fit: R_X86_64_PC32 against `.tm_clone_table'
crtstuff.c:(.text+0x3a): relocation truncated to fit: R_X86_64_PC32 against symbol `__TMC_END__' defined in .data section in dpctl/tensor/_tensor_impl.cpython-39-x86_64-linux-gnu.so
crtstuff.c:(.text+0x57): additional relocation overflows omitted from the output
dpctl/tensor/_tensor_impl.cpython-39-x86_64-linux-gnu.so: PC-relative offset overflow in PLT entry for `_ZSt10_ConstructIN4sycl3_V15eventEJEEvPT_DpOT0_'
icpx: error: linker command failed with exit code 1 (use -v to see invocation)
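
For readers unfamiliar with this class of error: an `R_X86_64_PC32` relocation stores the value S + A - P (symbol address plus addend minus the address of the place being patched) in a signed 32-bit field, so the linker reports "relocation truncated to fit" when that value falls outside roughly ±2 GiB. The following is a minimal sketch of that range check. The addresses are hypothetical and are not taken from the log above, and the helper names are illustrative only.

```python
# Sketch of the range check behind "relocation truncated to fit" for
# R_X86_64_PC32. All addresses used here are hypothetical.

INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1
GIB = 2**30


def pc32_value(symbol_addr, place_addr, addend=0):
    """Value an R_X86_64_PC32 relocation must store: S + A - P."""
    return symbol_addr + addend - place_addr


def fits_pc32(symbol_addr, place_addr, addend=0):
    """True if the relocation value fits in a signed 32-bit field."""
    return INT32_MIN <= pc32_value(symbol_addr, place_addr, addend) <= INT32_MAX


if __name__ == "__main__":
    # Code and data 1 GiB apart: the displacement is representable.
    print(fits_pc32(1 * GIB, 0))
    # A very large section pushes the target 3 GiB away: overflow.
    print(fits_pc32(3 * GIB, 0))
```

This illustrates why a large embedded device-code image can trigger the error: it increases the distance between sections that reference each other with 32-bit PC-relative offsets.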

Removed DPCTLDevice_IsHost, DPCTLContext_IsHost, and DPCTLHostSelector_Create, and from the Python API removed dpctl.select_host_device, dpctl.has_host_device, dpctl.SyclDevice.is_host, and dpctl.SyclDevice.has_aspect_host, as well as the host backend and the host_device device type.
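
Downstream code may want to confirm that an installed dpctl no longer exposes the removed Python names. The sketch below checks a module object for the attribute names listed above. The helper `leftover_host_api` is hypothetical and not part of dpctl, and the import of dpctl is guarded so the script also runs where dpctl is not installed.

```python
# Sketch: report which of the removed host-related Python names are still
# present on a module. The helper name is hypothetical, not a dpctl API.

REMOVED_MODULE_ATTRS = ("select_host_device", "has_host_device")
REMOVED_DEVICE_ATTRS = ("is_host", "has_aspect_host")


def leftover_host_api(module):
    """Return the removed names that `module` still exposes."""
    leftovers = [name for name in REMOVED_MODULE_ATTRS if hasattr(module, name)]
    device_cls = getattr(module, "SyclDevice", None)
    if device_cls is not None:
        leftovers += [
            "SyclDevice." + name
            for name in REMOVED_DEVICE_ATTRS
            if hasattr(device_cls, name)
        ]
    return leftovers


if __name__ == "__main__":
    try:
        import dpctl
    except ImportError:
        print("dpctl is not installed; nothing to check")
    else:
        found = leftover_host_api(dpctl)
        print(found if found else "no host-related API found")
```

An empty result indicates that none of the listed names are present on the checked module.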

Also added support for dpctl.SyclKernel.max_sub_group_size, which is enabled in DPC++ 2023.

Removed use of __SYCL_COMPILER_2023_SWITCHOVER preprocessor constant and introduced __SYCL_COMPILER_VERSION_REQUIRED constant in Config/dpctl_config.h. Introduced static_assert in libsyclinterface/source/*.cpp files that #include <CL/sycl.hpp> to ensure that the compiler meets the minimum required version.

  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • If this PR is a work in progress, are you filing the PR as a draft?

@oleksandr-pavlyk
Collaborator Author

@mdtoguchi @sergey-semenov I applied the suggested solution (learned in CMPLRLLVM-39897). Are there alternative ways to address the problem? Would splitting the library into several smaller SO files work as well?

@github-actions

Array API standard conformance tests for dpctl=0.14.1dev0=py310h76be34b_47 ran successfully.
Passed: 33
Failed: 801
Skipped: 280

@mdtoguchi

> @mdtoguchi @sergey-semenov I applied the suggested solution (learned in CMPLRLLVM-39897). Are there alternative ways to address the problem? Would splitting the library into several smaller SO files work as well?

@oleksandr-pavlyk, breaking it up into separate .so files should also work, as long as the resulting size of the device code allows all required sections to be 'visible' at link time.

DPC++ 2023 has introduced a regression where the aspect returns 1
even for devices without fp64 aspect.
… since host device has been removed from 2023 compiler
With 2022.2, a non-zero value was being returned for an interoperability
kernel compiled for a GPU device, but it now returns zero with the 2023.0
compiler.
Debug build was used previously to accelerate the build, but
due to the growth of the device section, the debug build now takes
longer than the release build.
@oleksandr-pavlyk oleksandr-pavlyk force-pushed the fix-debug-build-linker-relocation-truncated-to-fit branch from 7b3e89f to 37a3fe7 Compare December 23, 2022 17:32
Moved host operations done after kernel submission to improve
chance of detecting non-complete status of kernel submission.
Host device has been removed from DPC++ compiler in 2023.0.0

Also removed support for host device type and for backend host.
Removed `DPCTLDevice_IsHost`, `DPCTLHostSelector_Create`,
`DPCTLContext_IsHost` and use of is_host from Python API.
@oleksandr-pavlyk oleksandr-pavlyk force-pushed the fix-debug-build-linker-relocation-truncated-to-fit branch from 37a3fe7 to 9aa909f Compare December 24, 2022 12:12
@oleksandr-pavlyk oleksandr-pavlyk changed the title Used new in DPC++ 2023 option to fix linker crash Transitioned dpctl to require DPC++ 2023, remove host related functions, fixed linker crash Dec 24, 2022
@oleksandr-pavlyk oleksandr-pavlyk changed the title Transitioned dpctl to require DPC++ 2023, remove host related functions, fixed linker crash Transitioned dpctl to require DPC++ 2023, removed host related functions, fixed linker crash Dec 24, 2022
@oleksandr-pavlyk oleksandr-pavlyk changed the title Transitioned dpctl to require DPC++ 2023, removed host related functions, fixed linker crash Transitioned dpctl to require DPC++ 2023, removed host related functions, fixed linker crash, enabled SyclKernel.max_sub_group_size property Dec 24, 2022
@github-actions

Array API standard conformance tests for dpctl=0.14.1dev0=py310h76be34b_63 ran successfully.
Passed: 33
Failed: 801
Skipped: 280

@github-actions

Array API standard conformance tests for dpctl=0.14.1dev0=py310h76be34b_68 ran successfully.
Passed: 33
Failed: 801
Skipped: 280

@oleksandr-pavlyk oleksandr-pavlyk force-pushed the fix-debug-build-linker-relocation-truncated-to-fit branch from b5f0f10 to 4b0f484 Compare December 24, 2022 22:25
@github-actions

Array API standard conformance tests for dpctl=0.14.1dev0=py310h76be34b_69 ran successfully.
Passed: 33
Failed: 801
Skipped: 280

@github-actions

Array API standard conformance tests for dpctl=0.14.1dev0=py310h76be34b_71 ran successfully.
Passed: 33
Failed: 801
Skipped: 280

Comment on lines +148 to +149
ASSERT_TRUE(add_private_mem_sz >= 0);
ASSERT_TRUE(axpy_private_mem_sz >= 0);
Collaborator Author

This is actually an improvement in the DPC++ runtime, not a regression. Neither of the kernels uses private memory.

@oleksandr-pavlyk oleksandr-pavlyk merged commit e6da513 into master Dec 26, 2022
@oleksandr-pavlyk oleksandr-pavlyk deleted the fix-debug-build-linker-relocation-truncated-to-fit branch December 26, 2022 15:06
@github-actions

Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞
