Speed up custom reductions #1932
The function used to perform the custom reduction sequentially in a single work-item (the leader of the work-group). It now reduces cooperatively for a few iterations, and the leading work-item then processes the remaining non-reduced elements sequentially.

custom_reduce_over_group got sped up by about a factor of 3. The reduction kernel timing shown by

```
unitrace -d -v -i 20 python -c "import dpctl.tensor as dpt; dpt.min(dpt.ones(10**7, dtype=dpt.float32)).sycl_queue.wait()"
```

is now on par with (less than 10% slower than) the int32 kernel, which uses the built-in sycl::reduce_over_group:

```
unitrace -d -v -i 20 python -c "import dpctl.tensor as dpt; dpt.min(dpt.ones(10**7, dtype=dpt.int32)).sycl_queue.wait()"
```
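The two-phase scheme described above can be sketched as a host-side model in plain C++. This is not the dpctl source: the name `reduce_group_model`, the constant `low_bound`, and the use of a `std::vector` to stand in for work-group local memory are all assumptions made for illustration. The inner `for` loop plays the role of work-items acting in parallel, and the value 8 mirrors the lower bound mentioned elsewhere in this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical constant: once this many (or fewer) partial results remain,
// the cooperative phase stops and the leader finishes the job.
constexpr std::size_t low_bound = 8;

// Hypothetical host-side model of a work-group reduction with a custom
// binary operator. local_mem holds one value per simulated work-item and
// must be non-empty.
template <typename T, typename Op>
T reduce_group_model(std::vector<T> local_mem, Op op)
{
    assert(!local_mem.empty());
    std::size_t n_active = local_mem.size();

    // Cooperative phase: each iteration roughly halves the number of
    // partial results. On a device every lid would be a separate work-item.
    while (n_active > low_bound) {
        const std::size_t half = (n_active + 1) / 2; // rounds up for odd sizes
        for (std::size_t lid = 0; lid + half < n_active; ++lid) {
            local_mem[lid] = op(local_mem[lid], local_mem[lid + half]);
        }
        // A work-group barrier would be required here on a device.
        n_active = half;
    }

    // Sequential phase: the leading work-item combines what is left.
    T result = local_mem[0];
    for (std::size_t i = 1; i < n_active; ++i) {
        result = op(result, local_mem[i]);
    }
    return result;
}
```

For a group of 256 simulated work-items the cooperative phase runs five iterations (256, 128, 64, 32, 16, 8), leaving eight partial results for the sequential tail instead of 256.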
Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_324 ran successfully.
Doing so exactly recovers the behavior of sub_group::load<vec_sz>, sub_group::store<vec_sz> and eliminates warnings with 2025.1 and SYCLOS. This change also enables use of group_load and group_store for DPC++ compilers with `__SYCL_MAJOR_VERSION >= 8u`, which includes the oneAPI DPC++ 2025.0.x compiler and the SYCLOS bundle.
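A version guard of the kind described in this commit message might look like the sketch below. Only the macro `__SYCL_MAJOR_VERSION` and the threshold `8u` come from the text above; the names `group_load_store_enabled` and `use_group_load_store` are hypothetical and are not claimed to match the dpctl headers.

```cpp
// Hypothetical helper: true when the SYCL major version is high enough
// for the group_load/group_store code path (threshold taken from the text).
constexpr bool group_load_store_enabled(unsigned sycl_major_version)
{
    return sycl_major_version >= 8u;
}

#if defined(__SYCL_MAJOR_VERSION)
// Compiling with DPC++: decide based on the reported major version.
constexpr bool use_group_load_store =
    group_load_store_enabled(__SYCL_MAJOR_VERSION);
#else
// Macro not defined (not a DPC++ compiler): assume the path is disabled.
constexpr bool use_group_load_store = false;
#endif
```

Kernel code could then branch on `use_group_load_store` with `if constexpr`, keeping the other load/store path for older compilers.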
e8c10f1 to 03910f3
Factor out bounds as constexpr values, reused between the power-of-two branch and the non-power-of-two branch. Lowered the lower bound from 32 to 8 based on performance testing on PVC and Iris Xe.
…_over_group function
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_325 ran successfully.
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_326 ran successfully.
This brings a great performance improvement for reductions not using sycl::reduce_over_group.
This LGTM, thank you @oleksandr-pavlyk!
The function used to perform the custom reduction sequentially
in a single work-item (the leader of the work-group).
It now reduces cooperatively for a few iterations, and
the leading work-item then processes the remaining
non-reduced elements sequentially.
custom_reduce_over_group got sped up by about a factor of 3.
The float32 reduction kernel now shows timing
on par with (less than 10% slower than) the int32 kernel, which uses
the built-in sycl::reduce_over_group.
Also fixed uses of group_load/group_store and enabled them for 2025.0 compiler. SYCLOS warnings should be gone now.