Speed up custom reductions #1932

oleksandr-pavlyk · 2024-12-12T04:14:25Z

Previously, the function performed the custom reduction sequentially in a single
work-item (the leader of the work-group).

It now reduces cooperatively for a few iterations, and then
processes the remaining non-reduced elements sequentially
in the leading work-item.

As a result, custom_reduce_over_group is about 3x faster.
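
To illustrate the idea, here is a minimal host-side sketch in Python that simulates the two strategies. It is not the code from sycl_utils.hpp: the names `leader_only_reduce`, `cooperative_reduce` and the `low_bound` cutoff are hypothetical, the work-group size is assumed to be a power of two, and the inner `for` loop stands in for work-items that would run concurrently between group barriers on a device.

```python
from functools import reduce


def leader_only_reduce(values, op):
    # Previous strategy: a single work-item folds every element in order.
    return reduce(op, values)


def cooperative_reduce(values, op, low_bound=8):
    # Hypothetical simulation; `low_bound` is an assumed cutoff below which
    # the leader work-item takes over.
    n = len(values)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("this sketch assumes a power-of-two group size")
    local = list(values)  # stands in for work-group local memory
    active = n
    steps = 0
    while active > low_bound:
        stride = active // 2
        # On a device each iteration below is a separate work-item, and a
        # group barrier separates consecutive steps.
        for item_id in range(stride):
            local[item_id] = op(local[item_id], local[item_id + stride])
        active = stride
        steps += 1
    # The leader folds the remaining elements sequentially.
    return reduce(op, local[:active]), steps


data = [(i * 37) % 101 for i in range(256)]
result, steps = cooperative_reduce(data, max)
```

In this model a group of 256 elements needs 5 cooperative steps plus 7 sequential operations on the leader, instead of 255 sequential operations. The reordering is only valid when the operation is associative and commutative, as `max` and `min` are.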

With this change, the timing of the reduction kernel reported by

unitrace -d -v -i 20 python -c "import dpctl.tensor as dpt; dpt.min(dpt.ones(10**7, dtype=dpt.float32)).sycl_queue.wait()"

is on par with (less than 10% slower than) the int32 kernel, which uses
the built-in sycl::reduce_over_group:

unitrace -d -v -i 20 python -c "import dpctl.tensor as dpt; dpt.min(dpt.ones(10**7, dtype=dpt.int32)).sycl_queue.wait()"

Also fixed uses of group_load/group_store and enabled them for the 2025.0 compiler. Warnings with SYCLOS should be gone now.

Have you provided a meaningful PR description?
Have you added a test, reproducer or referred to an issue with a reproducer?
Have you tested your changes locally for CPU and GPU devices?
Have you made sure that new changes do not introduce compiler warnings?
Have you checked performance impact of proposed changes?
Have you added documentation for your changes, if necessary?
Have you added your changes to the changelog?
If this PR is a work in progress, are you opening the PR as a draft?

github-actions · 2024-12-12T04:50:26Z

Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞

github-actions · 2024-12-12T04:58:17Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_324 ran successfully.
Passed: 893
Failed: 3
Skipped: 118

coveralls · 2024-12-12T05:00:08Z

coverage: 87.659%. remained the same
when pulling 63c1947 on speed-up-custom-reductions
into 0bcd635 on master.

Doing so exactly recovers the behavior of sub_group::load<vec_sz>, sub_group::store<vec_sz> and eliminates warnings with 2025.1 and SYCLOS. With this change, enable use of group_load, group_store for DPC++ compiler with `__SYCL_MAJOR_VERSION >= 8u` which includes oneAPI DPC++ 2025.0.x compiler and SYCLOS bundle.

github-actions · 2024-12-12T15:42:03Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_324 ran successfully.
Passed: 894
Failed: 2
Skipped: 118

Factor out bounds as constexpr values, reused between the power-of-two branch and the not-power-of-two branch. Lowered lower bounds from 32 to 8 based on performance testing on PVC and Iris Xe.
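
One way to picture a bound shared between the two branches is the host-side Python sketch below. It is an illustration under assumptions, not the code in sycl_utils.hpp: `LOW_BOUND` and `cooperative_reduce_any_size` are hypothetical names, the bound is assumed to be the number of elements left for the leading work-item, and rounding the stride up is just one possible way to handle a group size that is not a power of two.

```python
from functools import reduce

# Assumed meaning of the bound: once this many elements (or fewer) remain,
# the leader finishes the reduction sequentially.
LOW_BOUND = 8


def cooperative_reduce_any_size(values, op, low_bound=LOW_BOUND):
    # Hypothetical simulation of a work-group of arbitrary size.
    local = list(values)  # stands in for work-group local memory
    active = len(local)
    while active > low_bound:
        # Round the stride up so that, for an odd count, the middle element
        # is carried over unchanged instead of being dropped.
        stride = (active + 1) // 2
        # Each iteration stands in for one work-item; a group barrier would
        # separate consecutive steps on a device.
        for item_id in range(active - stride):
            local[item_id] = op(local[item_id], local[item_id + stride])
        active = stride
    # The leader folds whatever is left.
    return reduce(op, local[:active])


sizes = (1, 7, 8, 9, 100, 255, 256, 1000)
results = {n: cooperative_reduce_any_size(
    [(i * 37) % 1009 for i in range(n)], max) for n in sizes}
```

For a power-of-two size the rounded-up stride equals exactly half of the active range, so both cases share the same loop and the same cutoff in this sketch.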

github-actions · 2024-12-12T20:28:53Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_325 ran successfully.
Passed: 894
Failed: 2
Skipped: 118

github-actions · 2024-12-12T20:40:11Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_326 ran successfully.
Passed: 893
Failed: 3
Skipped: 118

ndgrigorian

This brings a great performance improvement for reductions not using sycl::reduce_over_group.

This LGTM, thank you @oleksandr-pavlyk !

dpctl/tensor/libtensor/include/utils/sycl_utils.hpp

oleksandr-pavlyk requested a review from ndgrigorian as a code owner December 12, 2024 04:14

oleksandr-pavlyk force-pushed the speed-up-custom-reductions branch from e8c10f1 to 03910f3 Compare December 12, 2024 14:54

oleksandr-pavlyk added 2 commits December 12, 2024 13:44

Tweak bounds of cooperative reduction steps

e15e3aa

Factor out bounds as constexpr values, reused between the power-of-two branch and the not-power-of-two branch. Lowered lower bounds from 32 to 8 based on performance testing on PVC and Iris Xe.

Add line for performance improvement in reductions with custom reduce…

63c1947

…_over_group function

ndgrigorian approved these changes Dec 13, 2024

oleksandr-pavlyk commented Dec 13, 2024

dpctl/tensor/libtensor/include/utils/sycl_utils.hpp

oleksandr-pavlyk commented Dec 13, 2024

dpctl/tensor/libtensor/include/utils/sycl_utils.hpp

oleksandr-pavlyk merged commit c5cb665 into master Dec 13, 2024
49 of 51 checks passed

oleksandr-pavlyk deleted the speed-up-custom-reductions branch December 13, 2024 16:53

coveralls mentioned this pull request Dec 13, 2024

This is a follow-up pr to make some changes suggested after approval #1937

Merged
