Speed up custom reductions #1932

Merged: 4 commits, Dec 13, 2024


oleksandr-pavlyk
Collaborator

The function used to perform the custom reduction sequentially
in a single work-item (the leader of the work-group).

It now reduces cooperatively for a few iterations, and then
processes the remaining non-reduced elements sequentially
in the leading work-item.

As a result, custom_reduce_over_group got sped up by about a factor of 3.
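The two-phase scheme described above can be modelled on the host. The following is a minimal sketch, not the code from this PR: the function name `simulate_custom_reduce`, the `scratch` vector standing in for work-group local memory, and the default threshold are all assumptions made for illustration. On a device, each pass of the inner loop would be executed by separate work-items and followed by a group barrier.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of a work-group reduction with a custom
// binary operator. `scratch` plays the role of work-group local memory,
// holding one value per work-item.
template <typename T, typename Op>
T simulate_custom_reduce(std::vector<T> scratch, Op op,
                         std::uint32_t low_sz = 8u)
{
    assert(!scratch.empty());
    assert(low_sz >= 2u); // guarantees that the halving loop terminates

    std::uint32_t n_active = static_cast<std::uint32_t>(scratch.size());

    // Cooperative phase: while enough partial results remain, fold the
    // upper half onto the lower half. Each value of `lid` corresponds to
    // one work-item doing a single application of `op`.
    while (n_active >= low_sz) {
        const std::uint32_t half = (n_active + 1u) / 2u; // rounds up
        for (std::uint32_t lid = 0; lid + half < n_active; ++lid) {
            scratch[lid] = op(scratch[lid], scratch[lid + half]);
        }
        n_active = half;
    }

    // Sequential phase: the leader combines the few remaining values.
    T result = scratch[0];
    for (std::uint32_t i = 1; i < n_active; ++i) {
        result = op(result, scratch[i]);
    }
    return result;
}
```

In this model the leader performs fewer than `low_sz` sequential steps regardless of the group size, while the earlier all-sequential approach needed one step per work-item.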

The following command now shows the timing of the reduction kernel

unitrace -d -v -i 20 python -c "import dpctl.tensor as dpt; dpt.min(dpt.ones(10**7, dtype=dpt.float32)).sycl_queue.wait()"

to be on par with (less than 10% slower than) that of the int32 kernel, which uses
the built-in sycl::reduce_over_group:

unitrace -d -v -i 20 python -c "import dpctl.tensor as dpt; dpt.min(dpt.ones(10**7, dtype=dpt.int32)).sycl_queue.wait()"

This PR also fixes the uses of group_load/group_store and enables them for the 2025.0 compiler. The SYCLOS warnings should be gone now.

  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • Have you added documentation for your changes, if necessary?
  • Have you added your changes to the changelog?
  • If this PR is a work in progress, are you opening the PR as a draft?


github-actions bot commented Dec 12, 2024

Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞


Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_324 ran successfully.
Passed: 893
Failed: 3
Skipped: 118


coveralls commented Dec 12, 2024

Coverage Status: coverage remained the same at 87.659%
when pulling 63c1947 on speed-up-custom-reductions
into 0bcd635 on master.

Doing so exactly recovers the behavior of sub_group::load<vec_sz>,
sub_group::store<vec_sz> and eliminates warnings with 2025.1 and SYCLOS.

With this change, enable the use of group_load and group_store for the DPC++
compiler when `__SYCL_MAJOR_VERSION >= 8u`, which includes the
oneAPI DPC++ 2025.0.x compiler and the SYCLOS bundle.
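A version guard of the kind described in this commit message could look like the sketch below. The macro `SKETCH_USE_GROUP_LOAD_STORE` and both helper functions are hypothetical names chosen for illustration; only the condition `__SYCL_MAJOR_VERSION >= 8u` is taken from the text above. The sketch compiles with any host compiler, where the version macro is simply undefined and the feature stays off.

```cpp
#include <cassert>

// Hypothetical feature switch: turned on only when the compiler defines
// __SYCL_MAJOR_VERSION and its value is at least 8.
#if defined(__SYCL_MAJOR_VERSION) && (__SYCL_MAJOR_VERSION >= 8u)
#define SKETCH_USE_GROUP_LOAD_STORE 1
#else
#define SKETCH_USE_GROUP_LOAD_STORE 0
#endif

// The same predicate as a function, so that it can be checked without
// a SYCL compiler.
constexpr bool version_enables_group_load_store(unsigned sycl_major)
{
    return sycl_major >= 8u;
}

// Reports which code path this translation unit was compiled with.
constexpr bool group_load_store_enabled()
{
    return SKETCH_USE_GROUP_LOAD_STORE != 0;
}
```

Code that loads or stores vectors of elements would then branch on the macro with `#if SKETCH_USE_GROUP_LOAD_STORE`, falling back to the older sub-group load and store calls otherwise.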
@oleksandr-pavlyk force-pushed the speed-up-custom-reductions branch from e8c10f1 to 03910f3 on December 12, 2024 14:54

Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_324 ran successfully.
Passed: 894
Failed: 2
Skipped: 118

Factor out the bounds as constexpr values, reused between the power-of-two
branch and the non-power-of-two branch.

Lowered the lower bound from 32 to 8 based on performance testing
on PVC and Iris Xe.
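To illustrate what the shared bounds control, here is a small host-side sketch. It is an assumption-laden model rather than the code from this commit: the names `low_sz`, `high_sz`, `is_power_of_two` and `leftover_after_folding`, and the value 1024 for the upper bound, are hypothetical. It counts how many partial results are left for the leading work-item after the cooperative halving steps, for both branches.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bounds, defined once and used by both branches below.
constexpr std::uint32_t low_sz = 8u;     // smallest size that is still folded
constexpr std::uint32_t high_sz = 1024u; // assumed largest work-group size

constexpr bool is_power_of_two(std::uint32_t n)
{
    return n != 0u && (n & (n - 1u)) == 0u;
}

// Returns the number of values the leader still has to combine
// sequentially once cooperative folding has stopped.
inline std::uint32_t leftover_after_folding(std::uint32_t wg_size,
                                            std::uint32_t low = low_sz)
{
    assert(low >= 2u); // keeps at least one value and bounds the loop
    std::uint32_t n = wg_size;
    if (is_power_of_two(wg_size)) {
        // Exact halving: the upper half folds onto the lower half.
        for (std::uint32_t sz = high_sz; sz >= low; sz >>= 1) {
            if (n >= sz) n >>= 1;
        }
    }
    else {
        // Rounded-up halving: an odd element out is carried over as is.
        for (std::uint32_t sz = high_sz; sz >= low; sz >>= 1) {
            if (n >= sz) n = (n + 1u) >> 1;
        }
    }
    return n;
}
```

In this model a work-group of 256 items leaves 4 values for the leader with a lower bound of 8, compared with 16 values with a lower bound of 32, and a group of 100 items leaves 7 values with a lower bound of 8.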

Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_325 ran successfully.
Passed: 894
Failed: 2
Skipped: 118


Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_326 ran successfully.
Passed: 893
Failed: 3
Skipped: 118

Collaborator

@ndgrigorian ndgrigorian left a comment


This brings a great performance improvement for reductions not using sycl::reduce_over_group.

This LGTM, thank you @oleksandr-pavlyk!

@oleksandr-pavlyk merged commit c5cb665 into master on Dec 13, 2024
49 of 51 checks passed
@oleksandr-pavlyk deleted the speed-up-custom-reductions branch on December 13, 2024 16:53