Elementwise functions tuning #1889

Merged
merged 13 commits into from
Nov 22, 2024

Conversation

oleksandr-pavlyk
Collaborator

This PR revisits elementwise functions functors for contiguous inputs.

  1. Since the work-group size is chosen so that it is always a multiple of any permissible sub-group size, there is no point in calling the more expensive sg.get_local_range(), so it is replaced with the cheaper sg.get_max_local_range(). This change also slightly reduces the binary size due to the leaner kernel (from 36428264 bytes down to 36345112 bytes).

  2. Implementations of each elementwise function for contiguous input can now set hyperparameters vec_sz and n_vecs differently for different input types. This ability is applied to add_contig_impl for some modest performance improvement for int32_t, uint32_t, int64_t, uint64_t, float and double.

  3. Fixed a missing check in the implementation of minimum and maximum for the sycl::half type for vector inputs, which caused test failures on AMD CPUs in CI during earlier iterations of this work (Subgroup load store cleanup #1879).

  4. Added missing include <type_traits> in type dispatching headers, and simplified code.
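
As an illustration of item 2, per-type hyperparameters can be selected at compile time with template specialization. The following is only a minimal sketch in plain C++: the struct name `AddContigHyperparams`, the helper `elems_per_work_item`, and every numeric value are hypothetical and are not taken from the dpctl sources.

```cpp
#include <cstddef>
#include <cstdint>

namespace
{

// Primary template: hypothetical default hyperparameters for any type pair.
template <typename argTy1, typename argTy2> struct AddContigHyperparams
{
    static constexpr std::uint8_t vec_sz = 4; // elements per vector load/store
    static constexpr std::uint8_t n_vecs = 2; // vectors handled per work-item
};

// Hypothetical override for 32-bit signed integers (values are illustrative).
template <> struct AddContigHyperparams<std::int32_t, std::int32_t>
{
    static constexpr std::uint8_t vec_sz = 8;
    static constexpr std::uint8_t n_vecs = 2;
};

// Hypothetical override for double precision (values are illustrative).
template <> struct AddContigHyperparams<double, double>
{
    static constexpr std::uint8_t vec_sz = 2;
    static constexpr std::uint8_t n_vecs = 4;
};

} // namespace

// Number of elements one work-item processes for the given input types.
template <typename argTy1, typename argTy2>
constexpr std::size_t elems_per_work_item()
{
    using H = AddContigHyperparams<argTy1, argTy2>;
    return static_cast<std::size_t>(H::vec_sz) * H::n_vecs;
}

static_assert(elems_per_work_item<float, float>() == 8,
              "float falls back to the primary template");
static_assert(elems_per_work_item<std::int32_t, std::int32_t>() == 16,
              "int32_t uses its own specialization");
```

A kernel launcher written this way reads `vec_sz` and `n_vecs` from the selected struct, so tuning one type pair does not affect the others.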


  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • Have you added documentation for your changes, if necessary?
  • Have you added your changes to the changelog?
  • If this PR is a work in progress, are you opening the PR as a draft?

github-actions bot commented Nov 12, 2024

Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_202 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_203 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

@coveralls
Collaborator

coveralls commented Nov 12, 2024

Coverage Status

coverage: 87.725%. remained the same
when pulling 88c3e1a on elementwise-functions-tuning
into 3699f92 on master.

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_209 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

@oleksandr-pavlyk oleksandr-pavlyk force-pushed the elementwise-functions-tuning branch from ae56e7b to 4c0de00 Compare November 13, 2024 01:59

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_204 ran successfully.
Passed: 895
Failed: 0
Skipped: 119

@oleksandr-pavlyk oleksandr-pavlyk force-pushed the elementwise-functions-tuning branch from 4c0de00 to 70a0a3f Compare November 15, 2024 21:52

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_226 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_227 ran successfully.
Passed: 895
Failed: 0
Skipped: 119

@oleksandr-pavlyk oleksandr-pavlyk force-pushed the elementwise-functions-tuning branch from ce43706 to b97fcb4 Compare November 18, 2024 18:14

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_228 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_231 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_249 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

@oleksandr-pavlyk oleksandr-pavlyk force-pushed the elementwise-functions-tuning branch from cb42f35 to e152430 Compare November 19, 2024 14:12

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_249 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

@oleksandr-pavlyk oleksandr-pavlyk force-pushed the elementwise-functions-tuning branch from 070fcfd to d63ef28 Compare November 19, 2024 20:47

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_258 ran successfully.
Passed: 895
Failed: 0
Skipped: 119

@oleksandr-pavlyk oleksandr-pavlyk force-pushed the elementwise-functions-tuning branch from d63ef28 to 01a25f0 Compare November 19, 2024 21:39

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_258 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

Use sg.get_max_local_range instead. `sg.get_local_range` must perform
many checks to determine whether this is the last, trailing sub-group in the
work-group, whose actual size may be smaller. We set the local work-group
size to 128, which is a multiple of any sub-group size, and hence
get_local_range() always equals get_max_local_range().

The size of the work-groups was increased from 128 to 256, which is
chosen so that all 8 threads of a single vector with simd32 are used.
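
The arithmetic behind these two commit messages can be checked with a small model in plain C++. This is a simplified sketch, not SYCL code: `actual_sub_group_size` and `all_sub_groups_full` are hypothetical helpers that assume a work-group is split into consecutive sub-groups of at most `max_sg_size` work-items, with only the trailing one possibly partial.

```cpp
#include <cstddef>

// Number of work-items in sub-group `sg_id`, assuming sg_id * max_sg_size < wg_size.
// This models the value that get_local_range() has to work out at run time.
constexpr std::size_t actual_sub_group_size(std::size_t wg_size,
                                            std::size_t max_sg_size,
                                            std::size_t sg_id)
{
    const std::size_t remaining = wg_size - sg_id * max_sg_size;
    return remaining < max_sg_size ? remaining : max_sg_size;
}

// True when every sub-group in the work-group has the maximal size, i.e. when
// the cheaper get_max_local_range() gives the same answer for all of them.
constexpr bool all_sub_groups_full(std::size_t wg_size, std::size_t max_sg_size)
{
    const std::size_t n_sub_groups = (wg_size + max_sg_size - 1) / max_sg_size;
    for (std::size_t sg_id = 0; sg_id < n_sub_groups; ++sg_id) {
        if (actual_sub_group_size(wg_size, max_sg_size, sg_id) != max_sg_size) {
            return false;
        }
    }
    return true;
}

// 128 and 256 are multiples of the sub-group sizes 8, 16 and 32.
static_assert(all_sub_groups_full(128, 8) && all_sub_groups_full(128, 16) &&
                  all_sub_groups_full(128, 32),
              "work-group of 128 has no partial sub-group");
static_assert(all_sub_groups_full(256, 8) && all_sub_groups_full(256, 16) &&
                  all_sub_groups_full(256, 32),
              "work-group of 256 has no partial sub-group");

// A work-group size that is not a multiple leaves a partial trailing sub-group.
static_assert(!all_sub_groups_full(100, 32), "100 = 3 * 32 + 4");

// 256 work-items at sub-group width 32 give 8 sub-groups.
static_assert(256 / 32 == 8, "8 sub-groups of width 32");
```

The last assertion matches the stated motivation for 256: at simd32, a work-group of 256 work-items maps to exactly 8 sub-groups.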

Set vec_sz and n_vecs in implementations of contig_impl for each supported function

Make local work-groups size dependent on number of elements to process

Fixes for type dispatching utils

1. Add missing include <type_traits> needed for std::true_type,
   std::disjunction, and std::conjunction

2. Replace std::bool_constant<std::is_same_v<T1, T2>> with the direct
   and simpler std::is_same<T1, T2> in a couple of instances
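
The simplification in item 2 relies on the standard `std::is_same<T1, T2>` trait already deriving from `std::true_type` or `std::false_type`, so wrapping its value in `std::bool_constant` adds nothing. A minimal sketch follows; the alias names `SameVerbose`, `SameDirect` and `IsFloatOrDouble` are hypothetical and serve only as illustration.

```cpp
#include <type_traits> // std::true_type, std::is_same, std::disjunction

// Verbose spelling: extract the bool, then wrap it in a type again.
template <typename T1, typename T2>
using SameVerbose = std::bool_constant<std::is_same_v<T1, T2>>;

// Direct spelling: std::is_same is itself a std::integral_constant<bool, ...>.
template <typename T1, typename T2>
using SameDirect = std::is_same<T1, T2>;

static_assert(SameVerbose<int, int>::value == SameDirect<int, int>::value,
              "both spellings agree for equal types");
static_assert(SameVerbose<int, long>::value == SameDirect<int, long>::value,
              "both spellings agree for different types");

// Either form composes with std::disjunction and std::conjunction.
template <typename T>
using IsFloatOrDouble =
    std::disjunction<std::is_same<T, float>, std::is_same<T, double>>;

static_assert(IsFloatOrDouble<double>::value, "double is accepted");
static_assert(!IsFloatOrDouble<int>::value, "int is rejected");
```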

Hide hyperparameter selection struct in anonymous namespace
vec operator should also apply isnan for sycl::half
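
To show why an isnan check matters for a NaN-propagating minimum, here is a minimal sketch in plain C++. It uses `float` in place of `sycl::half` and scalar values in place of vectors; `naive_min` and `nan_propagating_min` are hypothetical names, and this is not the dpctl implementation.

```cpp
#include <cmath>

// Comparison-only minimum: any comparison with NaN is false, so a NaN in the
// first argument is silently dropped.
template <typename T> T naive_min(T a, T b)
{
    return (a < b) ? a : b;
}

// Minimum that returns NaN whenever either argument is NaN.
template <typename T> T nan_propagating_min(T a, T b)
{
    if (std::isnan(a)) {
        return a;
    }
    if (std::isnan(b)) {
        return b;
    }
    return (a < b) ? a : b;
}
```

With `a = NAN` and `b = 1.0f`, `naive_min` returns `1.0f`, while `nan_propagating_min` returns NaN.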
This would resolve compiler warnings about the deprecated sub_group::load and
sub_group::store methods. (Warnings in builds with the nightly SYCLOS DPC++
bundle should be fixed now.)

Additionally, replaced unsigned int type for template parameters with
std::uint8_t
Predicate use of experimental extension on this variable being set.

Since use of this experimental extension, as implemented by oneAPI
DPC++ 2025.0.0, causes test failures in `dpctl`, the use of this
extension is turned off for DPC++ 2025.0.0
@oleksandr-pavlyk oleksandr-pavlyk force-pushed the elementwise-functions-tuning branch from 01a25f0 to 2531261 Compare November 21, 2024 01:43

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_268 ran successfully.
Passed: 893
Failed: 2
Skipped: 119

@oleksandr-pavlyk
Collaborator Author

oleksandr-pavlyk commented Nov 21, 2024

I ran performance microbenchmarks for the elementwise binary functions add and multiply on a Meteor Lake iGPU and on an RTX 3050 for Laptops GPU. No performance regression was observed for the changes in this branch versus the development branch.

ndgrigorian
ndgrigorian previously approved these changes Nov 21, 2024
Collaborator

@ndgrigorian ndgrigorian left a comment

LGTM, this brings a welcome performance improvement!

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_271 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

@oleksandr-pavlyk
Collaborator Author

Coveralls did receive the results of the coverage run, per its own records:

image

which corresponds to the latest commit 88c3e1a on this branch.

@ndgrigorian ndgrigorian merged commit ea6ae0b into master Nov 22, 2024
50 of 52 checks passed
@ndgrigorian ndgrigorian deleted the elementwise-functions-tuning branch November 22, 2024 19:11