Elementwise functions tuning #1889
dpctl/tensor/libtensor/include/kernels/elementwise_functions/vec_size_util.hpp
Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞 |
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_202 ran successfully. |
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_203 ran successfully. |
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_209 ran successfully. |
ae56e7b
to
4c0de00
Compare
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_204 ran successfully. |
4c0de00
to
70a0a3f
Compare
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_226 ran successfully. |
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_227 ran successfully. |
ce43706
to
b97fcb4
Compare
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_228 ran successfully. |
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_231 ran successfully. |
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_249 ran successfully. |
cb42f35
to
e152430
Compare
070fcfd
to
d63ef28
Compare
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_258 ran successfully. |
d63ef28
to
01a25f0
Compare
Use `sg.get_max_local_range` instead. `sg.get_local_range` must perform many checks to determine whether this is the last trailing sub-group in the work-group, whose actual size may be smaller. We set the local work-group size to 128, which is a multiple of any sub-group size, and hence `get_local_range()` always equals `get_max_local_range()`.
The work-group size was increased from 128 to 256, chosen so that all 8 threads of a single vector with simd32 are used.
Set vec_sz and n_vecs in implementations of contig_impl for each supported function.
Make the local work-group size dependent on the number of elements to process.
Fixes for type dispatching utils: 1. Add missing include <type_traits> needed for std::true_type, std::disjunction, and std::conjunction. 2. Replace std::bool_constant<std::is_same_v<T1, T2>> with the direct and simpler std::is_same<T1, T2> in a couple of instances.
Hide hyperparameter selection struct in anonymous namespace.
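The divisibility argument in the first commit message can be checked with plain arithmetic. The sketch below is not dpctl or SYCL code: the helper names and the list of sub-group sizes (8, 16, 32, 64) are assumptions made for illustration. It shows that when the work-group size is a multiple of the sub-group size, the trailing sub-group is full, so the actual and maximal sub-group ranges coincide.

```cpp
#include <cstddef>

// Hypothetical helper (not part of dpctl or SYCL): number of work-items in
// the last sub-group when a work-group of wg_size work-items is split into
// sub-groups of at most sg_max_size work-items.
constexpr std::size_t trailing_sub_group_size(std::size_t wg_size,
                                              std::size_t sg_max_size)
{
    const std::size_t rem = wg_size % sg_max_size;
    return (rem == 0) ? sg_max_size : rem;
}

// Assumed set of sub-group sizes, for illustration only.
constexpr std::size_t assumed_sg_sizes[] = {8, 16, 32, 64};

// True if the trailing sub-group is full for every assumed sub-group size.
constexpr bool all_sub_groups_full(std::size_t wg_size)
{
    for (std::size_t sg : assumed_sg_sizes) {
        if (trailing_sub_group_size(wg_size, sg) != sg) {
            return false;
        }
    }
    return true;
}

// Both work-group sizes mentioned in the commit messages pass the check.
static_assert(all_sub_groups_full(128));
static_assert(all_sub_groups_full(256));
// A size that is not a multiple of 8 leaves a partial trailing sub-group.
static_assert(!all_sub_groups_full(100));
```

With a partial trailing sub-group ruled out at compile time, a kernel has no need to query the actual sub-group size per work-item, which is the motivation stated for the switch.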
vec operator should also apply isnan for sycl::half
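As a minimal sketch of why the `isnan` check matters, the helpers below use `float` as a stand-in for `sycl::half` so that the example builds without a SYCL compiler. The function names are hypothetical and this is not the dpctl implementation.

```cpp
#include <cmath>

// Hypothetical scalar helpers; float stands in for sycl::half.
// A plain (x > y) ? x : y returns y whenever x is NaN, because every
// ordered comparison with NaN is false. Checking isnan(x) first makes a
// NaN in either argument propagate to the result.
inline float nan_propagating_max(float x, float y)
{
    return (std::isnan(x) || x > y) ? x : y;
}

inline float nan_propagating_min(float x, float y)
{
    return (std::isnan(x) || x < y) ? x : y;
}
```

If `x` is NaN it is returned directly. If only `y` is NaN the comparison is false, so `y` is returned and the NaN still propagates.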
This would resolve compiler warnings about deprecated sub_group::load, sub_group::store methods. (Warnings in build with nightly SYCLOS DPC++ bundle should be fixed now). Additionally, replaced unsigned int type for template parameters with std::uint8_t
Predicate use of experimental extension on this variable being set. Since use of this experimental extension, as implemented by oneAPI DPC++ 2025.0.0, causes test failures in `dpctl`, the use of this extension is turned off for DPC++ 2025.0.0
01a25f0
to
2531261
Compare
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_268 ran successfully. |
I ran a performance micro-benchmark on elementwise binary functions |
LGTM, this brings a welcome performance improvement!
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_271 ran successfully. |
The coverall got the results of coverage run, per its own records: which corresponds to the latest commit 88c3e1a on this branch. |
This PR revisits elementwise function functors for contiguous inputs.
Since the work-group size is chosen so that it is always a multiple of any permissible sub-group size, there is no point in using the more expensive `sg.get_local_range()`, so it is replaced with the cheaper `sg.get_max_local_range()`. This change also slightly reduces the binary size due to a leaner kernel (from 36428264 bytes down to 36345112 bytes).
Implementations of each elementwise function for contiguous input can now set hyperparameters `vec_sz` and `n_vecs` differently for different input types. This ability is applied to `add_contig_impl` for some modest performance improvement for `int32_t`, `uint32_t`, `int64_t`, `uint64_t`, `float` and `double`.
Fixed a missing check in the implementation of `minimum` and `maximum` for the `sycl::half` type for vector inputs, which caused test failures on AMD CPUs in CI during earlier iterations of this work (Subgroup load store cleanup #1879).
Added missing `include <type_traits>` in type dispatching headers, and simplified code.
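The per-type hyperparameter selection described above can be sketched as follows. Everything here is an assumption for illustration: the struct name `ContigHyperparams`, the helper functions, and the numeric values are hypothetical and are not the ones chosen in the PR. The sketch only shows how `vec_sz` and `n_vecs` could vary with the input type using `std::disjunction` and `std::is_same` from `<type_traits>`, and how they determine the number of elements one work-group covers.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical hyperparameter holder; the name and the values are
// illustrative only and do not reflect the values used in dpctl.
template <typename T> struct ContigHyperparams
{
private:
    // The types listed in the PR description as benefiting from tuning.
    static constexpr bool is_tuned_type =
        std::disjunction<std::is_same<T, std::int32_t>,
                         std::is_same<T, std::uint32_t>,
                         std::is_same<T, std::int64_t>,
                         std::is_same<T, std::uint64_t>,
                         std::is_same<T, float>,
                         std::is_same<T, double>>::value;

public:
    // Elements per vector load/store, and vectors handled per work-item.
    static constexpr std::uint8_t vec_sz = is_tuned_type ? 8 : 4;
    static constexpr std::uint8_t n_vecs = 2;
};

// Elements processed by one work-group of lws work-items.
template <typename T>
constexpr std::size_t elems_per_group(std::size_t lws)
{
    return lws * ContigHyperparams<T>::vec_sz * ContigHyperparams<T>::n_vecs;
}

// Work-groups needed to cover nelems elements (ceiling division).
template <typename T>
constexpr std::size_t n_groups(std::size_t nelems, std::size_t lws)
{
    const std::size_t per_group = elems_per_group<T>(lws);
    return (nelems + per_group - 1) / per_group;
}
```

Under these assumed values, a work-group of 256 work-items covers 4096 `float` elements but only 2048 `std::int8_t` elements, so the launch configuration changes with the input type even though the kernel body is shared.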