-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH small cleaning and optimizations accross the repo #88
Conversation
Out of WIP. See the top level post for the list of changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR of general improvements, @fcharras. 🙂
Here is a first pass.
sklearn_numba_dpex/kmeans/drivers.py
Outdated
n_features, _divide, max_work_group_size, compute_dtype | ||
divide_by_n_samples_kernel = make_apply_elementwise_func( | ||
(n_features,), | ||
_divide_by(compute_dtype(n_samples)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not find this chaining super intuitive. Can you extract it above just above the call to this function and give it a name? 🙂
math.ceil(math.ceil(n_samples / window_n_candidates) / candidates_window_height) | ||
* candidates_window_height, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give this a name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree but I don't really now how to make it clear, it's twisted that the shape of the grid is adapted to the sliding window on centroids, but each work item apply to one single input sample (even if it's the right thing to do).
n_subgroups = global_size // sub_group_size | ||
n_subgroups = math.ceil(n_samples / window_n_centroids) | ||
global_size = ( | ||
math.ceil(n_subgroups / centroids_window_height) * centroids_window_height, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Identically, is it possible to give this a name?
math.ceil(math.ceil(n_samples / window_n_centroids) / centroids_window_height) | ||
* centroids_window_height, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Identically, is it possible to give this a name?
privatization_idx = ( | ||
sample_idx // sub_group_size | ||
) % n_centroids_private_copies | ||
privatization_idx = sub_group_idx % n_centroids_private_copies |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to avoid using a modulo, here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so :-(
Co-authored-by: Julien Jerphanion <[email protected]>
sklearn_numba_dpex/common/kernels.py
Outdated
# results in some tests. | ||
# TODO: create a minimal reproducer and report to `numba_dpex`. Remove | ||
# the hack once it is fixed. | ||
(_is_cpu or (col_idx < n_cols)) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's probably some bug in numba_dpex or with the cpu backend here that causes branching to not be correctly executed in very specific cases, for this very particular condition, on cpu. I had no luck when trying to have a minimal reproducer and I can't see failures on gpu. So I'd be in favor of letting this pass through in the meantime. ( Note that this also fixes the case with (<-- edit: no it doesn't, ran on wrong device))work_group_size=1
that I added back to the unit tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well it's still wrong on the CI. 🤷
Something is wrong with I'll also run this branch on the dev cloud with the flex170 to see if it shows any performance improvement, and play with the group size to see if it affects performance more than it seems to do on local iGPU. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, the changes improve the readability a lot (I think). Looking forward to the performance impacts (once the cause of the bug has been found).
I have seen a bunch of fishing things (see some of the comments below). Maybe this can explain the broken tests?
def make_broadcast_division_1d_2d_axis0_kernel(shape, work_group_size): | ||
n_rows, n_cols = shape |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def make_broadcast_division_1d_2d_axis0_kernel(shape, work_group_size): | |
n_rows, n_cols = shape | |
def make_broadcast_division_1d_2d_axis0_kernel(shape, work_group_size): | |
n_rows, n_cols = shape |
Here it seems that we assume shape
is always 2d (which is fine): maybe the function name should be updated accordingly to drop the "1d".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent here was to mean that the broadcasted data is 1d
. Would 1d_to_2d
make it clearer ?
""" | ||
ops must be a function that will be interpreted as a dpex.func and is subject to | ||
the same rules. It is expected to take two scalar arguments and return one scalar | ||
value. lambda functions are advised against since the cache will not work with lamda | ||
functions. sklearn_numba_dpex.common._utils expose some pre-defined `ops`. | ||
""" | ||
n_rows, n_cols = shape |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment for this function name.
sklearn_numba_dpex/common/kernels.py
Outdated
work_group_size = n_sub_groups_per_work_group * sub_group_size | ||
|
||
_is_cpu = device.has_aspect_cpu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: drop the leading _
in the variable name if you don't expect a name conflict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll revert this _is_cpu
change, in fact it does not fix the JIT issue. We must wait for the root cause to be addressed in numba_dpex
sklearn_numba_dpex/common/kernels.py
Outdated
def _make_partial_sum_reduction_2d_axis0_kernel( | ||
n_cols, work_group_size, sub_group_size, fused_elementwise_func, dtype | ||
n_cols, work_group_size, sub_group_size, fused_elementwise_func, dtype, _is_cpu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here.
sklearn_numba_dpex/common/kernels.py
Outdated
|
||
# The current work item use the following second coordinate (given by the | ||
# position of the window in the grid of windows, and by the local position of | ||
# the work item in the 2D index): | ||
col_idx = ( | ||
(group_id // n_blocks_per_col) * sub_group_size + local_col_idx | ||
(dpex.get_group_id(one_idx) * sub_group_size) + local_col_idx | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
black:
) | |
) |
which makes me think that I wanted to open a PR to fix the black problem but completely forgot about it. I will wait for this PR to me finalized first though.
n_centroids_private_copies = int(min(n_subgroups, n_centroids_private_copies)) | ||
|
||
# Each set of `sub_group_size` consecutive work items is assigned one private | ||
# copy, and several such set can be assigned to the same private copy. Thus, at |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# copy, and several such set can be assigned to the same private copy. Thus, at | |
# copy, and several such sets can be assigned to the same private copy. Thus, at |
@@ -217,7 +222,8 @@ def _initialize_results(results): | |||
@dpex.func | |||
# fmt: off | |||
def _initialize_window_of_centroids( | |||
local_work_id, # PARAM | |||
local_row_idx, # PARAM | |||
local_col_idx, # PARAM | |||
first_centroid_idx, # PARAM | |||
centroids_half_l2_norm, # IN | |||
window_of_centroids_half_l2_norms, # OUT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making the shapes explicit after IN / OUT would probably help.
if item_idx < n_centroid_items: | ||
for copy_idx in range(n_centroids_private_copies): | ||
sum_ += centroids_t_private_copies[copy_idx, item_idx] | ||
centroids_t[item_idx] = sum_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't there an indentation problem here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch 😌 TY
sum_ += centroids_t_private_copies[copy_idx, item_idx] | ||
centroids_t[item_idx] = sum_ | ||
|
||
elif item_idx < n_sums: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems redundant with the if item_idx >= n_sums:
early return at the beginning of the kernel.
cluster_sizes, # OUT (n_clusters,) | ||
centroids_t, # OUT (n_features, n_clusters) | ||
centroids_t, # OUT (n_features * n_clusters,) (flattened) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rename to centroids_t_flattened
I've reduced the last failure on the pipeline to what appears to be a JIT issue. The merge will be blocked until it's fixed. It's likely the same JIT issue we've seen before and it gave a much nicer reproducer. See IntelPython/numba-dpex#906 . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM once @ogrisel's comment have been addressed.
) | ||
|
||
# XXX: The kernels seem to work fine with work_group_size==1 on GPU but fail on CPU. | ||
if work_group_size == 1: | ||
if math.prod(work_group_shape) == 1: | ||
raise NotImplementedError("work_group_size==1 is not supported.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the message be changed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my mind work_group_size=math.prod(work_group_shape)
so I think it's good.
There is this nuance because I have introduced the syntax work_group_size=max
where we try to automatically use the maximum possible value for work_group_size
. This change enable checking that max
does not result in work_group_size=1
.
sklearn_numba_dpex/common/kernels.py
Outdated
return ( | ||
work_group_shape, | ||
( | ||
(partial_sum_reduction, partial_sum_reduction_nofunc), # kernels | ||
reduction_block_size, | ||
), | ||
(get_result_shape, get_global_size), | ||
) # shape_update_fn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last comment likely is off due to black formatting.
What do you think of this suggestion?
return ( | |
work_group_shape, | |
( | |
(partial_sum_reduction, partial_sum_reduction_nofunc), # kernels | |
reduction_block_size, | |
), | |
(get_result_shape, get_global_size), | |
) # shape_update_fn | |
kernels = (partial_sum_reduction, partial_sum_reduction_nofunc) | |
shape_update_fn = (get_result_shape, get_global_size) | |
return ( | |
work_group_shape, | |
( | |
kernels, | |
reduction_block_size, | |
), | |
shape_update_fn, | |
) |
sklearn_numba_dpex/common/kernels.py
Outdated
return ( | ||
work_group_shape, | ||
( | ||
(partial_sum_reduction, partial_sum_reduction_nofunc), # kernels | ||
reduction_block_size, | ||
), | ||
(get_result_shape, get_global_size), | ||
) # shape_update_fn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous suggestion also applies here.
sklearn_numba_dpex/common/kernels.py
Outdated
if ( | ||
(local_row_idx < n_active_sub_groups) and | ||
(col_idx < n_cols) and | ||
(work_item_row_idx < sum_axis_size) | ||
(_is_cpu or ((col_idx < n_cols) and | ||
(work_item_row_idx < sum_axis_size))) | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add inline comment to give some precisions regarding this branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll revert this _is_cpu
change, in fact it does not fix the JIT issue. We must wait for the root cause to be addressed in numba_dpex
Co-authored-by: Julien Jerphanion <[email protected]> Co-authored-by: Olivier Grisel <[email protected]>
Perfect. 👍 I leave @ogrisel merge. |
The tests are still red. Ok for merge once fixed. |
The tests fail because of IntelPython/numba-dpex#906 we can merge as soon as it's fixed and we can bump If there's a conflict with other branches before that we can merge except the diff on the 2d sum kernel. |
shape
centroids_private_copies_max_cache_occupancy
and improve the heuristic usingdevice.max_compute_units
It's WIP because using 2D groups for the sum over axis 0 kernel seems to trigger weirg bugs like in IntelPython/numba-dpex#892 in some of the tests.Out of WIP, remaining issues were unrelated, PR is green.
edit: confirmed affected by IntelPython/numba-dpex#906