Fix work around #906 #96

Merged
merged 20 commits into main from FIX_work_around_#906 on Feb 28, 2023

Conversation

@fcharras (Collaborator) commented Feb 27, 2023

We were hitting more and more occurrences of IntelPython/numba-dpex#906, to the point that it felt counter-productive to keep developing with numba_dpex until it is fixed.

After torturing the reproducers, it seems I've found a way to consistently work around the bug without having to change the logic of the kernels.

All kernels that contain barriers seem to be affected. The workaround basically consists in moving the instructions that contain array.__setitem__ calls (or calls to atomic functions) out of those kernels and into dpex.func device functions (sketched below).

The unit test added in this PR demonstrates how it works on a small example. I will also expand on it in IntelPython/numba-dpex#906.

The minimal reproducer shows that not all such instructions need to be moved to device functions: replacing a select few seems to be enough to make the bugs disappear. But since the trigger condition for this bug remains unclear, I went the safe way and replaced them all.

The downside is of course readability, but I don't think we should try to improve it now (the constraint of incorporating those unnatural dpex.func factorizations makes it very challenging anyway). Let's rather move on and hope the bug is fixed upstream in the coming months so we can revert all the hacks.
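
To make the workaround pattern concrete, here is a minimal sketch (illustrative names only, not the actual kernels of this PR; it assumes the numba_dpex 0.20-era kernel API with dpex.kernel, dpex.func, dpex.get_global_id and dpex.barrier, and the exact name of the memory-fence constant may differ between versions):

```python
import numba_dpex as dpex


def _make_setitem_if_kernel_func():
    # Factory returning a `dpex.func` device function that wraps the
    # `array.__setitem__` call, so that the assignment is not emitted
    # directly in the body of the barrier-containing kernel.
    @dpex.func
    def setitem_if(condition, value, array, idx):
        if condition:
            array[idx] = value

    return setitem_if


setitem_if = _make_setitem_if_kernel_func()


@dpex.kernel
def _example_kernel(data, result):
    item_idx = dpex.get_global_id(0)
    local_idx = dpex.get_local_id(0)

    value = data[item_idx] * 2

    # Kernels that contain barriers are the ones affected by the bug.
    dpex.barrier(dpex.LOCAL_MEM_FENCE)

    # Before the workaround this would read:
    #     if local_idx == 0:
    #         result[item_idx] = value
    # i.e. a direct `__setitem__` in the kernel body, which can trigger
    # IntelPython/numba-dpex#906. Routing it through the device function
    # makes the crash disappear.
    setitem_if(local_idx == 0, value, result, item_idx)
```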

Three more unrelated fixes have infiltrated the PR:

  • A dtype mismatch that made some scikit-learn tests fail when the GPU does not support float64 compute
  • I noticed that our tolerance threshold was severely off for large datasets (we forgot to divide it by n_samples...); not sure how we let this slip through
  • On my machine, the scikit-learn test test_kmeans_verbose[0-lloyd] suddenly decided that from now on it should fail half of the time. The cause is a mechanism in scikit-learn I'm skeptical about, called "strict convergence", which is detected by comparing the labels computed by two successive iterations of lloyd. I had to implement the exact same behavior in our lloyd to consistently pass the test, but I think it should only be done when tol == 0 (which is enough to pass the tests); see the arguments in an inline comment in the PR, and the sketch right after this list.
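
For reference, here is a minimal sketch of the behavior described in the last point, in plain NumPy pseudocode rather than the actual dpex implementation (all names are illustrative, and empty clusters are not handled):

```python
import numpy as np


def lloyd(X, centroids, max_iter=300, tol=0.0):
    n_samples = X.shape[0]
    n_clusters = centroids.shape[0]
    labels = np.full(n_samples, -1, dtype=np.int32)

    for _ in range(max_iter):
        previous_labels = labels

        # Assignment step: label each sample with its nearest centroid.
        sq_distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_distances.argmin(axis=1)

        # Update step: recompute the centroids and the total centroid shift.
        new_centroids = np.stack(
            [X[labels == k].mean(axis=0) for k in range(n_clusters)]
        )
        center_shift_tot = ((new_centroids - centroids) ** 2).sum()
        centroids = new_centroids

        # "Strict convergence": the labels of two successive iterations are
        # identical. As argued above, only check it when `tol == 0`, where the
        # tolerance-based criterion below cannot be relied on to stop the loop.
        if tol == 0 and np.array_equal(labels, previous_labels):
            break

        # Usual tolerance-based stopping criterion.
        if center_shift_tot <= tol:
            break

    return labels, centroids
```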

@fcharras fcharras changed the base branch from main to minor_changes_enh February 27, 2023 14:09
@fcharras fcharras force-pushed the FIX_work_around_#906 branch from 563d59e to a82b10d on February 27, 2023 15:57
@fcharras fcharras marked this pull request as ready for review February 27, 2023 15:57
@fcharras fcharras requested review from jjerphan and ogrisel February 27, 2023 15:59

@ogrisel (Collaborator) left a comment:

LGTM. Just some nits:

Comment on lines 338 to 339
# TODO: open an issue at `scikit-learn` and propose to adopt this behavior
# instead ?

Suggested change
# TODO: open an issue at `scikit-learn` and propose to adopt this behavior
# instead ?
# See: https://github.com/scikit-learn/scikit-learn/issues/25716

# behavior if and only if `tol == 0`, because in this case, it is easy to see
# that lloyd can indeed fail to stop at the right time due to numerical errors.
# Moreover, this is enough to pass scikit-learn unit tests. When `tol > 0`, we
# rely on the user setting an appropriate tolerance threshold.

Collaborator:

I would rather track the exact behavior of scikit-learn w.r.t. convergence checks.

@@ -86,3 +87,82 @@ def test_spirv_fix():
        kmeans.fit(X_array)
    finally:
        _load_numba_dpex_with_patches()


def test_hack_906():

Collaborator:

Let's use an explicit function name:

Suggested change
def test_hack_906():
def test_need_to_workaround_numba_dpex_906():

sample_idx, # PARAM
first_centroid_idx, # PARAM
euclidean_distances_t, # IN
sq_distances # OUT

Collaborator:

Please swap the last two args as sq_distances is the input and euclidean_distances_t is the output.

Collaborator (author):

Good catch, TY.

@jjerphan (Member) left a comment:

LGTM.

Thank you, @fcharras.

Comment on lines +703 to +704
local_sum_and_set_items_if = _make_sum_and_set_items_if_kernel_func()
global_sum_and_set_items_if = _make_sum_and_set_items_if_kernel_func()

Member:

Why do we need two of those and not just one? Is it for this reason?

Suggested change
local_sum_and_set_items_if = _make_sum_and_set_items_if_kernel_func()
global_sum_and_set_items_if = _make_sum_and_set_items_if_kernel_func()
# HACK: must define twice to work around the bug highlighted in
# test_regression_fix
local_sum_and_set_items_if = _make_sum_and_set_items_if_kernel_func()
global_sum_and_set_items_if = _make_sum_and_set_items_if_kernel_func()

Collaborator (author):

No need to worry too much about this: the hack will be removed everywhere after the bump to 0.20.0dev3. The PR has already been opened (#93) and hopefully can be merged soon after this one.

Comment on lines +467 to +468
local_sum_and_set_items_if = _make_sum_and_set_items_if_kernel_func()
global_sum_and_set_items_if = _make_sum_and_set_items_if_kernel_func()

Member:

Why do we need two of those and not just one? Is it for this reason?

Suggested change
local_sum_and_set_items_if = _make_sum_and_set_items_if_kernel_func()
global_sum_and_set_items_if = _make_sum_and_set_items_if_kernel_func()
# HACK: must define twice to work around the bug highlighted in
# test_regression_fix
local_sum_and_set_items_if = _make_sum_and_set_items_if_kernel_func()
global_sum_and_set_items_if = _make_sum_and_set_items_if_kernel_func()

assert dpt.asnumpy(result)[0] == 10


# HACK 906

Member:

Suggested change
# HACK 906
# HACK 906: see sklearn_numba_dpex.patches.tests.test_patches.test_need_to_workaround_numba_dpex_906 # noqa

    argmin_indices[group_id] = local_argmin[zero_idx]
else:
    argmin_indices[group_id] = local_argmin[one_idx]
# HACK 906

Member:

Suggested change
# HACK 906
# HACK 906: see sklearn_numba_dpex.patches.tests.test_patches.test_need_to_workaround_numba_dpex_906 # noqa

Comment on lines +333 to +335
# Thus, shouldnt strict convergence checking be enabled only if `tol == 0` ?
# (which is, moreover, the only case where strict convergence really is tested
# in scikit learn)

Member:

I think it must in this case for consistency, yes.

Comment on lines +320 to +321
# ???: if two successive assignations have been computed equal, it's called
# "strict convergence" and means that the algorithm has converged and can't get

@jjerphan (Member) commented Feb 28, 2023:

Naive question for a better understanding: What is the meaning of the ??? here? Is it a question to be resolved? 🙂

Collaborator (author):

It opens the discussion about diverging from the scikit-learn behavior, or considering changing it upstream (scikit-learn/scikit-learn#25716).

Comment on lines +326 to +327
# Providing the user chooses a sensible value for `tol`, wouldn't the cost of
# this check be in general greater than what the benefits ?

Member:

I think both checks (i.e. the one using tol and the one checking for equality) should have the same computational cost. What do you think?

@fcharras (Collaborator, author) commented Feb 28, 2023:

Strict convergence checking has an additional cost that scales linearly with the number of points in the data (see the sketch below).
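
To make the difference concrete, an illustrative comparison (plain NumPy, not the actual kernel code; the sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_clusters = 1_000_000, 8

labels = rng.integers(0, n_clusters, size=n_samples)
previous_labels = labels.copy()
center_shift_tot, tol = 1e-7, 1e-4

# Tolerance-based check: a single scalar comparison (computing the shift
# itself only costs O(n_clusters * n_features)).
converged_tol = center_shift_tot <= tol

# Strict convergence check: compares the full label vectors of two successive
# iterations, i.e. an extra O(n_samples) pass, plus the memory needed to keep
# the previous labels around.
converged_strict = np.array_equal(labels, previous_labels)
```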

Comment on lines +130 to +134
rationale = """If this test fails, it means that the bug reported at
https://github.com/IntelPython/numba-dpex/issues/906 has been fixed, and all the
hacks tagged with `# HACK 906` that were used to work around it can now be removed.
This test can also be removed.
"""

Member:

Thanks. 👍

@@ -539,7 +582,7 @@ def prepare_data_for_lloyd(X_t, init, tol, sample_weight, copy_x):

variance = variance_kernel(dpt.reshape(X_t, -1))
# Use numpy type to work around https://github.com/IntelPython/dpnp/issues/1238
- tol = (dpt.asnumpy(variance)[0] / n_features) * tol
+ tol = (dpt.asnumpy(variance)[0] / (n_features * n_samples)) * tol

@jjerphan (Member) commented Feb 28, 2023:

Side-note: Is this change related to the scope of this PR?

I am fine with it being integrated as part of this PR.

Collaborator (author):

Not related (but it's a one-line fix).

@fcharras (Collaborator, author):

TY for the reviews @jjerphan @ogrisel, I've applied your suggestions and answered the question.

I've added an additional commit that fuses strict convergence checking with the main lloyd kernel.

I'll go ahead and merge if it's green after launch.

@fcharras (Collaborator, author):

So I noticed that using 2D work group sizes harms performance (kernels are about 30% slower). The last commit reverts the 2D work group size diff from #88. I will afterward open a revert PR for this revert commit; if things are fixed in numba_dpex regarding 2D group sizes, maybe we can merge it again later (theoretically, it should be better).

@fcharras fcharras changed the base branch from minor_changes_enh to main February 28, 2023 16:40
@fcharras fcharras merged commit 7b647cb into main Feb 28, 2023
@fcharras fcharras deleted the FIX_work_around_#906 branch February 28, 2023 16:41