Having a lot of segfaults / incorrect outputs with latest updates #1152
Comments
@fcharras Try running with …; the out-of-resources error hints that the kernel may be asking for too much SLM, for example.
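For reference, the limits that an out-of-resources failure usually runs into can be inspected from Python with dpctl; a minimal sketch, assuming dpctl's SyclDevice exposes the attributes shown:

```python
# Sketch: print the resource limits relevant to "out of resources" failures
# (SLM per work-group and maximum work-group size) for the default device.
import dpctl

device = dpctl.select_default_device()
print("device             :", device.name)
print("local_mem_size     :", device.local_mem_size, "bytes")  # shared local memory (SLM)
print("max_work_group_size:", device.max_work_group_size)
```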
(showing the last printed events before terminate) with:

with opencl:
It seems that I can indeed avoid the terminate if I decrease the group size. But it used to work before, and I really don't think the kernel allocates too much (comparatively to …). It does not address the other issue where some output results are now wrong. It reminds me of #1106 and #906, where some work groups just seem to not be dispatched.
The -14 error code for OpenCL is EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST, and -5 is OUT_OF_RESOURCES. Could you please check if the OpenCL trace signaled any error codes that preceded -14?
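For quick reference when scanning traces, here is a trivial helper for the two status codes named above (the numeric values are the standard OpenCL ones quoted in this thread):

```python
# Map the OpenCL status codes discussed above to readable names.
OPENCL_STATUS = {
    -5: "CL_OUT_OF_RESOURCES",
    -14: "CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST",
}

def describe_status(code: int) -> str:
    """Return a readable name for an OpenCL status code seen in a trace."""
    return OPENCL_STATUS.get(code, f"unknown status {code}")

print(describe_status(-14))  # CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST
```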
Here is the full traceback for opencl; -14 and -5 are the only errors.
@fcharras thank you for reporting this bug. I'll look into it. Could you please provide instructions on how to reproduce the problem? (What steps should I perform to run the tests and see the errors/failures?)
@ZzEeKkAa Please follow the instructions at https://github.com/soda-inria/sklearn-numba-dpex
I am testing different combinations of versions of the dependencies (oneAPI basekit release, dpnp, dpctl, numba_dpex) and running the tests, and the trend is that the frequency and scope of issues similar to this one, #906, or #1106 really only depend on the numba_dpex version. So I think the regressions come either from numba_dpex itself or from its specific dependencies, with numba having been bumped from 0.56 to 0.57. Using …

@ZzEeKkAa I think the problem can be reproduced with any recent numba_dpex version on any GPU, so if you set up an environment for it, then also install sklearn_numba_dpex (see e.g. this howto) in editable mode and run the tests, it should also trigger the segfaults at least.

(Also worth mentioning that the more recent benchmarks I've been running for our benchmark project suggest there's been a significant performance regression around ….)
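If it helps with reproducing, here is a minimal sketch of driving the test suite programmatically once sklearn_numba_dpex is installed in editable mode; it assumes the tests are importable under the sklearn_numba_dpex package and that pytest is installed:

```python
# Sketch: run the sklearn-numba-dpex test suite and propagate its exit status.
import sys

import pytest

# --pyargs collects tests from the installed package rather than from a path;
# -x stops at the first failure, -ra summarizes the failing tests at the end.
exit_code = pytest.main(["--pyargs", "sklearn_numba_dpex", "-x", "-ra"])
sys.exit(exit_code)
```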
Could the latest numba_dpex version be made compatible with both numba 0.56 and >=0.57, by adapting code paths depending on the numba version? It could be easier to test for possible regressions from numba / llvmlite this way.
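For illustration, version-gated code paths along those lines usually look something like the sketch below; the helper name and both branches are placeholders, not numba_dpex's actual code:

```python
# Sketch: choose an implementation at import time based on the installed numba.
from packaging.version import Version  # assumes the `packaging` library is available

import numba

if Version(numba.__version__) >= Version("0.57"):
    def resolve_target():  # placeholder for whatever differs between releases
        return "code path written against numba >= 0.57"
else:
    def resolve_target():  # placeholder fallback kept for numba 0.56.x
        return "code path kept for numba 0.56"
```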
Thank you! That definitely helps. We've applied several fixes to support 0.57, so it may be tricky to add support for numba 0.56. I guess it is worth trying with … Will try to reproduce the issue tomorrow.
As @diptorupd suggested, setting the environment variable … seems to help.
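For completeness, settings like that are usually exported before numba_dpex is imported; a sketch with a placeholder name, since the actual variable name did not survive in the quote above:

```python
# Sketch only: "NUMBA_DPEX_PLACEHOLDER_FLAG" is NOT a real setting, just a stand-in
# for the variable suggested above; it must be set before the import that reads it.
import os

os.environ["NUMBA_DPEX_PLACEHOLDER_FLAG"] = "1"

import numba_dpex  # noqa: E402  (imported after the environment is configured)
```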
Oh no, actually I'm still having some segfaults in the sklearn-numba-dpex test suite, but much fewer.
So I've bisected recent numba_dpex history and things start breaking at d3d7ef6. d3d7ef6 comes along the merge commit 547382d of #1112; it's the second commit of this two-commit PR. Could d3d7ef6 be reverted as a hotfix? (Thus it does not seem related to numba / llvmlite versions after all.)

Edit: more specifically, reverting the pmb.inlining_threshold = 2 addition does fix the issues. Seems like otherwise keeping … is fine.
Yes, even as I was investigating I had a feeling that setting an aggressive inline level might also be an issue. OK. Can you please test #906 with and without the pmb.inlining_threshold = 2 setting? I want to create a unit test as well.
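For context, a minimal sketch of what toggling that setting looks like with llvmlite's legacy pass-manager binding (the API available around llvmlite 0.40); this is illustrative only, not numba_dpex's actual code:

```python
# Sketch: build a module pass manager with or without the inlining_threshold setting.
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

def make_pass_manager(set_inlining_threshold: bool) -> llvm.ModulePassManager:
    pmb = llvm.PassManagerBuilder()
    pmb.opt_level = 2
    if set_inlining_threshold:
        # The addition from #1112 that reverting reportedly avoids the segfaults.
        pmb.inlining_threshold = 2
    pm = llvm.ModulePassManager()
    pmb.populate(pm)
    return pm
```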
I don't think #906 can be related to that, though. Sorry, maybe I'm adding confusion when mentioning this issue along with #1106 and #906 all in the same bucket; it somewhat feels like all of those might have a common cause, but maybe not.
In fact, with the most recent tag 0.22.0dev0 I can't reproduce the issues from the #906 examples now. Did something happen? I would not be positive that it is definitely fixed before I see more long-term stability (those kinds of reproducers seemed to be sensitive to details such as work group sizes...), but maybe there's been a step forward! With …
#1106 is still reproducible with the most recent tag, though.
@fcharras will you be able to join our Gitter channel (https://matrix.to/#/#Data-Parallel-Python_community:gitter.im)? That way we can discuss with you more quickly, in real time.
I'm attempting to bump the whole numba_dpex + dpctl stack, but a lot of tests in sklearn-numba-dpex either segfault or return wrong results.
Yet again, minimal reproducers are tricky to extract, but the regressions are common enough to impact about 10% of the test suite, so it's not that rare.
Using GPU devices sometimes leads to segfaults; if using the level zero backend it prints:

with opencl it's a slightly different error:

but in neither case does the output let me find out what the actual issue is.
When running on CPU, I don't see segfaults but sometimes the tests just fail with wrong outputs.
@diptorupd @oleksandr-pavlyk it would greatly help if you have some insight into things I could try to overcome this. Last time it happened, reverting the driver bumps solved most of the issues; this time it doesn't seem to be related to that.
I can reproduce the issue both locally on an iGPU in our custom docker build, and with the conda install on the Intel Dev Cloud with a Max Series GPU.
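For reference, a small sketch of collecting the version and device information relevant to triaging this, assuming dpctl and numba_dpex import cleanly in the affected environment:

```python
# Sketch: report package versions and the SYCL devices visible to dpctl.
import dpctl
import numba
import numba_dpex

print("numba     :", numba.__version__)
print("numba_dpex:", numba_dpex.__version__)
print("dpctl     :", dpctl.__version__)
for dev in dpctl.get_devices():
    print(dev.backend, dev.device_type, dev.name, dev.driver_version)
```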