
Having a lot of segfault / incorrect outputs with latest updates #1152

Closed
fcharras opened this issue Oct 2, 2023 · 18 comments
Labels
user User submitted issue

Comments

@fcharras

fcharras commented Oct 2, 2023

I'm attempting to bump the whole numba_dpex + dpctl stack, but a lot of tests at sklearn-numba-dpex either segfault or return wrong results.

Yet again, minimal reproducers are tricky to extract, but the regressions are common enough to impact about 10% of the test suite, so it's not that rare.

Using GPU devices sometimes leads to segfaults; with the Level Zero backend it prints:

terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -1 (PI_ERROR_DEVICE_NOT_FOUND) -1 (PI_ERROR_DEVICE_NOT_FOUND)

Aborted (core dumped)

With OpenCL it's a slightly different error:

terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)
Aborted (core dumped)

but in both cases that didn't let me find out what the actual issue is.

When running on CPU, I don't see segfaults but sometimes the tests just fail with wrong outputs.

@diptorupd @oleksandr-pavlyk it would greatly help if you have any insight on things I could try to overcome this. Last time this happened, reverting the driver bumps solved most of the issues; this time that doesn't seem to be the cause.

I can reproduce the issue both locally on an iGPU in our custom Docker build, and with the conda install on the Intel Dev Cloud with a Max Series GPU.

@oleksandr-pavlyk
Contributor

oleksandr-pavlyk commented Oct 2, 2023

@fcharras Try running with SYCL_PI_TRACE=-1 to identify the actual underlying operation that returns the error status leading to terminate. I suspect it is a kernel launch, but it is best confirmed.

The out-of-resources error hints that the kernel may be asking for too much SLM, for example.
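
For reference, a minimal sketch of enabling the trace from Python (the environment variable has to be set before dpctl first initializes the SYCL runtime, so exporting SYCL_PI_TRACE=-1 in the shell before launching pytest works just as well; the device selection and the small offload below are only placeholders):

    # Illustrative sketch: enable full PI tracing, then run a tiny offload so the
    # trace ends at the last PI call issued before any failure.
    import os

    # Must be set before the SYCL runtime is initialized, i.e. before dpctl is
    # imported for the first time in this process.
    os.environ["SYCL_PI_TRACE"] = "-1"

    import dpctl
    import dpctl.tensor as dpt

    device = dpctl.SyclDevice()  # default-selected device
    print(device)

    # Any small array construction/copy is enough to exercise kernel launch,
    # event wait and queue finish, which show up in the trace as
    # piEnqueue*/piEventsWait/piQueueFinish calls.
    x = dpt.arange(1024, dtype="float32", device=device)
    print(dpt.asnumpy(x)[:4])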

@fcharras
Author

fcharras commented Oct 2, 2023

(showing the last printed events before terminate) with level_zero:

---> piEventGetInfo(
	pi_event : 0x55820996a300
	<unknown> : 4563
	<unknown> : 4
	<unknown> : 0x7ffcc61a43ac
	<nullptr>
) ---> 	pi_result : PI_SUCCESS

---> piEventsWait(
	<unknown> : 1
	pi_event * : 0x7f2904000c40[ 0x55820996a300 ... ]
) ---> 	pi_result : -1
	[out]pi_event * : 0x7f2904000c40[ 0x55820996a300 ... ]

---> piEventRelease(
	pi_event : 0x55820996a300
PI ---> piEventReleaseInternal(Event)
) ---> 	pi_result : PI_SUCCESS

---> piQueueFinish(
	<unknown> : 0x558208ff1240
) ---> 	pi_result : -1

with opencl:

---> piEventsWait(
	<unknown> : 1
	pi_event * : 0x7fc360000c40[ 0x55c8df07a1d0 ... ]
) ---> 	pi_result : -14
	[out]pi_event * : 0x7fc360000c40[ 0x55c8df07a1d0 ... ]

---> piEventRelease(
	pi_event : 0x55c8df07a1d0
) ---> 	pi_result : PI_SUCCESS

---> piQueueFinish(
	<unknown> : 0x55c8dea6f1e0
) ---> 	pi_result : -5

It seems that I can indeed avoid the terminate if I decrease the group size. But it used to work before, and I really don't think the kernel allocates too much (compared to dpctl.SyclDevice().local_mem_size, it allocates much, much less).

It does not address the other issue where some output results are now wrong. It is reminiscent of #1106 and #906, where some work groups just seem not to be dispatched.
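
As a rough illustration of that comparison (the work group size and per-work-item local memory usage below are made-up placeholders, not the actual kernel's values):

    # Compare an estimated per-work-group SLM request with the device limits
    # reported by dpctl.
    import dpctl
    import numpy as np

    device = dpctl.SyclDevice()

    # Placeholder kernel parameters; replace with the real work group size and
    # the shape/dtype of the local memory arrays the kernel declares.
    work_group_size = 256
    local_items_per_work_item = 4
    itemsize = np.dtype(np.float32).itemsize

    requested_slm_bytes = work_group_size * local_items_per_work_item * itemsize

    print("device local_mem_size       :", device.local_mem_size, "bytes")
    print("device max_work_group_size  :", device.max_work_group_size)
    print("requested SLM per work group:", requested_slm_bytes, "bytes")

    assert requested_slm_bytes <= device.local_mem_size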

@oleksandr-pavlyk
Contributor

The -14 error code for OpenCL is EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST, and -5 is OUT_OF_RESOURCES.

Could you please check if the OpenCL trace signaled any error codes that preceded the -14?

@fcharras
Author

fcharras commented Oct 2, 2023

Here is the full trace for OpenCL; -14 and -5 are the only errors.

@diptorupd added the user (User submitted issue) label on Oct 3, 2023
@ZzEeKkAa
Contributor

ZzEeKkAa commented Oct 3, 2023

@fcharras thank you for reporting this bug. I'll look into it. Could you please provide instructions on how to reproduce the problem? (What steps should I perform to run the tests and see the errors/test failures?)

@oleksandr-pavlyk
Contributor

@ZzEeKkAa Please follow the instructions at https://github.com/soda-inria/sklearn-numba-dpex

@fcharras
Author

fcharras commented Oct 3, 2023

I am testing different combinations of dependency versions (oneAPI Base Kit release, dpnp, dpctl, numba_dpex) and running the tests, and the trend is that the frequency and scope of issues similar to this one, #906, or #1106 really only depend on the numba_dpex version. So I think the regressions come either from numba_dpex itself or from its specific dependencies: numba was bumped from <0.57 to >=0.57 at some point this year, with the underlying llvmlite version bumping from <0.40 to <0.41. I don't think there have been major changes to the way numba_dpex calls the JIT compiler, so maybe llvmlite and numba are better suspects for the bug hunt.

Using numba_dpex==0.20.0 with everything else (dpctl, dpnp, Base Kit 2023.2.0) bumped to the most recent versions does not showcase any of the regressions. I will continue bisecting (it's a bit tricky because older numba_dpex versions had other issues, in particular regarding cache management; we've had different patches for those that must be applied or not depending on the versions...)

@ZzEeKkAa I think the problem can be reproduced with any recent numba_dpex version on any GPU, so if you set up an environment for it, then install sklearn_numba_dpex in editable mode (see e.g. this howto) and run the tests, it should trigger the segfaults at least.

(Also worth mentioning: the more recent benchmarks that I've been running for our benchmark project suggest there's been a significant performance regression around numba_dpex==0.21, but it's not showing anymore with the most recent version.)

@fcharras
Author

fcharras commented Oct 3, 2023

Could the latest numba_dpex version be made compatible with both numba 0.56 and >=0.57, by adapting code paths depending on the numba version? It could be easier to test for possible regressions from numba / llvmlite this way.
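
For illustration only (placeholder names, not how numba_dpex actually structures this, and assuming the packaging package is available), such version gating could look like:

    # Select a code path based on the installed numba version.
    from numba import __version__ as numba_version
    from packaging.version import Version

    NUMBA_GE_0_57 = Version(numba_version) >= Version("0.57")

    def select_compiler_path():
        """Return a label for the version-specific path (placeholder logic)."""
        if NUMBA_GE_0_57:
            # Path exercising the numba >= 0.57 / llvmlite >= 0.40 internals.
            return "numba>=0.57 code path"
        # Fallback path for numba 0.56.x / llvmlite < 0.40.
        return "numba<0.57 code path"

    print(numba_version, "->", select_compiler_path())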

@ZzEeKkAa
Contributor

ZzEeKkAa commented Oct 3, 2023

Thank you! That definitely helps. We've applied several fixes to support 0.57, so it may be tricky to add support for numba 0.56. I guess it is worth trying with the conda --force flag.

I will try to reproduce the issue tomorrow.

@fcharras
Author

fcharras commented Oct 4, 2023

As @diptorupd suggested, setting the environment variable NUMBA_OPT=0 solves the segfault / incorrect results issues. (It might however affect performance; I haven't tested by how much.)
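
For reference, NUMBA_OPT is numba's environment variable for the LLVM optimization level (0 disables most optimization); it is safest to set it before numba is imported, e.g. by exporting it in the shell before launching pytest or at the very top of the entry point. A minimal sketch (the final print is only a sanity check):

    # Disable numba's LLVM optimization level as a workaround, as discussed above.
    import os

    os.environ["NUMBA_OPT"] = "0"  # must happen before importing numba

    import numba  # noqa: E402
    from numba.core import config

    print("numba", numba.__version__, "optimization level:", config.OPT)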

@fcharras
Author

fcharras commented Oct 4, 2023

Oh no, actually I'm still having some segfaults in the sklearn-numba-dpex test suite, but much fewer.

@fcharras
Author

fcharras commented Oct 4, 2023

So I've bisected the recent numba_dpex history and things start breaking at d3d7ef6.

d3d7ef6 comes along with the merge commit 547382d of #1112. It's the second commit of this two-commit PR.

Could d3d7ef6 be reverted as a hotfix?

(So it does not seem related to numba / llvmlite versions after all.)

Edit:

More specifically, reverting the

    pmb.inlining_threshold = 2

addition does fix the issues. It seems that otherwise keeping

    cres.library._optimize_final_module()

is fine.
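
For context (my reading, not verified against the numba_dpex source): pmb in that snippet is presumably an llvmlite PassManagerBuilder, whose inlining_threshold enables and bounds the LLVM inliner. A standalone sketch using llvmlite's legacy pass-manager API (available in the llvmlite 0.40/0.41 range discussed here), purely to illustrate the knob, not numba_dpex's actual pipeline code:

    # Build a module pass manager with inlining enabled, to show what setting
    # inlining_threshold on a PassManagerBuilder does.
    import llvmlite.binding as llvm

    llvm.initialize()
    llvm.initialize_native_target()
    llvm.initialize_native_asmprinter()

    pmb = llvm.create_pass_manager_builder()
    pmb.opt_level = 2
    # A non-zero threshold turns the inliner on; the value bounds how eagerly
    # callees are inlined into their callers.
    pmb.inlining_threshold = 2

    pm = llvm.create_module_pass_manager()
    pmb.populate(pm)
    print("module pass manager populated with inlining_threshold = 2")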

@diptorupd
Contributor

More specifically, reverting the

    pmb.inlining_threshold = 2

Yes, even as I was investigating I had a feeling that setting an aggressive inlining level might also be an issue. OK. Can you please test #906 with and without pmb.inlining_threshold = 2? I want to create a unit test as well.

@fcharras
Author

fcharras commented Oct 4, 2023

I don't think #906 can be related to pmb.inlining_threshold = 2, since this addition is from August 18th but #906 was first reported in February.

Sorry, maybe I'm adding confusion by putting this issue, #1106, and #906 all in the same bucket; it somewhat feels like all of those might have a common cause, but maybe not.

@fcharras
Author

fcharras commented Oct 4, 2023

In fact, with the most recent tag 0.22.0dev0, I can't reproduce the issues from the #906 examples now; pmb.inlining_threshold = 2 has no impact.

Did something happen? I would not be positive that it is definitely fixed before I see more long-term stability (those kinds of reproducers seemed to be sensitive to details such as work group sizes...), but maybe there's been a step forward!

With 0.21.0dev1 (which is the most stable tag I've otherwise found for the dpex.kernel feature, see #1157), the bug is still reproducible 🤷

@fcharras
Author

fcharras commented Oct 4, 2023

#1106 is still reproducible in 0.22.0dev0 though, and it is not sensitive to pmb.inlining_threshold = 2 either.

@diptorupd
Contributor

@fcharras will you be able to join our Gitter channel (https://matrix.to/#/#Data-Parallel-Python_community:gitter.im)? We can discuss with you more quickly in real time that way.

@fcharras
Author

fcharras commented Oct 6, 2023

I confirm that #1158 fixes the recent stability issues that were initially reported here, so I think this issue can be closed.

Not all compilation issues are fixed yet, however: the reproducer in #1106 still miscompiles.

@fcharras closed this as completed on Oct 6, 2023