
Having a lot of segfault / incorrect outputs with latest updates #1152

Closed
fcharras opened this issue Oct 2, 2023 · 18 comments
Labels
user User submitted issue

Comments

@fcharras

fcharras commented Oct 2, 2023

I'm attempting to bump the whole numba_dpex + dpctl stack, but a lot of tests at sklearn-numba-dpex either segfault or return wrong results.

Yet again, minimal reproducers are tricky to extract, but the regressions are common enough to impact about 10% of the test suite, so it's not that rare.

Using GPU devices sometimes leads to segfaults; with the Level Zero backend it prints:

terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -1 (PI_ERROR_DEVICE_NOT_FOUND) -1 (PI_ERROR_DEVICE_NOT_FOUND)

Aborted (core dumped)

With OpenCL it's a slightly different error:

terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)
Aborted (core dumped)

but in both cases that didn't let me find out what the actual issue is.

When running on CPU, I don't see segfaults but sometimes the tests just fail with wrong outputs.

@diptorupd @oleksandr-pavlyk it would greatly help if you have any insight on things I could try to overcome this. Last time this happened, reverting the driver bumps solved most of the issues; this time that doesn't seem to be the cause.

I can reproduce the issue both locally on an iGPU in our custom Docker build, and with the conda install on the Intel Dev Cloud with a Max Series GPU.

@oleksandr-pavlyk
Contributor

oleksandr-pavlyk commented Oct 2, 2023

@fcharras Try running with SYCL_PI_TRACE=-1 to identify the actual underlying operation that returns the error status leading to terminate. I suspect it is a kernel launch, but it is best confirmed.

The out-of-resources error hints that the kernel may be asking for too much SLM, for example.
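
For reference, a minimal sketch of enabling the trace from Python (the environment variable has to be set before dpctl first initializes the SYCL runtime, so exporting SYCL_PI_TRACE=-1 in the shell before launching pytest works just as well; the device selection and the small offload below are only placeholders):

    # Illustrative sketch: enable full PI tracing, then run a tiny offload so the
    # trace ends at the last PI call issued before any failure.
    import os

    # Must be set before the SYCL runtime is initialized, i.e. before dpctl is
    # imported for the first time in this process.
    os.environ["SYCL_PI_TRACE"] = "-1"

    import dpctl
    import dpctl.tensor as dpt

    device = dpctl.SyclDevice()  # default-selected device
    print(device)

    # Any small array construction/copy is enough to exercise kernel launch,
    # event wait and queue finish, which show up in the trace as
    # piEnqueue*/piEventsWait/piQueueFinish calls.
    x = dpt.arange(1024, dtype="float32", device=device)
    print(dpt.asnumpy(x)[:4])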

@fcharras
Author

fcharras commented Oct 2, 2023

(showing the last printed events before terminate) with level_zero:

---> piEventGetInfo(
	pi_event : 0x55820996a300
	<unknown> : 4563
	<unknown> : 4
	<unknown> : 0x7ffcc61a43ac
	<nullptr>
) ---> 	pi_result : PI_SUCCESS

---> piEventsWait(
	<unknown> : 1
	pi_event * : 0x7f2904000c40[ 0x55820996a300 ... ]
) ---> 	pi_result : -1
	[out]pi_event * : 0x7f2904000c40[ 0x55820996a300 ... ]

---> piEventRelease(
	pi_event : 0x55820996a300
PI ---> piEventReleaseInternal(Event)
) ---> 	pi_result : PI_SUCCESS

---> piQueueFinish(
	<unknown> : 0x558208ff1240
) ---> 	pi_result : -1

with opencl:

---> piEventsWait(
	<unknown> : 1
	pi_event * : 0x7fc360000c40[ 0x55c8df07a1d0 ... ]
) ---> 	pi_result : -14
	[out]pi_event * : 0x7fc360000c40[ 0x55c8df07a1d0 ... ]

---> piEventRelease(
	pi_event : 0x55c8df07a1d0
) ---> 	pi_result : PI_SUCCESS

---> piQueueFinish(
	<unknown> : 0x55c8dea6f1e0
) ---> 	pi_result : -5

It seems that I can indeed avoid the terminate if I decrease the group size. But it used to work before, and I really don't think the kernel allocates too much (compared to dpctl.SyclDevice().local_mem_size, it allocates much, much less).

It does not address the other issue where some output results are now wrong. It is reminiscent of #1106 and #906, where some work groups just seem not to be dispatched.
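
As a rough illustration of that comparison (the work group size and per-work-item local memory usage below are made-up placeholders, not the actual kernel's values):

    # Compare an estimated per-work-group SLM request with the device limits
    # reported by dpctl.
    import dpctl
    import numpy as np

    device = dpctl.SyclDevice()

    # Placeholder kernel parameters; replace with the real work group size and
    # the shape/dtype of the local memory arrays the kernel declares.
    work_group_size = 256
    local_items_per_work_item = 4
    itemsize = np.dtype(np.float32).itemsize

    requested_slm_bytes = work_group_size * local_items_per_work_item * itemsize

    print("device local_mem_size       :", device.local_mem_size, "bytes")
    print("device max_work_group_size  :", device.max_work_group_size)
    print("requested SLM per work group:", requested_slm_bytes, "bytes")

    assert requested_slm_bytes <= device.local_mem_size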

@oleksandr-pavlyk
Contributor

The -14 error code for OpenCL is EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST, and -5 is OUT_OF_RESOURCES.

Could you please check if the OpenCL trace signaled any error codes that preceded the -14?

@fcharras
Author

fcharras commented Oct 2, 2023

Here is the full trace for OpenCL; -14 and -5 are the only errors.

@diptorupd added the user (User submitted issue) label on Oct 3, 2023
@ZzEeKkAa
Contributor

ZzEeKkAa commented Oct 3, 2023

@fcharras thank you for reporting this bug. I'll look into it. Could you please provide instructions on how to reproduce the problem? (What steps should I perform to run the tests and see the errors/test failures?)

@oleksandr-pavlyk
Contributor

@ZzEeKkAa Please follow the instructions at https://github.com/soda-inria/sklearn-numba-dpex

@fcharras
Author

fcharras commented Oct 3, 2023

I am testing different combinations of dependency versions (oneAPI Base Kit release, dpnp, dpctl, numba_dpex) and running the tests, and the trend is that the frequency and scope of issues similar to this one, #906, or #1106 really only depend on the numba_dpex version. So I think the regressions come either from numba_dpex itself or from its specific dependencies: numba was bumped from <0.57 to >=0.57 at some point this year, with the underlying llvmlite version bumping from <0.40 to <0.41. I don't think there have been major changes to the way numba_dpex calls the JIT compiler, so maybe llvmlite and numba are better suspects for the bug hunt.

Using numba_dpex==0.20.0 with everything else (dpctl, dpnp, Base Kit 2023.2.0) bumped to the most recent versions does not showcase any of the regressions. I will continue bisecting (it's a bit tricky because older numba_dpex versions had other issues, in particular regarding cache management; we've had different patches for those that must be applied or not depending on the versions...)

@ZzEeKkAa I think the problem can be reproduced with any recent numba_dpex version on any GPU, so if you set up an environment for it, then install sklearn_numba_dpex in editable mode (see e.g. this howto) and run the tests, it should trigger the segfaults at least.

(Also worth mentioning: the more recent benchmarks that I've been running for our benchmark project suggest there's been a significant performance regression around numba_dpex==0.21, but it's not showing anymore with the most recent version.)

@fcharras
Author

fcharras commented Oct 3, 2023

Could the latest numba_dpex version be made compatible with both numba 0.56 and >=0.57, by adapting code paths depending on the numba version? It could be easier to test for possible regressions from numba / llvmlite this way.
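
For illustration only (placeholder names, not how numba_dpex actually structures this, and assuming the packaging package is available), such version gating could look like:

    # Select a code path based on the installed numba version.
    from numba import __version__ as numba_version
    from packaging.version import Version

    NUMBA_GE_0_57 = Version(numba_version) >= Version("0.57")

    def select_compiler_path():
        """Return a label for the version-specific path (placeholder logic)."""
        if NUMBA_GE_0_57:
            # Path exercising the numba >= 0.57 / llvmlite >= 0.40 internals.
            return "numba>=0.57 code path"
        # Fallback path for numba 0.56.x / llvmlite < 0.40.
        return "numba<0.57 code path"

    print(numba_version, "->", select_compiler_path())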

@ZzEeKkAa
Contributor

ZzEeKkAa commented Oct 3, 2023

Thank you! That definitely helps. We've applied several fixes to support 0.57, so it may be tricky to add support for numba 0.56. I guess it is worth trying with the conda --force flag.

I will try to reproduce the issue tomorrow.

@fcharras
Author

fcharras commented Oct 4, 2023

As @diptorupd suggested, setting the environment variable NUMBA_OPT=0 solves the segfault / incorrect results issues. (It might however affect performance; I haven't tested by how much.)
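
For reference, NUMBA_OPT is numba's environment variable for the LLVM optimization level (0 disables most optimization); it is safest to set it before numba is imported, e.g. by exporting it in the shell before launching pytest or at the very top of the entry point. A minimal sketch (the final print is only a sanity check):

    # Disable numba's LLVM optimization level as a workaround, as discussed above.
    import os

    os.environ["NUMBA_OPT"] = "0"  # must happen before importing numba

    import numba  # noqa: E402
    from numba.core import config

    print("numba", numba.__version__, "optimization level:", config.OPT)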

@fcharras
Author

fcharras commented Oct 4, 2023

Oh no, actually I'm still having some segfaults in the sklearn-numba-dpex test suite, but much fewer.

@fcharras
Author

fcharras commented Oct 4, 2023

So I've bisected the recent numba_dpex history and things start breaking at d3d7ef6.

d3d7ef6 comes along with the merge commit 547382d of #1112. It's the second commit of this two-commit PR.

Could d3d7ef6 be reverted as a hotfix?

(So it does not seem related to numba / llvmlite versions after all.)

Edit:

More specifically, reverting the

    pmb.inlining_threshold = 2

addition does fix the issues. It seems that otherwise keeping

    cres.library._optimize_final_module()

is fine.
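
For context (my reading, not verified against the numba_dpex source): pmb in that snippet is presumably an llvmlite PassManagerBuilder, whose inlining_threshold enables and bounds the LLVM inliner. A standalone sketch using llvmlite's legacy pass-manager API (available in the llvmlite 0.40/0.41 range discussed here), purely to illustrate the knob, not numba_dpex's actual pipeline code:

    # Build a module pass manager with inlining enabled, to show what setting
    # inlining_threshold on a PassManagerBuilder does.
    import llvmlite.binding as llvm

    llvm.initialize()
    llvm.initialize_native_target()
    llvm.initialize_native_asmprinter()

    pmb = llvm.create_pass_manager_builder()
    pmb.opt_level = 2
    # A non-zero threshold turns the inliner on; the value bounds how eagerly
    # callees are inlined into their callers.
    pmb.inlining_threshold = 2

    pm = llvm.create_module_pass_manager()
    pmb.populate(pm)
    print("module pass manager populated with inlining_threshold = 2")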

@diptorupd
Contributor

More specifically, reverting the

    pmb.inlining_threshold = 2

Yes, even as I was investigating I had a feeling that setting an aggressive inlining level might also be an issue. OK. Can you please test #906 with and without pmb.inlining_threshold = 2? I want to create a unit test as well.

@fcharras
Author

fcharras commented Oct 4, 2023

I don't think #906 can be related to pmb.inlining_threshold = 2, since this addition is from August 18th but #906 was first reported in February.

Sorry, maybe I'm adding confusion by putting this issue, #1106, and #906 all in the same bucket; it somewhat feels like all of those might have a common cause, but maybe not.

@fcharras
Author

fcharras commented Oct 4, 2023

In fact, with the most recent tag 0.22.0dev0, I can't reproduce the issues from the #906 examples now; pmb.inlining_threshold = 2 has no impact.

Did something happen? I would not be positive that it is definitely fixed before I see more long-term stability (those kinds of reproducers seemed to be sensitive to details such as work group sizes...), but maybe there's been a step forward!

With 0.21.0dev1 (which is the most stable tag I've otherwise found for the dpex.kernel feature, see #1157), the bug is still reproducible 🤷

@fcharras
Author

fcharras commented Oct 4, 2023

#1106 is still reproducible in 0.22.0dev0 though, and it is not sensitive to pmb.inlining_threshold = 2 either.

@diptorupd
Contributor

@fcharras will you be able to join our Gitter channel (https://matrix.to/#/#Data-Parallel-Python_community:gitter.im)? We can discuss with you more quickly in real time that way.

@fcharras
Author

fcharras commented Oct 6, 2023

I confirm that #1158 fixes the recent stability issues that were initially reported here, so I think this issue can be closed.

Not all compilation issues are fixed yet, however: the reproducer in #1106 still miscompiles.

@fcharras closed this as completed on Oct 6, 2023