This kernel outputs wrong results on CPU, featuring nested kernel funcs, private and local memories, and barriers #1106
Comments
@fcharras I can reproduce the issue as well. I am investigating, and will update as soon as I have a root cause identified. |
I've found a simpler workaround. Instead of the suggested

```python
else:
    result[window_col_idx] += increment
    # The bug disappears if the previous instruction is replaced
    # with an atomic add (which it shouldn't need to be, since result is
    # private memory here, so there can't be conflicts)
    # dpex.atomic.add(result, window_col_idx, increment)
```

this:

```python
else:
    result[window_col_idx] += increment
    # The bug disappears if summing +1 -1 to the increment
    # to this instruction
    # result[window_col_idx] += increment + 1 - 1
```

fixes the kernel (basically trying to add random neutral instructions that somehow trigger a non-bugged compilation path; here it seems that adding `+ 1 - 1` is enough). Good enough for me for the time being, since atomic add on private memory was what caused #1156 for me.
|
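For readers without the collapsed reproducer at hand, here is a hypothetical, self-contained sketch of the pattern being discussed: accumulating into a private-memory array inside a kernel, with the "neutral instruction" workaround applied. It is not the actual reproducer, and the API spelling `dpex.private.array` is an assumption based on the numba_dpex version used in this thread.

```python
import dpnp
import numba_dpex as dpex
from numba import float32


@dpex.kernel
def sketch(data, out):
    # Hypothetical kernel: accumulate into a private-memory array, then
    # reduce it into global memory. `result` lives in private memory, so
    # no atomics should ever be needed for it.
    i = dpex.get_global_id(0)
    result = dpex.private.array(shape=4, dtype=float32)
    for w in range(4):
        result[w] = 0.0
    for w in range(4):
        increment = data[i]
        # reported buggy path on CPU:
        # result[w] += increment
        # "neutral instruction" workaround that changes the compiled code:
        result[w] += increment + 1 - 1
    out[i] = result[0] + result[1] + result[2] + result[3]


data = dpnp.ones(16, dtype=dpnp.float32)
out = dpnp.zeros(16, dtype=dpnp.float32)
sketch[dpex.Range(16)](data, out)
print(out)  # expected: all 4.0
```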
The previous trick does fix the reproducer, but does not fix my real-world use case, so I keep looking. One thing that does solve both on main after #1158 is setting
|
@fcharras @roxx30198 @Hardcode84 Based on the correct root cause provided by @Hardcode84 in #1204, I have investigated the issue and I think I have a proper fix.

```python
import dpnp
import numba_dpex as dpex


@dpex.kernel
def twice(A, B):
    i = dpex.get_global_id(0)
    if i < 1:
        A[0] = 880
    dpex.barrier(dpex.LOCAL_MEM_FENCE)  # local mem fence
    if i < 0:
        B[0] = 990


arr = dpnp.arange(1024, dtype=dpnp.float32)
arr2 = dpnp.arange(1024, dtype=dpnp.float32)
twice[dpex.Range(1024)](arr, arr2)
print(arr)
```

Using the above pseudo reproducer, I generated the code using the existing implementation. The original compiled code, before any of my changes, never returns (at least on my test setup), gets stuck, and had to be killed. After my changes, I recompiled the same function and it generates the following CFG. The behavior is pretty much what @Hardcode84 explained in #1204. However, I was not able to get it working using the

For reference, a hand-written OpenCL C kernel with the same pattern:

```c
__kernel void test(
    __global float *g_idata,
    __global float *g_odata
)
{
    unsigned int tid = get_local_id(0);
    if (tid < 2)
        g_idata[tid] = 880;
    barrier(CLK_LOCAL_MEM_FENCE);
    if (tid < 1)
        g_odata[tid] = 990;
}
```

The CFG for the code clang generated using the command

As can be observed, after my changes the CFG of the code generated by numba-dpex matches the clang-generated CFG. With all that build up, here is the change I made to our barrier code generation:

```python
barrier.attributes.add("convergent")
callinst = builder.call(barrier, [flags])
callinst.attributes.add("convergent")
callinst.attributes.add("nounwind") The main thing was to add the So, now that I have provided (what I think is) the good news, let me come to the bad news. https://github.com/numba/llvmlite/blob/da22592b9409b67d2d67330f59b3972a66a99ff9/llvmlite/ir/values.py#L881 I am going to open a PR for --- /tmp/MG5FLU_oclimpl.py
+++ /home/diptorupd/Desktop/devel/numba-dpex/numba_dpex/ocl/oclimpl.py
@@ -105,7 +105,10 @@
barrier = _declare_function(
context, builder, "barrier", sig, ["unsigned int"]
)
- builder.call(barrier, [flags])
+ barrier.attributes.add("convergent")
+ callinst = builder.call(barrier, [flags])
+ callinst.attributes.add("convergent")
+ callinst.attributes.add("nounwind")
return _void_value
@@ -116,8 +119,11 @@
barrier = _declare_function(
context, builder, "barrier", sig, ["unsigned int"]
)
+ barrier.attributes.add("convergent")
flags = context.get_constant(types.uint32, stubs.GLOBAL_MEM_FENCE)
- builder.call(barrier, [flags])
+ callinst = builder.call(barrier, [flags])
+ callinst.attributes.add("convergent")
+ callinst.attributes.add("nounwind")
return _void_value
@@ -138,8 +144,11 @@
barrier = _declare_function(
context, builder, "barrier", sig, ["unsigned int"]
)
+ barrier.attributes.add("convergent")
flags = context.get_constant(types.uint32, stubs.LOCAL_MEM_FENCE)
- builder.call(barrier, [flags])
+ callinst = builder.call(barrier, [flags])
+ callinst.attributes.add("convergent")
+ callinst.attributes.add("nounwind")
return _void_value
|
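As a side note for anyone who wants to poke at the attribute handling outside of numba_dpex: below is a small standalone llvmlite sketch, written for this write-up rather than taken from numba_dpex, that declares a `barrier` function, calls it, and attaches the same `convergent`/`nounwind` attributes as the patch above. Depending on the llvmlite version, `attributes.add("convergent")` may raise `ValueError`, since llvmlite validates attribute names against a fixed set; that restriction is presumably what the llvmlite link above refers to.

```python
import llvmlite.ir as ir

mod = ir.Module(name="barrier_demo")
int32 = ir.IntType(32)

# Declare an OpenCL-style barrier(uint) and a caller function.
barrier_ty = ir.FunctionType(ir.VoidType(), [int32])
barrier = ir.Function(mod, barrier_ty, name="barrier")

kernel_ty = ir.FunctionType(ir.VoidType(), [])
kernel = ir.Function(mod, kernel_ty, name="kernel")
builder = ir.IRBuilder(kernel.append_basic_block(name="entry"))

flags = ir.Constant(int32, 1)  # e.g. CLK_LOCAL_MEM_FENCE
try:
    barrier.attributes.add("convergent")
    callinst = builder.call(barrier, [flags])
    callinst.attributes.add("convergent")
    callinst.attributes.add("nounwind")
except ValueError as exc:
    # Older llvmlite rejects attribute names it does not know about.
    print("llvmlite rejected the attribute:", exc)

builder.ret_void()
print(mod)  # dump the textual LLVM IR
```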
@fcharras I tested your reproducer with my patch and the llvmlite patch; it works as expected on |
```python
import dpnp
import numba_dpex


@numba_dpex.kernel
def _pathfinder_kernel(prev, deviceWall, cols, cur_row, result):
    current_element = numba_dpex.get_global_id(0)
    left_ind = current_element - 1 if current_element >= 1 else current_element
    up_ind = current_element
    index = cur_row * cols + current_element
    left = prev[left_ind]
    up = prev[up_ind]
    shortest = left if left <= up else up
    numba_dpex.barrier(numba_dpex.GLOBAL_MEM_FENCE)
    prev[current_element] = deviceWall[index] + shortest
    numba_dpex.barrier(numba_dpex.GLOBAL_MEM_FENCE)
    result[current_element] = prev[current_element]


def pathfinder(data, cols, result):
    # create a temp list that holds the first row of data as the first
    # element and an empty numpy array as the second element
    device_dest = dpnp.array(data[:cols], dtype=dpnp.int64)  # first row
    device_wall = dpnp.array(data[cols:], dtype=dpnp.int64)
    _pathfinder_kernel[numba_dpex.Range(cols)](
        device_dest, device_wall, cols, 0, result
    )


data = dpnp.array([3, 0, 7, 5, 6, 5, 4, 2], dtype=dpnp.int64)
res = dpnp.zeros(shape=(4), dtype=dpnp.int64)
pathfinder(data, 4, res)
print(res)
```

This still produces an incorrect result with the patch provided.
|
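To make "incorrect result" concrete: reading the kernel above as "every work-item reads `prev` before the first barrier, then writes", a host-side reference gives `[9 5 4 7]` for this input. The reference below is my own sketch for illustration, not part of the original report.

```python
import numpy as np


def pathfinder_reference(data, cols):
    # Mirrors _pathfinder_kernel for cur_row == 0: all reads of `prev`
    # happen before the barrier, so compute from a snapshot of the first row.
    prev = np.array(data[:cols], dtype=np.int64)
    wall = np.array(data[cols:], dtype=np.int64)
    out = np.empty(cols, dtype=np.int64)
    for i in range(cols):
        left = prev[i - 1] if i >= 1 else prev[i]
        up = prev[i]
        out[i] = wall[i] + min(left, up)
    return out


print(pathfinder_reference([3, 0, 7, 5, 6, 5, 4, 2], 4))  # -> [9 5 4 7]
```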
Thanks for the reproducer. The issue is happening for the same reason: the second barrier is accessible only from inside a branch. However, the root cause that triggers it is not the same as above. I am seeing the incorrect code getting generated even when we turn off all LLVM optimizations. I will investigate and update. If I am positive that it is a different issue, I will move it to a separate ticket.

UPDATE: Well, I had made a small change to your code that led to the obvious barrier codegen issue. If I revert back to your original code, then it is no longer obvious to me what the issue is. I will keep hunting.
|
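For illustration (this sketch is mine, not code from the thread, and it is deliberately never launched): the invalid shape being described is a work-group barrier that only some work-items can reach. Written directly in a kernel it looks like the snippet below. The OpenCL/SYCL rule is that either all work-items of a work-group reach a barrier or none do, so control flow of this shape explains both the hangs and the wrong results; the bug is that the compiler produces it from valid source.

```python
import numba_dpex as dpex


@dpex.kernel
def barrier_in_branch(A):
    # Anti-pattern sketch: the barrier is guarded by a condition that is
    # true for only one work-item, so not all work-items in the work-group
    # reach it. Do not launch this kernel.
    i = dpex.get_global_id(0)
    if i < 1:
        A[0] = 880
        dpex.barrier(dpex.LOCAL_MEM_FENCE)
```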
After some investigation on my side, it appears that the issue is related to the fact that

Stable small reproducer:

```python
import dpnp
import numba_dpex


@numba_dpex.kernel
def kernel(arr, copy, res):
    i = numba_dpex.get_global_id(0)
    copy[i] = arr[i]
    numba_dpex.barrier(numba_dpex.GLOBAL_MEM_FENCE)
    # get shifted data
    res[i] = copy[(i + 1) % arr.size]


arr = dpnp.ones(3, dtype=dpnp.int64)
copy = dpnp.zeros_like(arr)
res = dpnp.zeros_like(arr)
numba_dpex.call_kernel(kernel, numba_dpex.Range(arr.size), arr, copy, res)

# Expected:
# [1 1 1], but on "opencl:cpu" it is [0 0 1]
print(res)
```
|
Closing as fixed by recent |
I recently bumped all dependencies of our KMeans project built on numba_dpex at https://github.com/soda-inria/sklearn-numba-dpex/ , including a bump to oneAPI 2023.2.0, the latest numba_dpex, numba>=0.57, drivers, etc.
A few tests that used to be stable have started to fail when run on CPU; the cause is that some kernels output wrong values. It has similar symptoms and consequences to #906. Those issues really shake the confidence developers need to embrace the stack and build more complex systems on top of it. I hope the reproducers I try to craft can unlock progress on this front.
The reproducer is a fairly complicated kernel with nested kernel funcs, using private and local memories, and barriers. I haven't managed to reduce it to anything simpler, so here it is:
Click to expand the reproducer
The reproducer contains two workarounds that are commented out (i.e. the correct output will be printed if either of the two workarounds is enabled).
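The actual reproducer lives in the collapsed section above and is not reproduced here. Purely to illustrate the kind of structure being described (a kernel calling a nested kernel func, private and local memory, and a barrier), here is a hypothetical minimal sketch. It is not the reproducer, it is not known to trigger the bug, and the exact API spellings (`dpex.func`, `dpex.local.array`, `dpex.private.array`, `dpex.NdRange`) are assumptions about the numba_dpex version in use.

```python
import dpnp
import numba_dpex as dpex
from numba import float32

WORK_GROUP_SIZE = 64


@dpex.func
def accumulate(private_buf, value):
    # nested kernel func working on a private-memory buffer
    private_buf[0] += value
    return private_buf[0]


@dpex.kernel
def sketch(data, out):
    gid = dpex.get_global_id(0)
    lid = dpex.get_local_id(0)
    local_buf = dpex.local.array(shape=WORK_GROUP_SIZE, dtype=float32)
    private_buf = dpex.private.array(shape=1, dtype=float32)
    private_buf[0] = 0.0
    local_buf[lid] = data[gid]
    dpex.barrier(dpex.LOCAL_MEM_FENCE)
    out[gid] = accumulate(private_buf, local_buf[lid])


data = dpnp.ones(256, dtype=dpnp.float32)
out = dpnp.zeros(256, dtype=dpnp.float32)
sketch[dpex.NdRange(dpex.Range(256), dpex.Range(WORK_GROUP_SIZE))](data, out)
print(out)
```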