This kernel outputs wrong results on CPU, featuring nested kernel funcs, private and local memories, and barriers #1106

Closed
fcharras opened this issue Aug 2, 2023 · 10 comments
Labels: bug (Something isn't working), user (User submitted issue)


fcharras commented Aug 2, 2023

I recently bumped all dependencies of our KMeans project built with numba_dpex at https://github.com/soda-inria/sklearn-numba-dpex/, including a bump to oneAPI 2023.2.0, the latest numba_dpex, numba>=0.57, drivers, etc.

A few tests that used to be stable started to fail when run on CPU; the cause is that some kernels output wrong values. The symptoms and consequences are similar to #906. Issues like these really shake the confidence developers need to embrace the stack and build more complex systems on top of it. I hope the reproducers I try to craft can unlock progress on this front.

The reproducer is a fairly complicated kernel with nested kernel funcs, using private and local memories, and barriers. I haven't managed to reduce it to anything simpler, so here it is:

import numba_dpex as dpex
import dpctl.tensor as dpt

dtype = dpt.float32

nb_Y_window_cols = 8
nb_Y_window_rows = 10
Y_window_shape = nb_Y_window_rows, nb_Y_window_cols


@dpex.kernel
def columnwise_broadcast_dot_product(
    X,             # IN,  shape (10,)
    Y,             # IN,  shape (10, 10)
    dot_products,  # OUT, shape (10,)
):
    """Write into dot_products the results of each dot product of X with a 
    column of Y
    """

    # All work items will cooperate to load windows of Y data into shared
    # memory. Each work item is responsible for loading a single slot in
    # shared memory, at following coordinates:
    Y_local_row_idx = dpex.get_local_id(1)
    Y_local_col_idx = dpex.get_local_id(0)


    # The only work item that will actually compute the dot products and write
    # them to global memory is the first work item
    is_first_work_item = (dpex.get_global_id(0) == 0) and (dpex.get_global_id(1) == 0)

    # allocate shared memory where windows of Y will be cooperatively loaded
    Y_window = dpex.local.array(shape=Y_window_shape, dtype=dtype)

    # allocate private memory where dot products will be accumulated by the
    # first work item
    partial_dot_products = dpex.private.array(shape=8, dtype=dtype)

    first_Y_col_idx = 0

    for Y_window_idx in range(2):  # Y has 10 cols so is covered by 2 * 8 cols
        # NB: to ensure no issue related to
        # https://github.com/IntelPython/numba-dpex/issues/906
        # most of the code between barriers is moved to kernel funcs.

        is_last_Y_window = Y_window_idx >= 1

        loading_Y_col_idx = first_Y_col_idx + Y_local_col_idx

        # load the window over Y in shared memory. Note that work items that
        # are mapped to out-of-bounds values of Y will just write a 0 in
        # the slot
        load_Y_window(
            loading_Y_col_idx,
            Y_local_col_idx,
            Y_local_row_idx,
            Y,
            Y_window,
        )

        # wait until the memory loading is done before continuing
        dpex.barrier(dpex.LOCAL_MEM_FENCE)

        # Now the first work item computes dot products, from reading
        # values in X and Y_window
        accumulate_dot_products(
            is_first_work_item,
            X,
            Y_window,
            is_last_Y_window,
            partial_dot_products
        )

        # wait for the first work item to finish reading into local memory
        # before starting loading the next window
        dpex.barrier(dpex.LOCAL_MEM_FENCE)

        # write the results back to global memory, using, yet again, only
        # the first work item
        write_results(
            is_first_work_item,
            first_Y_col_idx,
            is_last_Y_window,
            partial_dot_products,
            dot_products
        )

        # move the boundaries to the next window over Y
        first_Y_col_idx += 8



@dpex.func
def load_Y_window(
    loading_Y_col_idx,
    Y_local_col_idx,
    Y_local_row_idx,
    Y,
    Y_window,
):
    if (Y_local_row_idx < 10) and (Y_local_col_idx < 10):
        value = Y[Y_local_row_idx, loading_Y_col_idx]
    else:
        value = 0
    Y_window[Y_local_row_idx, Y_local_col_idx] = value


@dpex.func
def accumulate_dot_products(
    is_first_work_item,
    X,
    Y_window,
    is_last_Y_window,
    result,
):
    if is_last_Y_window:
        accumulate_dot_products_last_window(
                is_first_work_item,
                X,
                Y_window,
                result,
            )
    else:
        accumulate_dot_products_first_windows(
                is_first_work_item,
                X,
                Y_window,
                result,
            )


@dpex.func
def accumulate_dot_products_first_windows(
    is_first_work_item,
    X,
    Y_window,
    result,
):
    for row_idx in range(10):

        if is_first_work_item:
            X_value = X[row_idx]
        else:
            X_value = 0

        is_first_row = row_idx == 0

        for window_col_idx in range(8):  # <-- the difference is here
            Y_value = (Y_window[row_idx, window_col_idx])
            increment = Y_value * X_value

            if is_first_row:
                result[window_col_idx] = increment

            else:
                result[window_col_idx] += increment


@dpex.func
def accumulate_dot_products_last_window(
    is_first_work_item,
    X,
    Y_window,
    result,
):
    for row_idx in range(10):

        if is_first_work_item:
            X_value = X[row_idx]
        else:
            X_value = 0

        is_first_row = row_idx == 0

        for window_col_idx in range(2):  # <-- the difference is here
            Y_value = (Y_window[row_idx, window_col_idx])
            increment = Y_value * X_value

            if is_first_row:
                result[window_col_idx] = increment

            else:
                result[window_col_idx] += increment
                # The bug disappears if the previous instruction is replaced
                # with an atomic add (which shouldn't matter, since result is
                # private memory here, so there can't be conflicts)
                # dpex.atomic.add(result, window_col_idx, increment)


# Or, the bug disappears if the accumulation is defined in one single kernel
# func rather than dispatching to nested kernel funcs:

# @dpex.func
# def accumulate_dot_products(
#     global_flattened_idx,
#     X,
#     Y_window,
#     is_last_Y_window,
#     result,
# ):
#     for row_idx in range(10):

#         if global_flattened_idx < 1:
#             X_value = X[row_idx]
#         else:
#             X_value = 0

#         is_first_row = row_idx == 0

#         window_width = 2 if is_last_Y_window else 8

#         for window_col_idx in range(window_width):
#             Y_value = (Y_window[row_idx, window_col_idx])
#             increment = Y_value * X_value

#             if is_first_row:
#                 result[window_col_idx] = increment

#             else:
#                 result[window_col_idx] += increment


@dpex.func
def write_results(
    is_first_work_item,
    first_Y_col_idx,
    is_last_Y_window,
    partial_dot_products,
    dot_products
):
    if is_last_Y_window:
        L = 2
    else:
        L = 8

    for i in range(L):
        if is_first_work_item:
            dot_products[first_Y_col_idx + i] = partial_dot_products[i]


Y = dpt.asarray(
    dpt.reshape(dpt.arange(0, 100, dtype=dtype), (10, 10)).T, copy=True
)
X = dpt.asarray(Y[:, -1], copy=True)
dot_products = dpt.empty((10,), dtype=dtype)
columnwise_broadcast_dot_product[(8, 10), (8, 10)](X, Y, dot_products)
print(dot_products)
# expected: [ 4335. 13785. 23235. 32685. 42135. 51585. 61035. 70485. 79935. 89385.]

The reproducer contains two workarounds that are commented out (i.e. the correct output is printed if either of the two workarounds is uncommented).
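For reference, a minimal host-side sketch of the expected result, assuming NumPy is available (it mirrors the dpt array construction above and is not part of the kernel reproducer itself):

import numpy as np

A = np.arange(0, 100, dtype=np.float32).reshape(10, 10)
Y_ref = A.T                  # same contents as the dpctl Y above
X_ref = Y_ref[:, -1].copy()  # same contents as X
expected = X_ref @ Y_ref     # dot product of X with each column of Y
print(expected)
# [ 4335. 13785. 23235. 32685. 42135. 51585. 61035. 70485. 79935. 89385.]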

fcharras changed the title from "Incorrect compiled code that outputs wrong results on CPU, featuring nested kernel funcs, private and local memories, and barriers" to "This kernel outputs wrong results on CPU, featuring nested kernel funcs, private and local memories, and barriers" on Aug 2, 2023
diptorupd self-assigned this on Aug 8, 2023
diptorupd added the bug (Something isn't working) and user (User submitted issue) labels on Aug 8, 2023
diptorupd (Contributor) commented

@fcharras I can reproduce the issue as well. I am investigating and will update as soon as I have identified a root cause.

fcharras (Author) commented Oct 6, 2023

I've found a simpler workaround: instead of the suggested

            else:
                result[window_col_idx] += increment
                # The bug disappears if the previous instruction is replaced
                # with an atomic add (which shouldn't matter, since result is
                # private memory here, so there can't be conflicts)
                # dpex.atomic.add(result, window_col_idx, increment)

this:

            else:
                result[window_col_idx] += increment
                # The bug disappears if +1 -1 is added to the increment
                # in this instruction
                # result[window_col_idx] += increment + 1 - 1

fixes the kernel.

(Basically, I'm trying to add random neutral instructions that somehow trigger a non-bugged compilation path; here it seems that adding +1 -1 also fixes the issue.)

Good enough for me for the time being, since atomic add on private memory was what caused #1156 for me.

fcharras (Author) commented Oct 6, 2023

The previous trick does fix the reproducer, but it does not fix my real-world use case, so I keep looking.

One thing that does solve both, on main after #1158, is setting NUMBA_DPEX_OPT=0. So it seems to be related to compiler optimizations after all.

NUMBA_DPEX_OPT=1 already gives wrong output (all-NaN results).
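A minimal sketch of applying the NUMBA_DPEX_OPT=0 workaround from inside a script, assuming the variable is read when numba_dpex is imported and kernels are compiled (setting it in the shell before launching Python is the safer option):

import os

# must be set before numba_dpex reads its configuration
os.environ["NUMBA_DPEX_OPT"] = "0"

import numba_dpex as dpex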

diptorupd (Contributor) commented Nov 29, 2023

@fcharras @roxx30198 @Hardcode84

Based on the correct root cause provided by @Hardcode84 in #1204, I have investigated the issue and I think I have a proper fix.

import dpnp
import numba_dpex as dpex


@dpex.kernel
def twice(A, B):
    i = dpex.get_global_id(0)

    if i < 1:
        A[0] = 880
    dpex.barrier(dpex.LOCAL_MEM_FENCE)  # local mem fence
    if i < 0:
        B[0] = 990


arr = dpnp.arange(1024, dtype=dpnp.float32)
arr2 = dpnp.arange(1024, dtype=dpnp.float32)

twice[dpex.Range(1024)](arr, arr2)
print(arr)

Using the above pseudo reproducer, I generated the code using existing numba-dpex main branch. The control flow graph (CFG) for the code is embedded next:

[CFG image: numba-dpex main branch, before the fix]

The original compiled code, before any of my changes, never returns on my test setup: it gets stuck and has to be killed.

After my changes, I recompiled the same function and it generates the following CFG.

[CFG image: numba-dpex after my changes]

The behavior is pretty much what @Hardcode84 explained in #1204. However, I was not able to get it working using the noduplicate attribute. Instead, I looked at the code generated by clang for an essentially similar OpenCL reproducer:

__kernel void test(
    __global float *g_idata,
    __global float *g_odata
)
{
    unsigned int tid = get_local_id(0);

    if(tid < 2)
        g_idata[tid] = 880;

    barrier(CLK_LOCAL_MEM_FENCE);

    if( tid < 1)
        g_odata[tid] = 990;
}

The CFG for the code clang generated using the command clang -O3 -flto -fveclib=none -target spir64-unknown-unknown -c -x cl -S -emit-llvm -cl-std=CL2.0 -Xclang -finclude-default-header test.cl -o test.ll is attached next:

[CFG image: clang-generated OpenCL code]

As can be observed, after my changes the code generated by numba-dpex matches the one generated by clang.

With all that build-up, here is the change I made to our barrier code generation:

barrier.attributes.add("convergent")
callinst = builder.call(barrier, [flags])
callinst.attributes.add("convergent")
callinst.attributes.add("nounwind")

The main thing was to add the convergent attribute to both the function declaration and the call site. The nounwind attribute is not the relevant part here. If you are interested in understanding what the attribute does, refer to https://llvm.org/docs/ConvergentOperations.html

So, now that I have provided (what I think is) the good news, let me come to the bad news: llvmlite does not support the convergent attribute yet. The following two places will need to be patched; we just need to add the string "convergent" to these sets.

https://github.com/numba/llvmlite/blob/da22592b9409b67d2d67330f59b3972a66a99ff9/llvmlite/ir/values.py#L881
https://github.com/numba/llvmlite/blob/da22592b9409b67d2d67330f59b3972a66a99ff9/llvmlite/ir/instructions.py#L55

I am going to open a PR for llvmlite, but for now monkey-patching llvmlite is the only recourse. Here is my patch, if you want to test it out.

--- /tmp/MG5FLU_oclimpl.py
+++ /home/diptorupd/Desktop/devel/numba-dpex/numba_dpex/ocl/oclimpl.py
@@ -105,7 +105,10 @@
     barrier = _declare_function(
         context, builder, "barrier", sig, ["unsigned int"]
     )
-    builder.call(barrier, [flags])
+    barrier.attributes.add("convergent")
+    callinst = builder.call(barrier, [flags])
+    callinst.attributes.add("convergent")
+    callinst.attributes.add("nounwind")
     return _void_value
 
 
@@ -116,8 +119,11 @@
     barrier = _declare_function(
         context, builder, "barrier", sig, ["unsigned int"]
     )
+    barrier.attributes.add("convergent")
     flags = context.get_constant(types.uint32, stubs.GLOBAL_MEM_FENCE)
-    builder.call(barrier, [flags])
+    callinst = builder.call(barrier, [flags])
+    callinst.attributes.add("convergent")
+    callinst.attributes.add("nounwind")
     return _void_value
 
 
@@ -138,8 +144,11 @@
     barrier = _declare_function(
         context, builder, "barrier", sig, ["unsigned int"]
     )
+    barrier.attributes.add("convergent")
     flags = context.get_constant(types.uint32, stubs.LOCAL_MEM_FENCE)
-    builder.call(barrier, [flags])
+    callinst = builder.call(barrier, [flags])
+    callinst.attributes.add("convergent")
+    callinst.attributes.add("nounwind")
     return _void_value
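
For illustration, a minimal standalone llvmlite sketch of the same attribute mechanics, outside of numba-dpex. It assumes an llvmlite build that accepts "convergent" (i.e. one carrying the patch above); an unpatched llvmlite will reject the attribute:

from llvmlite import ir

mod = ir.Module(name="barrier_demo")

# declare: void barrier(i32), marked convergent
barrier_ty = ir.FunctionType(ir.VoidType(), [ir.IntType(32)])
barrier = ir.Function(mod, barrier_ty, name="barrier")
barrier.attributes.add("convergent")

# a caller that invokes the barrier, with convergent/nounwind at the call site
caller = ir.Function(mod, ir.FunctionType(ir.VoidType(), []), name="caller")
builder = ir.IRBuilder(caller.append_basic_block("entry"))
callinst = builder.call(barrier, [ir.Constant(ir.IntType(32), 1)])
callinst.attributes.add("convergent")
callinst.attributes.add("nounwind")
builder.ret_void()

print(mod)  # the printed IR should show 'convergent' on both the declaration and the call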
 
 

diptorupd (Contributor) commented

I am going to open a PR for llvmlite

numba/llvmlite#1016

diptorupd (Contributor) commented

@fcharras I tested your reproducer with my patch and the llvmlite patch; it works as expected on opencl:cpu, opencl:gpu, and level_zero:gpu.

roxx30198 (Contributor) commented

import dpnp

import numba_dpex

@numba_dpex.kernel
def _pathfinder_kernel(prev, deviceWall, cols, cur_row, result):
    current_element = numba_dpex.get_global_id(0)
    left_ind = current_element - 1 if current_element >= 1 else current_element
    up_ind = current_element

    index = cur_row * cols + current_element
    left = prev[left_ind]
    up = prev[up_ind]

    shortest = left if left <= up else up

    numba_dpex.barrier(numba_dpex.GLOBAL_MEM_FENCE)
    prev[current_element] = deviceWall[index] + shortest
    numba_dpex.barrier(numba_dpex.GLOBAL_MEM_FENCE)
    result[current_element] = prev[current_element]


def pathfinder(data, cols, result):
    # split the input: the first row becomes the initial prev buffer,
    # the remaining rows are the wall
    device_dest = dpnp.array(data[:cols], dtype=dpnp.int64)  # first row
    device_wall = dpnp.array(data[cols:], dtype=dpnp.int64)

    _pathfinder_kernel[numba_dpex.Range(cols)](
        device_dest, device_wall, cols, 0, result
    )


data = dpnp.array(
    [3, 0, 7, 5, 6, 5, 4, 2], dtype=dpnp.int64
)

res = dpnp.zeros(shape=(4), dtype=dpnp.int64)

pathfinder(data, 4, res)
print(res)

This still produces an incorrect result with the patch provided.
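
For reference, a serial NumPy sketch of what the kernel should compute for this input, under my reading of the kernel (not taken from the original report); with that reading the expected output is [9 5 4 7]:

import numpy as np

data = np.array([3, 0, 7, 5, 6, 5, 4, 2], dtype=np.int64)
cols = 4
prev = data[:cols]  # [3, 0, 7, 5], the first row
wall = data[cols:]  # [6, 5, 4, 2]; cur_row == 0, so index == current_element

expected = np.empty(cols, dtype=np.int64)
for i in range(cols):
    left = prev[i - 1] if i >= 1 else prev[i]
    up = prev[i]
    expected[i] = wall[i] + min(left, up)
print(expected)  # [9 5 4 7]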

diptorupd (Contributor) commented Nov 30, 2023

This still produces an incorrect result with the patch provided.

Thanks for the reproducer. The issue is happening for the same reason: the second barrier is reachable only inside a branch. However, the root cause that triggers the issue is not the same as above. I am seeing incorrect code being generated even when we turn off all LLVM optimizations (O0). It is perhaps due to some issue in the way Numba lowers the code.

I will investigate and update. If I am positive that it is a different issue, I will move it to a separate ticket.

UPDATE: Well, I had made a small change to your code that actually introduced the obvious barrier codegen issue. If I revert to your original code, then it is no longer quite obvious to me what the issue is. I will keep hunting.

@diptorupd diptorupd added this to the 0.22 milestone Dec 19, 2023
ZzEeKkAa (Contributor) commented Feb 2, 2024

After some investigation on my side, it appears that the issue is related to the fact that group_barrier does not work for ranges and is supported only for nd_ranges. This is consistent with the fact that dpcpp does not provide syntax to use group_barrier with plain ranges; a related question can be found on Stack Overflow.

Stable small reproducer:

import dpnp

import numba_dpex

@numba_dpex.kernel
def kernel(arr, copy, res):
    i = numba_dpex.get_global_id(0)

    copy[i] = arr[i]
    numba_dpex.barrier(numba_dpex.GLOBAL_MEM_FENCE)

    # get shifted data
    res[i] = copy[(i+1)%arr.size]

arr = dpnp.ones(3, dtype=dpnp.int64)
copy = dpnp.zeros_like(arr)
res = dpnp.zeros_like(arr)

numba_dpex.call_kernel(kernel, numba_dpex.Range(arr.size), arr, copy, res)

# Expected:
# [1 1 1], but on "opencl:cpu" it is [0 0 1]
print(res)
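
If this diagnosis is right, the user-side fix is to launch over an nd_range so the barrier operates on a well-defined work-group. A minimal sketch, assuming this numba_dpex version exposes NdRange for call_kernel (choosing the local size equal to the global size keeps all work items in a single group):

numba_dpex.call_kernel(
    kernel,
    numba_dpex.NdRange(numba_dpex.Range(arr.size), numba_dpex.Range(arr.size)),
    arr,
    copy,
    res,
)
print(res)  # should print [1 1 1] once the barrier is well-defined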

diptorupd (Contributor) commented

Closing as fixed by the recent group_barrier fixes. Please reopen if you still encounter the problem.
