This kernel outputs wrong results on CPU, featuring nested kernel funcs, private and local memories, and barriers #1106

Closed
fcharras opened this issue Aug 2, 2023 · 10 comments
Labels: bug (Something isn't working), user (User submitted issue)


fcharras commented Aug 2, 2023

I recently bumped all dependencies of our KMeans project built with numba_dpex at https://github.com/soda-inria/sklearn-numba-dpex/, including a bump to oneAPI 2023.2.0, the latest numba_dpex, numba>=0.57, drivers, etc.

A few tests that used to be stable started to fail when run on CPU; the cause is that some kernels output wrong values. The symptoms and consequences are similar to #906. Issues like these really shake the confidence developers need to embrace the stack and build more complex systems on top of it. I hope the reproducers I try to craft can unlock progress on this front.

The reproducer is a fairly complicated kernel with nested kernel funcs, using private and local memories, and barriers. I haven't managed to reduce it to anything simpler, so here it is:

import numba_dpex as dpex
import dpctl.tensor as dpt

dtype = dpt.float32

nb_Y_window_cols = 8
nb_Y_window_rows = 10
Y_window_shape = nb_Y_window_rows, nb_Y_window_cols


@dpex.kernel
def columnwise_broadcast_dot_product(
    X,             # IN,  shape (10,)
    Y,             # IN,  shape (10, 10)
    dot_products,  # OUT, shape (10,)
):
    """Write into dot_products the results of each dot product of X with a 
    column of Y
    """

    # All work items will cooperate to load windows of Y data into shared
    # memory. Each work item is responsible for loading a single slot in
    # shared memory, at following coordinates:
    Y_local_row_idx = dpex.get_local_id(1)
    Y_local_col_idx = dpex.get_local_id(0)


    # The only work item that will actually compute the dot products and write
    # them to global memory is the first work item
    is_first_work_item = (dpex.get_global_id(0) == 0) and (dpex.get_global_id(1) == 0)

    # allocate shared memory where windows of Y will be cooperatively loaded
    Y_window = dpex.local.array(shape=Y_window_shape, dtype=dtype)

    # allocate private memory where dot products will be accumulated by the
    # first work item
    partial_dot_products = dpex.private.array(shape=8, dtype=dtype)

    first_Y_col_idx = 0

    for Y_window_idx in range(2):  # Y has 10 cols so is covered by 2 * 8 cols
        # NB: to ensure no issue related to
        # https://github.com/IntelPython/numba-dpex/issues/906
        # most of the code between barriers is moved to kernel funcs.

        is_last_Y_window = Y_window_idx >= 1

        loading_Y_col_idx = first_Y_col_idx + Y_local_col_idx

        # load the window over Y in shared memory. Note that work items that
        # are mapped to out-of-bounds values of Y will just write a 0 in
        # the slot
        load_Y_window(
            loading_Y_col_idx,
            Y_local_col_idx,
            Y_local_row_idx,
            Y,
            Y_window,
        )

        # wait until the memory loading is done before continuing
        dpex.barrier(dpex.LOCAL_MEM_FENCE)

        # Now the first work item computes dot products, from reading
        # values in X and Y_window
        accumulate_dot_products(
            is_first_work_item,
            X,
            Y_window,
            is_last_Y_window,
            partial_dot_products
        )

        # wait for the first work item to finish reading into local memory
        # before starting loading the next window
        dpex.barrier(dpex.LOCAL_MEM_FENCE)

        # write the results back to global memory, using, yet again, only
        # the first work item
        write_results(
            is_first_work_item,
            first_Y_col_idx,
            is_last_Y_window,
            partial_dot_products,
            dot_products
        )

        # move the boundaries to the next window over Y
        first_Y_col_idx += 8



@dpex.func
def load_Y_window(
    loading_Y_col_idx,
    Y_local_col_idx,
    Y_local_row_idx,
    Y,
    Y_window,
):
    if (Y_local_row_idx < 10) and (Y_local_col_idx < 10):
        value = Y[Y_local_row_idx, loading_Y_col_idx]
    else:
        value = 0
    Y_window[Y_local_row_idx, Y_local_col_idx] = value


@dpex.func
def accumulate_dot_products(
    is_first_work_item,
    X,
    Y_window,
    is_last_Y_window,
    result,
):
    if is_last_Y_window:
        accumulate_dot_products_last_window(
                is_first_work_item,
                X,
                Y_window,
                result,
            )
    else:
        accumulate_dot_products_first_windows(
                is_first_work_item,
                X,
                Y_window,
                result,
            )


@dpex.func
def accumulate_dot_products_first_windows(
    is_first_work_item,
    X,
    Y_window,
    result,
):
    for row_idx in range(10):

        if is_first_work_item:
            X_value = X[row_idx]
        else:
            X_value = 0

        is_first_row = row_idx == 0

        for window_col_idx in range(8):  # <-- the difference is here
            Y_value = (Y_window[row_idx, window_col_idx])
            increment = Y_value * X_value

            if is_first_row:
                result[window_col_idx] = increment

            else:
                result[window_col_idx] += increment


@dpex.func
def accumulate_dot_products_last_window(
    is_first_work_item,
    X,
    Y_window,
    result,
):
    for row_idx in range(10):

        if is_first_work_item:
            X_value = X[row_idx]
        else:
            X_value = 0

        is_first_row = row_idx == 0

        for window_col_idx in range(2):  # <-- the difference is here
            Y_value = (Y_window[row_idx, window_col_idx])
            increment = Y_value * X_value

            if is_first_row:
                result[window_col_idx] = increment

            else:
                result[window_col_idx] += increment
                # The bug disappears if the previous instruction is replaced
                # with an atomic add (which shouldn't matter, since result is
                # private memory here, so there can't be conflicts)
                # dpex.atomic.add(result, window_col_idx, increment)


# Or, the bug disappears if the accumulation is defined in one single kernel
# func rather than dispatching to nested kernel funcs:

# @dpex.func
# def accumulate_dot_products(
#     global_flattened_idx,
#     X,
#     Y_window,
#     is_last_Y_window,
#     result,
# ):
#     for row_idx in range(10):

#         if global_flattened_idx < 1:
#             X_value = X[row_idx]
#         else:
#             X_value = 0

#         is_first_row = row_idx == 0

#         window_width = 2 if is_last_Y_window else 8

#         for window_col_idx in range(window_width):
#             Y_value = (Y_window[row_idx, window_col_idx])
#             increment = Y_value * X_value

#             if is_first_row:
#                 result[window_col_idx] = increment

#             else:
#                 result[window_col_idx] += increment


@dpex.func
def write_results(
    is_first_work_item,
    first_Y_col_idx,
    is_last_Y_window,
    partial_dot_products,
    dot_products
):
    if is_last_Y_window:
        L = 2
    else:
        L = 8

    for i in range(L):
        if is_first_work_item:
            dot_products[first_Y_col_idx + i] = partial_dot_products[i]


Y = dpt.asarray(
    dpt.reshape(dpt.arange(0, 100, dtype=dtype), (10, 10)).T, copy=True
)
X = dpt.asarray(Y[:, -1], copy=True)
dot_products = dpt.empty((10,), dtype=dtype)
columnwise_broadcast_dot_product[(8, 10), (8, 10)](X, Y, dot_products)
print(dot_products)
# expected: [ 4335. 13785. 23235. 32685. 42135. 51585. 61035. 70485. 79935. 89385.]

The reproducer contains two workarounds that are commented out (i.e. the correct output is printed if either of the two workarounds is uncommented).
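For reference, a minimal host-side sketch of the expected result, assuming NumPy is available (it mirrors the dpt array construction above and is not part of the kernel reproducer itself):

import numpy as np

A = np.arange(0, 100, dtype=np.float32).reshape(10, 10)
Y_ref = A.T                  # same contents as the dpctl Y above
X_ref = Y_ref[:, -1].copy()  # same contents as X
expected = X_ref @ Y_ref     # dot product of X with each column of Y
print(expected)
# [ 4335. 13785. 23235. 32685. 42135. 51585. 61035. 70485. 79935. 89385.]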

fcharras changed the title from "Incorrect compiled code that outputs wrong results on CPU, featuring nested kernel funcs, private and local memories, and barriers" to "This kernel outputs wrong results on CPU, featuring nested kernel funcs, private and local memories, and barriers" on Aug 2, 2023
diptorupd self-assigned this on Aug 8, 2023
diptorupd added the bug (Something isn't working) and user (User submitted issue) labels on Aug 8, 2023
diptorupd (Contributor) commented

@fcharras I can reproduce the issue as well. I am investigating and will update as soon as I have identified a root cause.

fcharras (Author) commented Oct 6, 2023

I've found a simpler workaround: instead of the suggested

            else:
                result[window_col_idx] += increment
                # The bug disappears if the previous instruction is replaced
                # with an atomic add (which shouldn't matter, since result is
                # private memory here, so there can't be conflicts)
                # dpex.atomic.add(result, window_col_idx, increment)

this:

            else:
                result[window_col_idx] += increment
                # The bug disappears if +1 -1 is added to the increment
                # in this instruction
                # result[window_col_idx] += increment + 1 - 1

fixes the kernel.

(Basically, I'm trying to add random neutral instructions that somehow trigger a non-bugged compilation path; here it seems that adding +1 -1 also fixes the issue.)

Good enough for me for the time being, since atomic add on private memory was what caused #1156 for me.

fcharras (Author) commented Oct 6, 2023

The previous trick does fix the reproducer, but it does not fix my real-world use case, so I keep looking.

One thing that does solve both, on main after #1158, is setting NUMBA_DPEX_OPT=0. So it seems to be related to compiler optimizations after all.

NUMBA_DPEX_OPT=1 already gives wrong output (all-NaN results).
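A minimal sketch of applying the NUMBA_DPEX_OPT=0 workaround from inside a script, assuming the variable is read when numba_dpex is imported and kernels are compiled (setting it in the shell before launching Python is the safer option):

import os

# must be set before numba_dpex reads its configuration
os.environ["NUMBA_DPEX_OPT"] = "0"

import numba_dpex as dpex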

diptorupd (Contributor) commented Nov 29, 2023

@fcharras @roxx30198 @Hardcode84

Based on the correct root cause provided by @Hardcode84 in #1204, I have investigated the issue and I think I have a proper fix.

import dpnp
import numba_dpex as dpex


@dpex.kernel
def twice(A, B):
    i = dpex.get_global_id(0)

    if i < 1:
        A[0] = 880
    dpex.barrier(dpex.LOCAL_MEM_FENCE)  # local mem fence
    if i < 0:
        B[0] = 990


arr = dpnp.arange(1024, dtype=dpnp.float32)
arr2 = dpnp.arange(1024, dtype=dpnp.float32)

twice[dpex.Range(1024)](arr, arr2)
print(arr)

Using the above pseudo reproducer, I generated the code using existing numba-dpex main branch. The control flow graph (CFG) for the code is embedded next:

[CFG image: numba-dpex main branch, before the fix]

The original compiled code, before any of my changes, never returns on my test setup: it gets stuck and has to be killed.

After my changes, I recompiled the same function and it generates the following CFG.

[CFG image: numba-dpex after my changes]

The behavior is pretty much what @Hardcode84 explained in #1204. However, I was not able to get it working using the noduplicate attribute. Instead, I looked at the code generated by clang for an essentially similar OpenCL reproducer:

__kernel void test(
    __global float *g_idata,
    __global float *g_odata
)
{
    unsigned int tid = get_local_id(0);

    if(tid < 2)
        g_idata[tid] = 880;

    barrier(CLK_LOCAL_MEM_FENCE);

    if( tid < 1)
        g_odata[tid] = 990;
}

The CFG for the code clang generated using the command clang -O3 -flto -fveclib=none -target spir64-unknown-unknown -c -x cl -S -emit-llvm -cl-std=CL2.0 -Xclang -finclude-default-header test.cl -o test.ll is attached next:

[CFG image: clang-generated OpenCL code]

As can be observed, after my changes the code generated by numba-dpex matches the one generated by clang.

With all that build-up, here is the change I made to our barrier code generation:

barrier.attributes.add("convergent")
callinst = builder.call(barrier, [flags])
callinst.attributes.add("convergent")
callinst.attributes.add("nounwind")

The main thing was to add the convergent attribute to both the function declaration and the call site. The nounwind attribute is not the relevant part here. If you are interested in understanding what the attribute does, refer to https://llvm.org/docs/ConvergentOperations.html

So, now that I have provided (what I think is) the good news, let me come to the bad news: llvmlite does not support the convergent attribute yet. The following two places will need to be patched; we just need to add the string "convergent" to these sets.

https://github.com/numba/llvmlite/blob/da22592b9409b67d2d67330f59b3972a66a99ff9/llvmlite/ir/values.py#L881
https://github.com/numba/llvmlite/blob/da22592b9409b67d2d67330f59b3972a66a99ff9/llvmlite/ir/instructions.py#L55

I am going to open a PR for llvmlite, but for now monkey-patching llvmlite is the only recourse. Here is my patch, if you want to test it out.

--- /tmp/MG5FLU_oclimpl.py
+++ /home/diptorupd/Desktop/devel/numba-dpex/numba_dpex/ocl/oclimpl.py
@@ -105,7 +105,10 @@
     barrier = _declare_function(
         context, builder, "barrier", sig, ["unsigned int"]
     )
-    builder.call(barrier, [flags])
+    barrier.attributes.add("convergent")
+    callinst = builder.call(barrier, [flags])
+    callinst.attributes.add("convergent")
+    callinst.attributes.add("nounwind")
     return _void_value
 
 
@@ -116,8 +119,11 @@
     barrier = _declare_function(
         context, builder, "barrier", sig, ["unsigned int"]
     )
+    barrier.attributes.add("convergent")
     flags = context.get_constant(types.uint32, stubs.GLOBAL_MEM_FENCE)
-    builder.call(barrier, [flags])
+    callinst = builder.call(barrier, [flags])
+    callinst.attributes.add("convergent")
+    callinst.attributes.add("nounwind")
     return _void_value
 
 
@@ -138,8 +144,11 @@
     barrier = _declare_function(
         context, builder, "barrier", sig, ["unsigned int"]
     )
+    barrier.attributes.add("convergent")
     flags = context.get_constant(types.uint32, stubs.LOCAL_MEM_FENCE)
-    builder.call(barrier, [flags])
+    callinst = builder.call(barrier, [flags])
+    callinst.attributes.add("convergent")
+    callinst.attributes.add("nounwind")
     return _void_value
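
For illustration, a minimal standalone llvmlite sketch of the same attribute mechanics, outside of numba-dpex. It assumes an llvmlite build that accepts "convergent" (i.e. one carrying the patch above); an unpatched llvmlite will reject the attribute:

from llvmlite import ir

mod = ir.Module(name="barrier_demo")

# declare: void barrier(i32), marked convergent
barrier_ty = ir.FunctionType(ir.VoidType(), [ir.IntType(32)])
barrier = ir.Function(mod, barrier_ty, name="barrier")
barrier.attributes.add("convergent")

# a caller that invokes the barrier, with convergent/nounwind at the call site
caller = ir.Function(mod, ir.FunctionType(ir.VoidType(), []), name="caller")
builder = ir.IRBuilder(caller.append_basic_block("entry"))
callinst = builder.call(barrier, [ir.Constant(ir.IntType(32), 1)])
callinst.attributes.add("convergent")
callinst.attributes.add("nounwind")
builder.ret_void()

print(mod)  # the printed IR should show 'convergent' on both the declaration and the call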
 
 

diptorupd (Contributor) commented

I am going to open a PR for llvmlite

numba/llvmlite#1016

diptorupd (Contributor) commented

@fcharras I tested your reproducer with my patch and the llvmlite patch; it works as expected on opencl:cpu, opencl:gpu, and level_zero:gpu.

roxx30198 (Contributor) commented

import dpnp

import numba_dpex

@numba_dpex.kernel
def _pathfinder_kernel(prev, deviceWall, cols, cur_row, result):
    current_element = numba_dpex.get_global_id(0)
    left_ind = current_element - 1 if current_element >= 1 else current_element
    up_ind = current_element

    index = cur_row * cols + current_element
    left = prev[left_ind]
    up = prev[up_ind]

    shortest = left if left <= up else up

    numba_dpex.barrier(numba_dpex.GLOBAL_MEM_FENCE)
    prev[current_element] = deviceWall[index] + shortest
    numba_dpex.barrier(numba_dpex.GLOBAL_MEM_FENCE)
    result[current_element] = prev[current_element]


def pathfinder(data, cols, result):
    # split the input: the first row becomes the initial prev buffer,
    # the remaining rows are the wall
    device_dest = dpnp.array(data[:cols], dtype=dpnp.int64)  # first row
    device_wall = dpnp.array(data[cols:], dtype=dpnp.int64)

    _pathfinder_kernel[numba_dpex.Range(cols)](
        device_dest, device_wall, cols, 0, result
    )


data = dpnp.array(
    [3, 0, 7, 5, 6, 5, 4, 2], dtype=dpnp.int64
)

res = dpnp.zeros(shape=(4), dtype=dpnp.int64)

pathfinder(data, 4, res)
print(res)

This still produces an incorrect result with the patch provided.
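
For reference, a serial NumPy sketch of what the kernel should compute for this input, under my reading of the kernel (not taken from the original report); with that reading the expected output is [9 5 4 7]:

import numpy as np

data = np.array([3, 0, 7, 5, 6, 5, 4, 2], dtype=np.int64)
cols = 4
prev = data[:cols]  # [3, 0, 7, 5], the first row
wall = data[cols:]  # [6, 5, 4, 2]; cur_row == 0, so index == current_element

expected = np.empty(cols, dtype=np.int64)
for i in range(cols):
    left = prev[i - 1] if i >= 1 else prev[i]
    up = prev[i]
    expected[i] = wall[i] + min(left, up)
print(expected)  # [9 5 4 7]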

diptorupd (Contributor) commented Nov 30, 2023

This still produces an incorrect result with the patch provided.

Thanks for the reproducer. The issue is happening for the same reason: the second barrier is reachable only inside a branch. However, the root cause that triggers the issue is not the same as above. I am seeing incorrect code being generated even when we turn off all LLVM optimizations (O0). It is perhaps due to some issue in the way Numba lowers the code.

I will investigate and update. If I am positive that it is a different issue, I will move it to a separate ticket.

UPDATE: Well, I had made a small change to your code that actually introduced the obvious barrier codegen issue. If I revert to your original code, then it is no longer quite obvious to me what the issue is. I will keep hunting.

@diptorupd diptorupd added this to the 0.22 milestone Dec 19, 2023
ZzEeKkAa (Contributor) commented Feb 2, 2024

After some investigation on my side, it appears that the issue is related to the fact that group_barrier does not work for ranges and is supported only for nd_ranges. This is consistent with the fact that dpcpp does not provide syntax to use group_barrier with plain ranges; a related question can be found on Stack Overflow.

Stable small reproducer:

import dpnp

import numba_dpex

@numba_dpex.kernel
def kernel(arr, copy, res):
    i = numba_dpex.get_global_id(0)

    copy[i] = arr[i]
    numba_dpex.barrier(numba_dpex.GLOBAL_MEM_FENCE)

    # get shifted data
    res[i] = copy[(i+1)%arr.size]

arr = dpnp.ones(3, dtype=dpnp.int64)
copy = dpnp.zeros_like(arr)
res = dpnp.zeros_like(arr)

numba_dpex.call_kernel(kernel, numba_dpex.Range(arr.size), arr, copy, res)

# Expected:
# [1 1 1], but on "opencl:cpu" it is [0 0 1]
print(res)
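
If this diagnosis is right, the user-side fix is to launch over an nd_range so the barrier operates on a well-defined work-group. A minimal sketch, assuming this numba_dpex version exposes NdRange for call_kernel (choosing the local size equal to the global size keeps all work items in a single group):

numba_dpex.call_kernel(
    kernel,
    numba_dpex.NdRange(numba_dpex.Range(arr.size), numba_dpex.Range(arr.size)),
    arr,
    copy,
    res,
)
print(res)  # should print [1 1 1] once the barrier is well-defined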

diptorupd (Contributor) commented

Closing as fixed by the recent group_barrier fixes. Please reopen if you still encounter the problem.
