
Minimal reproducer for incorrect code generated for multiple barriers followed by multiple conditional ops in a kernel #906

Closed
fcharras opened this issue Feb 11, 2023 · 13 comments

@fcharras

fcharras commented Feb 11, 2023

The following snippet:

import numba_dpex as dpex
import dpctl.tensor as dpt
import numpy as np

dtype = np.float32


@dpex.kernel
def kernel(result):
    local_col_idx = dpex.get_local_id(0)
    local_values = dpex.local.array((1,), dtype=dtype)

    dpex.barrier(dpex.LOCAL_MEM_FENCE)

    if (local_col_idx < 1):
        local_values[0] = 1

    dpex.barrier(dpex.LOCAL_MEM_FENCE)

    if (local_col_idx < 1):
        result[0] = 10

result = dpt.zeros(sh=(1), dtype=dtype)
kernel[32, 32](result)
print(result)

when run on CPU, prints:

[0.]

but it should print:

[10.]

I think it is a simpler instance of #892. The buggy pattern seems to be this particular sequence of instructions: barrier -> conditional op on local memory -> barrier -> conditional op on global memory. You can see that the correct result is printed when one of these steps is altered (see the sketch below).

Also, it does not occur for all group sizes: here [32, 32] triggers the bug while [16, 16] works. (If you can't reproduce the issue, try larger group sizes.)

I can't reproduce it on GPU, but maybe there's a combination of group sizes that could make it fail there too.
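For illustration, here is a sketch (not part of the original report) of one such altered variant: the conditional write to local memory between the barriers is dropped, which breaks the pattern described above. The kernel name is made up; it assumes the same imports and dtype as the reproducer.

# Hypothetical altered variant of the reproducer: the conditional write to
# local memory is removed, so the buggy barrier -> conditional local write ->
# barrier -> conditional global write sequence no longer occurs.
@dpex.kernel
def kernel_without_local_write(result):
    local_col_idx = dpex.get_local_id(0)

    dpex.barrier(dpex.LOCAL_MEM_FENCE)
    # no conditional write to local memory here
    dpex.barrier(dpex.LOCAL_MEM_FENCE)

    if local_col_idx < 1:
        result[0] = 10


result = dpt.zeros((1,), dtype=dtype)
kernel_without_local_write[32, 32](result)
print(result)  # expected: [10.]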

@diptorupd diptorupd changed the title Minimal reproducer for JIT issues Minimal reproducer for incorrect code generated for barriers in kernel Feb 12, 2023
@chudur-budur chudur-budur changed the title Minimal reproducer for incorrect code generated for barriers in kernel Minimal reproducer for incorrect code generated for multiple barriers followed by multiple conditional ops in kernel Feb 13, 2023
@chudur-budur chudur-budur changed the title Minimal reproducer for incorrect code generated for multiple barriers followed by multiple conditional ops in kernel Minimal reproducer for incorrect code generated for multiple barriers followed by multiple conditional ops in a kernel Feb 13, 2023
@mingjie-intel mingjie-intel added the user User submitted issue label Feb 17, 2023
@mingjie-intel
Contributor

mingjie-intel commented Feb 22, 2023

An update:

  1. With the reproducer, 0 is output on a CPU device with dpex.
  2. A dpcpp program was written to reproduce the issue; 10 is reported on both CPU and GPU:
#include <CL/sycl.hpp>

using namespace sycl;

static const int N = 1;

int main() {

  std::array<float, N> host_a;

  for(int i=0; i<N; i++)
    host_a[i]=0;

  queue q(cpu_selector_v);
  //queue q;
  std::cout << "Device : " << q.get_device().get_info<info::device::name>() << "\n";

  float *device_a = malloc_device<float>(N, q);

  range global{32};
  range local{32};
  q.parallel_for(nd_range{global, local}, [=](nd_item<1> idx) {
                                size_t i = idx.get_global_id(0);
                                float local_val = 0.;
                                idx.barrier(sycl::access::fence_space::local_space);
                                if(i<1) local_val=1.;
                                idx.barrier(sycl::access::fence_space::local_space);
                                if(i<1) device_a[i]=10.;  }).wait();

  q.memcpy(&host_a[0], device_a, N*sizeof(float)).wait();

  std::cout << "device_a:" << "\n";

  for (int i = 0; i < N; i++) std::cout << host_a[i] << "\n";

  free(device_a, q);

  return 0;
}
  3. The LLVM IR generated from dpcpp and dpex is attached here: llvm.tar.gz. The instructions are different; more investigation is needed.

@oleksandr-pavlyk
Contributor

@mingjie-intel The C++ reproducer does not use local memory, unlike the numba example.

fcharras added a commit to soda-inria/sklearn-numba-dpex that referenced this issue Feb 27, 2023
fcharras added a commit to soda-inria/sklearn-numba-dpex that referenced this issue Feb 27, 2023
@fcharras
Author

fcharras commented Feb 27, 2023

Adding more fuel to the investigation, and a workaround that seems consistent.

First of all, here is an even more minimal reproducer:

import numba_dpex as dpex
import dpctl.tensor as dpt
import numpy as np

dtype = np.float32


@dpex.kernel
def kernel(result):
    local_col_idx = dpex.get_local_id(0)
    local_values = dpex.local.array((1,), dtype=dtype)

    if (local_col_idx < 1):
        local_values[0] = 1

    dpex.barrier(dpex.LOCAL_MEM_FENCE)

    if (local_col_idx < 1):
        result[0] = 10

result = dpt.zeros(sh=(1), dtype=dtype)
kernel[32, 32](result)
print(result)

It's the same as the initial minimal reproducer, but with the first barrier removed. I don't think it can get any smaller than that. I haven't seen the bug in a kernel without barriers.

What seems to happen is that either some work items (including the first work item) abort right after the barrier, or the second local_col_idx < 1 condition is not evaluated properly. Here's an interesting workaround that works for this kernel (but is not consistent in general):

@dpex.kernel
def kernel(result):
    local_col_idx = dpex.get_local_id(0)
    local_values = dpex.local.array((1,), dtype=dtype)

    if (local_col_idx < 1):
        local_values[0] = 1

    dpex.barrier(dpex.LOCAL_MEM_FENCE)
    local_col_idx = dpex.get_local_id(0)

    if (local_col_idx < 1):
        result[0] = 10

i.e. redefining local_col_idx right after the barrier works.

Another interesting behavior: replacing the 1 in the second condition (local_col_idx < 1) with 2, 3, ..., 16 doesn't change the outcome, but using 17 up to 31 instead works. It's also possible to write the value of local_col_idx to an output array and see that the first 16 values (0 to 15) are never written.
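
For illustration, a sketch (not from the original comment) of that diagnostic: each work item writes its local id into an output array after the barrier, under a condition of the same shape, and the slots that stay at zero reveal which writes are lost. The kernel and array names are made up; the +1 offset is only there so a missing write is distinguishable from a legitimately written 0.

@dpex.kernel
def debug_kernel(out):
    local_col_idx = dpex.get_local_id(0)
    local_values = dpex.local.array((1,), dtype=dtype)

    if local_col_idx < 1:
        local_values[0] = 1

    dpex.barrier(dpex.LOCAL_MEM_FENCE)

    if local_col_idx < 32:
        out[local_col_idx] = local_col_idx + 1


out = dpt.zeros((32,), dtype=dtype)
debug_kernel[32, 32](out)
print(out)  # slots left at 0 correspond to work items whose conditional write is lost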

Now, here's the consistent workaround that seems to have solved the issue in the 4 different kernels where I've witnessed it so far (including the matmul from #892). It consists of moving the instructions that seem to have been mis-compiled into dpex.func device functions:

@dpex.func
def func(condition, idx, value, result):
    if condition:
        result[idx] = value


@dpex.kernel
def kernel(result):
    local_col_idx = dpex.get_local_id(0)
    local_values = dpex.local.array((1,), dtype=dtype)

    if (local_col_idx < 1):
        local_values[0] = 1

    dpex.barrier(dpex.LOCAL_MEM_FENCE)

    func((local_col_idx < 1), 0, 10, result)

The rule seems to be: move into a dpex.func device function every conditional write-to-an-array instruction that follows a barrier, when that barrier is itself preceded by another conditional write-to-an-array instruction (one that is not already inside a dpex.func device function). This also works:

@dpex.kernel
def kernel(result):
    local_col_idx = dpex.get_local_id(0)
    local_values = dpex.local.array((1,), dtype=dtype)

    func((local_col_idx < 1), 0, 1, local_values)

    dpex.barrier(dpex.LOCAL_MEM_FENCE)

    if (local_col_idx < 1):
        result[0] = 10

This trick also solves #892 and all the kernels I've had trouble with (which seems to confirm that it's the same bug everywhere, and that it also affects GPU runtimes). Also, instead of trying to guess where the compiler goes wrong without the dpex.func trick, I've started to move all such instructions to device functions, e.g.:

@dpex.kernel
def kernel(result):
    local_col_idx = dpex.get_local_id(0)
    local_values = dpex.local.array((1,), dtype=dtype)

    func((local_col_idx < 1), 0, 1, local_values)

    dpex.barrier(dpex.LOCAL_MEM_FENCE)

    func((local_col_idx < 1), 0, 10, result)

@fcharras
Author

fcharras commented Feb 27, 2023

Forgot to add that the bug still occurs when replacing the array setitems with atomic adds; fortunately the same trick works (see the sketch below).
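
For reference, a sketch of that atomic variant (the helper name is made up; it assumes numba_dpex's dpex.atomic.add and the same imports and dtype as the snippets above):

@dpex.func
def conditional_atomic_add(condition, idx, value, result):
    if condition:
        dpex.atomic.add(result, idx, value)


@dpex.kernel
def kernel(result):
    local_col_idx = dpex.get_local_id(0)
    local_values = dpex.local.array((1,), dtype=dtype)

    if local_col_idx < 1:
        local_values[0] = 1

    dpex.barrier(dpex.LOCAL_MEM_FENCE)

    # moving the conditional atomic add into a device function works around
    # the miscompilation, the same way it does for the plain setitem
    conditional_atomic_add((local_col_idx < 1), 0, 10, result)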

fcharras added a commit to soda-inria/sklearn-numba-dpex that referenced this issue Feb 28, 2023
- Work around IntelPython/numba-dpex#906 by moving bugged instruction patterns to `dpex.func` functions

- Fix tolerance value

- Implement strict convergence checking

---------

Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Julien Jerphanion <[email protected]>
@fcharras
Author

fcharras commented Jul 19, 2023

Has this been solved? Using numba_dpex==0.21.1 with the latest numba==0.57.1 release, the minimal reproducer now works well in my test pipeline. It's not fixed, though: reminder that the minimal reproducer only reproduces on CPU 😀

@fcharras
Author

fcharras commented Jul 24, 2023

Updating on the reproducer: it no longer fails with 100% certainty in the sklearn_numba_dpex pipeline. The behavior of the reproducer now depends on the history of kernel execution in the process. The reproducer has the correct expected behavior if executed after the process has run other tests in the pipeline, but still outputs wrong results if it is singled out. From there we could try to find a minimal example of a kernel that, when executed first, makes the subsequent kernels work well. Surprisingly, it does not seem cache-related, since the behavior is invariant to passing enable_cache=False to the kernel decorator.
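
A sketch of the isolation experiment described above (the warm-up kernel is made up and only serves to put some prior kernel execution in the process's history before the minimal reproducer runs):

import numba_dpex as dpex
import dpctl.tensor as dpt
import numpy as np

dtype = np.float32


@dpex.kernel
def warmup(out):
    i = dpex.get_global_id(0)
    out[i] = i


out = dpt.zeros((32,), dtype=dtype)
# comment the next line out to run the reproducer "singled out"
warmup[32, 32](out)

# ... then run the minimal reproducer kernel from the Feb 27 comment ...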

@diptorupd
Contributor

@fcharras As you reported in #1152 (comment), I too am unable to reproduce the issue any more. Based on my findings in #892, I think this too might be down to LLVM compiler optimizations at the O3 level that are no longer reproducible with the latest llvmlite.

I am investigating further, and we also have PR #1158 to disallow the O3 optimization level and make O2 the default level of optimization for the kernel LLVM IR module. Note that even dpcpp does not run all LLVM optimizations on kernel modules; further optimizations are pushed to a driver compiler (IGC or NVVM) that has a better understanding of the target device's capabilities.

@fcharras
Author

fcharras commented Oct 5, 2023

> with the latest llvmlite.

Bumps of numba and llvmlite alone are not enough to explain the fix, since the issue is still reproducible with 0.21.0dev1 (the first tag after numba was bumped to >=0.57) using the latest numba==0.58 and llvmlite==0.41. I can try to bisect the recent history to find the commit where it starts to work again; maybe that can tell us more about the cause.

Could you add a unit test with the reproducer?

@fcharras
Author

fcharras commented Oct 6, 2023

As of the latest commit on main, the reproducer no longer shows any issue in any context for me. Looks like it has been fixed as a side effect of something else.

@roxx30198
Contributor

roxx30198 commented Oct 18, 2023

minimal reproducer:

import dpnp
import numba_dpex
@numba_dpex.func(debug=True)
def min_dpex(a, b):
    t = a if a <= b else b
    return t
@numba_dpex.kernel(debug=True)
def _pathfinder_kernel(prev, deviceWall, cols, iteration, cur_row, result):
    current_element = numba_dpex.get_global_id(0)
    left_ind = current_element - 1 if current_element >= 1 else current_element
    right_ind = current_element + 1 if current_element < cols - 1 else cols - 1
    up_ind = current_element
    
    for i in range(iteration):
        numba_dpex.barrier(numba_dpex.LOCAL_MEM_FENCE)
        index = (cur_row + i) * cols + current_element
        left = prev[left_ind]
        up = prev[up_ind]
        right = prev[right_ind]
        
        shortest = min_dpex(left,up)
        shortest = min_dpex(shortest,right)
        numba_dpex.barrier(numba_dpex.LOCAL_MEM_FENCE)
        prev[current_element] = deviceWall[index] + shortest
        if i == iteration - 1:
            break
    numba_dpex.barrier(numba_dpex.LOCAL_MEM_FENCE)
    result[current_element] = prev[current_element]
def pathfinder(data, rows, cols, pyramid_height, result):
    # create a temp list that hold first row of data as first element and empty numpy array as second element
    device_dest = dpnp.array(data[:cols], dtype=dpnp.int64)  # first row
    device_wall = dpnp.array(data[cols:], dtype=dpnp.int64)
    t = 1

    while t < rows:
        iteration = min(pyramid_height, rows - t)

        _pathfinder_kernel[numba_dpex.Range(cols)](
            device_dest, device_wall, cols, iteration, t - 1, result
        )
        device_dest = dpnp.array(result, dpnp.int64)
        t += pyramid_height

In this case, due to the conditional statements inside min_dpex, the values of the min_dpex operands are reset to 0 even before the expression is evaluated. It works fine if there is a normal assignment operation before the barrier instead of conditional operations (see the sketch below). Also, only the values of the if-else operands are reset to 0; the value of index remains what it is supposed to be.
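
For illustration, a sketch (not from the original comment) of the loop-body shape that was observed to compile correctly: the conditional min_dpex calls are replaced by a plain assignment before the second barrier. The kernel name is made up; this changes the computation and is only meant to show the code shape.

@numba_dpex.kernel(debug=True)
def _pathfinder_kernel_plain_assignment(prev, deviceWall, cols, iteration, cur_row, result):
    current_element = numba_dpex.get_global_id(0)
    up_ind = current_element

    for i in range(iteration):
        numba_dpex.barrier(numba_dpex.LOCAL_MEM_FENCE)
        index = (cur_row + i) * cols + current_element
        # plain (unconditional) assignment instead of the conditional min_dpex calls
        shortest = prev[up_ind]
        numba_dpex.barrier(numba_dpex.LOCAL_MEM_FENCE)
        prev[current_element] = deviceWall[index] + shortest
    numba_dpex.barrier(numba_dpex.LOCAL_MEM_FENCE)
    result[current_element] = prev[current_element]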

@diptorupd diptorupd added this to the 0.22 milestone Dec 20, 2023
@diptorupd
Contributor

@fcharras In line with #892 (comment), the modified reproducer works without issues on all supported devices with NUMBA_DPEX_OPT=3 INLINE_THRESHOLD=3. Again, for now I used #1331 to retest.

import numba_dpex.experimental as dpex_exp
import dpctl.tensor as dpt
import numpy as np
from numba_dpex import kernel_api as kapi

dtype = np.float32


@dpex_exp.kernel
def kernel(nditem: kapi.NdItem, slm, result):
    local_col_idx = nditem.get_local_id(0)
    gr = nditem.get_group()
    kapi.group_barrier(gr)

    if local_col_idx < 1:
        slm[0] = 1
    kapi.group_barrier(gr)
    if local_col_idx < 1:
        result[0] = 10

result = dpt.zeros(shape=(1), dtype=dtype)
slm = kapi.LocalAccessor((1,), dtype=dtype)
dpex_exp.call_kernel(kernel, kapi.NdRange((32,), (32,)), slm, result)
print(result)

@diptorupd
Contributor

@roxx30198 Can you update your reproducer with the initialization for the kernels? I have updated your reproducer as well with the latest API. Also, one important thing to note is that group_barrier can only be used with an NdRange kernel, as specified by SYCL.

import dpnp
import numba_dpex.experimental as dpex_exp
from numba_dpex import kernel_api as kapi


@dpex_exp.device_func
def min_dpex(a, b):
    t = a if a <= b else b
    return t


@dpex_exp.kernel
def _pathfinder_kernel(
    nditem: kapi.NdItem, prev, deviceWall, cols, iteration, cur_row, result
):
    current_element = nditem.get_global_id(0)
    left_ind = current_element - 1 if current_element >= 1 else current_element
    right_ind = current_element + 1 if current_element < cols - 1 else cols - 1
    up_ind = current_element

    gr = nditem.get_group()

    for i in range(iteration):
        kapi.group_barrier(gr)
        index = (cur_row + i) * cols + current_element
        left = prev[left_ind]
        up = prev[up_ind]
        right = prev[right_ind]

        shortest = min_dpex(left, up)
        shortest = min_dpex(shortest, right)
        kapi.group_barrier(gr)
        prev[current_element] = deviceWall[index] + shortest
        if i == iteration - 1:
            break
    kapi.group_barrier(gr)
    result[current_element] = prev[current_element]


def pathfinder(data, rows, cols, pyramid_height, result):
    # create a temp list that hold first row of data as first element and
    # empty numpy array as second element
    device_dest = dpnp.array(data[:cols], dtype=dpnp.int64)  # first row
    device_wall = dpnp.array(data[cols:], dtype=dpnp.int64)
    t = 1

    while t < rows:
        iteration = min(pyramid_height, rows - t)

        dpex_exp.call_kernel(
            _pathfinder_kernel,
            kapi.NdRange((cols,), (cols,)),
            device_dest,
            device_wall,
            cols,
            iteration,
            t - 1,
            result,
        )
        device_dest = dpnp.array(result, dpnp.int64)
        t += pyramid_height

@diptorupd
Contributor

Closing as the main issue with respect to group_barrier and indexing code generation has been fixed. Please reopen if the issue happens again.
