
Minimal reproducer for incorrect code generated for multiple barriers followed by multiple conditional ops in a kernel #906

Closed
fcharras opened this issue Feb 11, 2023 · 13 comments

@fcharras

fcharras commented Feb 11, 2023

The following snippet:

import numba_dpex as dpex
import dpctl.tensor as dpt
import numpy as np

dtype = np.float32


@dpex.kernel
def kernel(result):
    local_col_idx = dpex.get_local_id(0)
    local_values = dpex.local.array((1,), dtype=dtype)

    dpex.barrier(dpex.LOCAL_MEM_FENCE)

    if (local_col_idx < 1):
        local_values[0] = 1

    dpex.barrier(dpex.LOCAL_MEM_FENCE)

    if (local_col_idx < 1):
        result[0] = 10

result = dpt.zeros(sh=(1), dtype=dtype)
kernel[32, 32](result)
print(result)

when run on CPU, prints:

[0.]

but it should print:

[10.]

I think it is a simpler instance of #892. The buggy pattern seems to be this particular sequence of instructions: barrier -> conditional op on local memory -> barrier -> conditional op on global memory. You can see that the correct result is printed when one of these steps is altered (see the sketch below).

Also, it does not occur for all group sizes: here [32, 32] triggers the bug while [16, 16] works. (If you can't reproduce the issue, try larger group sizes.)

I can't reproduce it on GPU, but maybe there's a combination of group sizes that could make it fail there too.
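For illustration, here is a sketch (not part of the original report) of one such altered variant: the conditional write to local memory between the barriers is dropped, which breaks the pattern described above. The kernel name is made up; it assumes the same imports and dtype as the reproducer.

# Hypothetical altered variant of the reproducer: the conditional write to
# local memory is removed, so the buggy barrier -> conditional local write ->
# barrier -> conditional global write sequence no longer occurs.
@dpex.kernel
def kernel_without_local_write(result):
    local_col_idx = dpex.get_local_id(0)

    dpex.barrier(dpex.LOCAL_MEM_FENCE)
    # no conditional write to local memory here
    dpex.barrier(dpex.LOCAL_MEM_FENCE)

    if local_col_idx < 1:
        result[0] = 10


result = dpt.zeros((1,), dtype=dtype)
kernel_without_local_write[32, 32](result)
print(result)  # expected: [10.]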

@diptorupd diptorupd changed the title Minimal reproducer for JIT issues Minimal reproducer for incorrect code generated for barriers in kernel Feb 12, 2023
@chudur-budur chudur-budur changed the title Minimal reproducer for incorrect code generated for barriers in kernel Minimal reproducer for incorrect code generated for multiple barriers followed by multiple conditional ops in kernel Feb 13, 2023
@chudur-budur chudur-budur changed the title Minimal reproducer for incorrect code generated for multiple barriers followed by multiple conditional ops in kernel Minimal reproducer for incorrect code generated for multiple barriers followed by multiple conditional ops in a kernel Feb 13, 2023
@mingjie-intel mingjie-intel added the user User submitted issue label Feb 17, 2023
@mingjie-intel
Contributor

mingjie-intel commented Feb 22, 2023

An update:

  1. With the reproducer, 0 is output on a CPU device with dpex.
  2. A dpcpp program was written to reproduce the issue; 10 is reported on both CPU and GPU:
#include <CL/sycl.hpp>

using namespace sycl;

static const int N = 1;

int main() {

  std::array<float, N> host_a;

  for(int i=0; i<N; i++)
    host_a[i]=0;

  queue q(cpu_selector_v);
  //queue q;
  std::cout << "Device : " << q.get_device().get_info<info::device::name>() << "\n";

  float *device_a = malloc_device<float>(N, q);

  range global{32};
  range local{32};
  q.parallel_for(nd_range{global, local}, [=](nd_item<1> idx) {
                                size_t i = idx.get_global_id(0);
                                float local_val = 0.;
                                idx.barrier(sycl::access::fence_space::local_space);
                                if(i<1) local_val=1.;
                                idx.barrier(sycl::access::fence_space::local_space);
                                if(i<1) device_a[i]=10.;  }).wait();

  q.memcpy(&host_a[0], device_a, N*sizeof(float)).wait();

  std::cout << "device_a:" << "\n";

  for (int i = 0; i < N; i++) std::cout << host_a[i] << "\n";

  free(device_a, q);

  return 0;
}
  3. The LLVM IR generated from dpcpp and dpex is attached here: llvm.tar.gz. The instructions are different; more investigation is needed.

@oleksandr-pavlyk
Contributor

@mingjie-intel The C++ reproducer does not use local memory, unlike the numba example.

fcharras added a commit to soda-inria/sklearn-numba-dpex that referenced this issue Feb 27, 2023
fcharras added a commit to soda-inria/sklearn-numba-dpex that referenced this issue Feb 27, 2023
@fcharras
Author

fcharras commented Feb 27, 2023

Adding more fuel to the investigation, and a workaround that seems consistent.

First of all, here is an even more minimal reproducer:

import numba_dpex as dpex
import dpctl.tensor as dpt
import numpy as np

dtype = np.float32


@dpex.kernel
def kernel(result):
    local_col_idx = dpex.get_local_id(0)
    local_values = dpex.local.array((1,), dtype=dtype)

    if (local_col_idx < 1):
        local_values[0] = 1

    dpex.barrier(dpex.LOCAL_MEM_FENCE)

    if (local_col_idx < 1):
        result[0] = 10

result = dpt.zeros(sh=(1), dtype=dtype)
kernel[32, 32](result)
print(result)

It's the same as the initial minimal reproducer, but with the first barrier removed. I don't think it can get any smaller than that. I haven't seen the bug in a kernel without barriers.

What seems to happen is that either some work items (including the first work item) abort right after the barrier, or the second local_col_idx < 1 condition is not evaluated properly. Here's an interesting workaround that works for this kernel (but is not consistent in general):

@dpex.kernel
def kernel(result):
    local_col_idx = dpex.get_local_id(0)
    local_values = dpex.local.array((1,), dtype=dtype)

    if (local_col_idx < 1):
        local_values[0] = 1

    dpex.barrier(dpex.LOCAL_MEM_FENCE)
    local_col_idx = dpex.get_local_id(0)

    if (local_col_idx < 1):
        result[0] = 10

i.e. redefining local_col_idx right after the barrier works.

Another interesting behavior: replacing the 1 in the second condition (local_col_idx < 1) with 2, 3, ..., 16 doesn't change the outcome, but using 17 up to 31 instead works. It's also possible to write the value of local_col_idx to an output array and see that the first 16 values (0 to 15) are never written.
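
For illustration, a sketch (not from the original comment) of that diagnostic: each work item writes its local id into an output array after the barrier, under a condition of the same shape, and the slots that stay at zero reveal which writes are lost. The kernel and array names are made up; the +1 offset is only there so a missing write is distinguishable from a legitimately written 0.

@dpex.kernel
def debug_kernel(out):
    local_col_idx = dpex.get_local_id(0)
    local_values = dpex.local.array((1,), dtype=dtype)

    if local_col_idx < 1:
        local_values[0] = 1

    dpex.barrier(dpex.LOCAL_MEM_FENCE)

    if local_col_idx < 32:
        out[local_col_idx] = local_col_idx + 1


out = dpt.zeros((32,), dtype=dtype)
debug_kernel[32, 32](out)
print(out)  # slots left at 0 correspond to work items whose conditional write is lost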

Now, here's the consistent workaround that seems to have solved the issue in the 4 different kernels where I've witnessed it so far (including the matmul from #892). It consists of moving the instructions that seem to have been mis-compiled into dpex.func device functions:

@dpex.func
def func(condition, idx, value, result):
    if condition:
        result[idx] = value


@dpex.kernel
def kernel(result):
    local_col_idx = dpex.get_local_id(0)
    local_values = dpex.local.array((1,), dtype=dtype)

    if (local_col_idx < 1):
        local_values[0] = 1

    dpex.barrier(dpex.LOCAL_MEM_FENCE)

    func((local_col_idx < 1), 0, 10, result)

The rule seems to be: move into a dpex.func device function every conditional write-to-an-array instruction that follows a barrier, when that barrier is itself preceded by another conditional write-to-an-array instruction (one that is not already inside a dpex.func device function). This also works:

@dpex.kernel
def kernel(result):
    local_col_idx = dpex.get_local_id(0)
    local_values = dpex.local.array((1,), dtype=dtype)

    func((local_col_idx < 1), 0, 1, local_values)

    dpex.barrier(dpex.LOCAL_MEM_FENCE)

    if (local_col_idx < 1):
        result[0] = 10

This trick also solves #892 and all the kernels I've had trouble with (which seems to confirm that it's the same bug everywhere, and that it also affects GPU runtimes). Also, instead of trying to guess where the compiler goes wrong without the dpex.func trick, I've started to move all such instructions to device functions, e.g.:

@dpex.kernel
def kernel(result):
    local_col_idx = dpex.get_local_id(0)
    local_values = dpex.local.array((1,), dtype=dtype)

    func((local_col_idx < 1), 0, 1, local_values)

    dpex.barrier(dpex.LOCAL_MEM_FENCE)

    func((local_col_idx < 1), 0, 10, result)

@fcharras
Author

fcharras commented Feb 27, 2023

Forgot to add that the bug still occurs when replacing the array setitems with atomic adds; fortunately the same trick works (see the sketch below).
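
For reference, a sketch of that atomic variant (the helper name is made up; it assumes numba_dpex's dpex.atomic.add and the same imports and dtype as the snippets above):

@dpex.func
def conditional_atomic_add(condition, idx, value, result):
    if condition:
        dpex.atomic.add(result, idx, value)


@dpex.kernel
def kernel(result):
    local_col_idx = dpex.get_local_id(0)
    local_values = dpex.local.array((1,), dtype=dtype)

    if local_col_idx < 1:
        local_values[0] = 1

    dpex.barrier(dpex.LOCAL_MEM_FENCE)

    # moving the conditional atomic add into a device function works around
    # the miscompilation, the same way it does for the plain setitem
    conditional_atomic_add((local_col_idx < 1), 0, 10, result)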

fcharras added a commit to soda-inria/sklearn-numba-dpex that referenced this issue Feb 28, 2023
- Work around IntelPython/numba-dpex#906 by moving bugged instruction patterns to `dpex.func` functions

- Fix tolerance value

- Implement strict convergence checking

---------

Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Julien Jerphanion <[email protected]>
@fcharras
Author

fcharras commented Jul 19, 2023

Has this been solved? Using numba_dpex==0.21.1 with the latest numba==0.57.1 release, the minimal reproducer now works well in my test pipeline. It's not fixed, though: reminder that the minimal reproducer only reproduces on CPU 😀

@fcharras
Author

fcharras commented Jul 24, 2023

Updating on the reproducer: it no longer fails with 100% certainty in the sklearn_numba_dpex pipeline. The behavior of the reproducer now depends on the history of kernel execution in the process. The reproducer has the correct expected behavior if executed after the process has run other tests in the pipeline, but still outputs wrong results if it is singled out. From there we could try to find a minimal example of a kernel that, when executed first, makes the subsequent kernels work well. Surprisingly, it does not seem cache-related, since the behavior is invariant to passing enable_cache=False to the kernel decorator.
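
A sketch of the isolation experiment described above (the warm-up kernel is made up and only serves to put some prior kernel execution in the process's history before the minimal reproducer runs):

import numba_dpex as dpex
import dpctl.tensor as dpt
import numpy as np

dtype = np.float32


@dpex.kernel
def warmup(out):
    i = dpex.get_global_id(0)
    out[i] = i


out = dpt.zeros((32,), dtype=dtype)
# comment the next line out to run the reproducer "singled out"
warmup[32, 32](out)

# ... then run the minimal reproducer kernel from the Feb 27 comment ...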

@diptorupd
Contributor

@fcharras As you reported in #1152 (comment), I too am unable to reproduce the issue any more. Based on my findings in #892, I think this too might be down to LLVM compiler optimizations at the O3 level that are no longer reproducible with the latest llvmlite.

I am investigating further, and we also have PR #1158 to disallow the O3 optimization level and make O2 the default level of optimization for the kernel LLVM IR module. Note that even dpcpp does not run all LLVM optimizations on kernel modules; further optimizations are pushed to a driver compiler (IGC or NVVM) that has a better understanding of the target device's capabilities.

@fcharras
Author

fcharras commented Oct 5, 2023

> with the latest llvmlite.

Bumps of numba and llvmlite alone are not enough to explain the fix, since the issue is still reproducible with 0.21.0dev1 (the first tag after numba was bumped to >=0.57) using the latest numba==0.58 and llvmlite==0.41. I can try to bisect the recent history to find the commit where it starts to work again; maybe that can tell us more about the cause.

Could you add a unit test with the reproducer?

@fcharras
Author

fcharras commented Oct 6, 2023

As of the latest commit on main, the reproducer no longer shows any issue in any context for me. Looks like it has been fixed as a side effect of something else.

@roxx30198
Contributor

roxx30198 commented Oct 18, 2023

minimal reproducer:

import dpnp
import numba_dpex
@numba_dpex.func(debug=True)
def min_dpex(a, b):
    t = a if a <= b else b
    return t
@numba_dpex.kernel(debug=True)
def _pathfinder_kernel(prev, deviceWall, cols, iteration, cur_row, result):
    current_element = numba_dpex.get_global_id(0)
    left_ind = current_element - 1 if current_element >= 1 else current_element
    right_ind = current_element + 1 if current_element < cols - 1 else cols - 1
    up_ind = current_element
    
    for i in range(iteration):
        numba_dpex.barrier(numba_dpex.LOCAL_MEM_FENCE)
        index = (cur_row + i) * cols + current_element
        left = prev[left_ind]
        up = prev[up_ind]
        right = prev[right_ind]
        
        shortest = min_dpex(left,up)
        shortest = min_dpex(shortest,right)
        numba_dpex.barrier(numba_dpex.LOCAL_MEM_FENCE)
        prev[current_element] = deviceWall[index] + shortest
        if i == iteration - 1:
            break
    numba_dpex.barrier(numba_dpex.LOCAL_MEM_FENCE)
    result[current_element] = prev[current_element]
def pathfinder(data, rows, cols, pyramid_height, result):
    # create a temp list that hold first row of data as first element and empty numpy array as second element
    device_dest = dpnp.array(data[:cols], dtype=dpnp.int64)  # first row
    device_wall = dpnp.array(data[cols:], dtype=dpnp.int64)
    t = 1

    while t < rows:
        iteration = min(pyramid_height, rows - t)

        _pathfinder_kernel[numba_dpex.Range(cols)](
            device_dest, device_wall, cols, iteration, t - 1, result
        )
        device_dest = dpnp.array(result, dpnp.int64)
        t += pyramid_height

In this case, due to the conditional statements inside min_dpex, the values of the min_dpex operands are reset to 0 even before the expression is evaluated. It works fine if there is a normal assignment operation before the barrier instead of conditional operations (see the sketch below). Also, only the values of the if-else operands are reset to 0; the value of index remains what it is supposed to be.
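
For illustration, a sketch (not from the original comment) of the loop-body shape that was observed to compile correctly: the conditional min_dpex calls are replaced by a plain assignment before the second barrier. The kernel name is made up; this changes the computation and is only meant to show the code shape.

@numba_dpex.kernel(debug=True)
def _pathfinder_kernel_plain_assignment(prev, deviceWall, cols, iteration, cur_row, result):
    current_element = numba_dpex.get_global_id(0)
    up_ind = current_element

    for i in range(iteration):
        numba_dpex.barrier(numba_dpex.LOCAL_MEM_FENCE)
        index = (cur_row + i) * cols + current_element
        # plain (unconditional) assignment instead of the conditional min_dpex calls
        shortest = prev[up_ind]
        numba_dpex.barrier(numba_dpex.LOCAL_MEM_FENCE)
        prev[current_element] = deviceWall[index] + shortest
    numba_dpex.barrier(numba_dpex.LOCAL_MEM_FENCE)
    result[current_element] = prev[current_element]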

@diptorupd diptorupd added this to the 0.22 milestone Dec 20, 2023
@diptorupd
Contributor

@fcharras In line with #892 (comment), the modified reproducer works without issues on all supported devices with NUMBA_DPEX_OPT=3 INLINE_THRESHOLD=3. Again, for now I used #1331 to retest.

import numba_dpex.experimental as dpex_exp
import dpctl.tensor as dpt
import numpy as np
from numba_dpex import kernel_api as kapi

dtype = np.float32


@dpex_exp.kernel
def kernel(nditem: kapi.NdItem, slm, result):
    local_col_idx = nditem.get_local_id(0)
    gr = nditem.get_group()
    kapi.group_barrier(gr)

    if local_col_idx < 1:
        slm[0] = 1
    kapi.group_barrier(gr)
    if local_col_idx < 1:
        result[0] = 10

result = dpt.zeros(shape=(1), dtype=dtype)
slm = kapi.LocalAccessor((1,), dtype=dtype)
dpex_exp.call_kernel(kernel, kapi.NdRange((32,), (32,)), slm, result)
print(result)

@diptorupd
Contributor

@roxx30198 Can you update your reproducer with the initialization for the kernels? I have updated your reproducer as well with the latest API. Also, one important thing to note is that group_barrier can only be used with an NdRange kernel, as specified by SYCL.

import dpnp
import numba_dpex.experimental as dpex_exp
from numba_dpex import kernel_api as kapi


@dpex_exp.device_func
def min_dpex(a, b):
    t = a if a <= b else b
    return t


@dpex_exp.kernel
def _pathfinder_kernel(
    nditem: kapi.NdItem, prev, deviceWall, cols, iteration, cur_row, result
):
    current_element = nditem.get_global_id(0)
    left_ind = current_element - 1 if current_element >= 1 else current_element
    right_ind = current_element + 1 if current_element < cols - 1 else cols - 1
    up_ind = current_element

    gr = nditem.get_group()

    for i in range(iteration):
        kapi.group_barrier(gr)
        index = (cur_row + i) * cols + current_element
        left = prev[left_ind]
        up = prev[up_ind]
        right = prev[right_ind]

        shortest = min_dpex(left, up)
        shortest = min_dpex(shortest, right)
        kapi.group_barrier(gr)
        prev[current_element] = deviceWall[index] + shortest
        if i == iteration - 1:
            break
    kapi.group_barrier(gr)
    result[current_element] = prev[current_element]


def pathfinder(data, rows, cols, pyramid_height, result):
    # create a temp list that hold first row of data as first element and
    # empty numpy array as second element
    device_dest = dpnp.array(data[:cols], dtype=dpnp.int64)  # first row
    device_wall = dpnp.array(data[cols:], dtype=dpnp.int64)
    t = 1

    while t < rows:
        iteration = min(pyramid_height, rows - t)

        dpex_exp.call_kernel(
            _pathfinder_kernel,
            kapi.NdRange((cols,), (cols,)),
            device_dest,
            device_wall,
            cols,
            iteration,
            t - 1,
            result,
        )
        device_dest = dpnp.array(result, dpnp.int64)
        t += pyramid_height

@diptorupd
Contributor

Closing as the main issue with respect to group_barrier and indexing code generation has been fixed. Please reopen if the issue happens again.
