Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Numba-dpex implementation of Rambo fails validation #978

Closed
adarshyoga opened this issue Mar 21, 2023 · 8 comments · Fixed by #1023
Closed

Numba-dpex implementation of Rambo fails validation #978

adarshyoga opened this issue Mar 21, 2023 · 8 comments · Fixed by #1023

Comments

@adarshyoga
Copy link
Contributor

Rambo workload's numba-dpex implementation with dpnp calls fails validation. In my initial analysis, I see that the results produced are different to what the numpy/python implementation produces. The numba-dpex prange implementation passes validation. So this problem seems to be specific to dpnp calls.

How to reproduce:

Follow instructions to setup dpbench
Run Rambo - python -c "import dpbench; dpbench.run_benchmark("rambo")

@diptorupd
Copy link
Contributor

Reproducer:

import dpnp
import numba as nb
from numba_dpex import dpjit
import numpy

dpnp.pi = numpy.pi

def rambo_ref(nevts, nout, C1, F1, Q1, output):
    C = 2.0 * C1 - 1.0
    S = numpy.sqrt(1 - numpy.square(C))
    F = 2.0 * numpy.pi * F1
    Q = -numpy.log(Q1)

    output[:, :, 0] = Q
    output[:, :, 1] = Q * S * numpy.sin(F)
    output[:, :, 2] = Q * S * numpy.cos(F)
    output[:, :, 3] = Q * C

@dpjit
def rambo_dpjit(nevts, nout, C1, F1, Q1, output):
    C = 2.0 * C1 - 1.0
    S = dpnp.sqrt(1 - dpnp.square(C))
    F = 2.0 * dpnp.pi * F1
    Q = -dpnp.log(Q1)

    output[:, :, 0] = Q
    output[:, :, 1] = Q * S * dpnp.sin(F)
    output[:, :, 2] = Q * S * dpnp.cos(F)
    output[:, :, 3] = Q * C

@nb.njit(parallel=True)
def rambo_njit(nevts, nout, C1, F1, Q1, output):
    C = 2.0 * C1 - 1.0
    S = numpy.sqrt(1 - numpy.square(C))
    F = 2.0 * numpy.pi * F1
    Q = -numpy.log(Q1)

    output[:, :, 0] = Q
    output[:, :, 1] = Q * S * numpy.sin(F)
    output[:, :, 2] = Q * S * numpy.cos(F)
    output[:, :, 3] = Q * C



def initialize(nevts, nout):

    C1 = numpy.empty((nevts, nout))
    F1 = numpy.empty((nevts, nout))
    Q1 = numpy.empty((nevts, nout))

    numpy.random.seed(777)
    for i in range(nevts):
        for j in range(nout):
            C1[i, j] = numpy.random.rand()
            F1[i, j] = numpy.random.rand()
            Q1[i, j] = numpy.random.rand() * numpy.random.rand()

    return (
        C1,
        F1,
        Q1,
        numpy.empty((nevts, nout, 4)),
    )

def test_rambo(nevts=32768, nout=4):
    C1,F1,Q1, output = initialize(nevts, nout)
    output_n = numpy.copy(output)
    # Copy C1, F1,Q1 to dpnp
    C1d = dpnp.asarray(C1)
    F1d = dpnp.asarray(F1)
    Q1d = dpnp.asarray(Q1)
    output_d = dpnp.asarray(output)

    rambo_ref(nevts, nout, C1, F1, Q1, output)
    rambo_njit(nevts, nout, C1, F1, Q1, output_n)
    rambo_dpjit(nevts, nout, C1d, F1d, Q1d, output_d)

    return output, output_n, output_d

output, output_n, output_d = test_rambo()

print(numpy.allclose(output[:, :, 0], output_d[:,:,0]))
print(numpy.allclose(output[:, :, 1], output_d[:,:,1]))
print(numpy.allclose(output[:, :, 2], output_d[:,:,2]))
print(numpy.allclose(output[:, :, 3], output_d[:,:,3]))

@diptorupd
Copy link
Contributor

diptorupd commented Apr 19, 2023

The following is an abridged version of the Numba IR generated for the dpjit implementation:

label 0:
    _40binary__multiply_17 = arg(0, name=_40binary__multiply_17) ['_40binary__multiply_17']
    C1 = arg(1, name=C1)                     ['C1']
    F1 = arg(2, name=F1)                     ['F1']
    Q1 = arg(3, name=Q1)                     ['Q1']
    _subarr_48 = arg(4, name=_subarr_48)     ['_subarr_48']
    _subarr_53 = arg(5, name=_subarr_53)     ['_subarr_53']
    _subarr_58 = arg(6, name=_subarr_58)     ['_subarr_58']
    _subarr_63 = arg(7, name=_subarr_63)     ['_subarr_63']
    $2load_global.0.730 = global(dpex: <module 'numba_dpex' from '/localdisk/work/diptorup/devel/numba-dpex/numba_dpex/__init__.py'>) ['$2load_global.0.730']
    $4load_method.1.731 = getattr(value=$2load_global.0.730, attr=get_global_id) ['$2load_global.0.730', '$4load_method.1.731']
    $const6.2.732 = const(int, 0)            ['$const6.2.732']
    parfor__index_68 = call $4load_method.1.731($const6.2.732, func=$4load_method.1.731, args=[Var($const6.2.732, <string>:2)], kws=(), vararg=None, varkwarg=None, target=None) ['$4load_method.1.731', '$const6.2.732', 'parfor__index_68']
    $12load_global.4.733 = global(dpex: <module 'numba_dpex' from '/localdisk/work/diptorup/devel/numba-dpex/numba_dpex/__init__.py'>) ['$12load_global.4.733']
    $14load_method.5.734 = getattr(value=$12load_global.4.733, attr=get_global_id) ['$12load_global.4.733', '$14load_method.5.734']
    $const16.6.735 = const(int, 1)           ['$const16.6.735']
    parfor__index_69 = call $14load_method.5.734($const16.6.735, func=$14load_method.5.734, args=[Var($const16.6.735, <string>:3)], kws=(), vararg=None, varkwarg=None, target=None) ['$14load_method.5.734', '$const16.6.735', 'parfor__index_69']
    jump 6                                   []
label 6:
    $parfor_index_tuple_var.78 = build_tuple(items=[Var(parfor__index_68, driver.py:21), Var(parfor__index_69, driver.py:21)]) ['$parfor_index_tuple_var.78', 'parfor__index_68', 'parfor__index_69']
    ...
    _subarr_48[$parfor_index_tuple_var.78] = $expr_out_var.123 ['$expr_out_var.123', '$parfor_index_tuple_var.78', '_subarr_48']
    $arg_out_var.139 = $expr_out_var.123 * $expr_out_var.92 ['$arg_out_var.139', '$expr_out_var.123', '$expr_out_var.92']
    $push_global_to_block.724 = global(dpnp: <module 'dpnp' from '/localdisk/work/diptorup/.conda/envs/dpex-devel/lib/python3.10/site-packages/dpnp-0.11.1-py3.10-linux-x86_64.egg/dpnp/__init__.py'>) ['$push_global_to_block.724']
    $90load_method.41.144 = getattr(value=$push_global_to_block.724, attr=sin) ['$90load_method.41.144', '$push_global_to_block.724']
    $arg_out_var.142 = call $90load_method.41.144($expr_out_var.110, func=$90load_method.41.144, args=[Var($expr_out_var.110, driver.py:23)], kws=(), vararg=None, varkwarg=None, target=None) ['$90load_method.41.144', '$arg_out_var.142', '$expr_out_var.110']
    $expr_out_var.137 = $arg_out_var.139 * $arg_out_var.142 ['$arg_out_var.139', '$arg_out_var.142', '$expr_out_var.137']
    _subarr_53[$parfor_index_tuple_var.78] = $expr_out_var.137 ['$expr_out_var.137', '$parfor_index_tuple_var.78', '_subarr_53']
    $arg_out_var.156 = $expr_out_var.123 * $expr_out_var.92 ['$arg_out_var.156', '$expr_out_var.123', '$expr_out_var.92']
    $push_global_to_block.725 = global(dpnp: <module 'dpnp' from '/localdisk/work/diptorup/.conda/envs/dpex-devel/lib/python3.10/site-packages/dpnp-0.11.1-py3.10-linux-x86_64.egg/dpnp/__init__.py'>) ['$push_global_to_block.725']
    $126load_method.60.161 = getattr(value=$push_global_to_block.725, attr=cos) ['$126load_method.60.161', '$push_global_to_block.725']
    $arg_out_var.159 = call $126load_method.60.161($expr_out_var.110, func=$126load_method.60.161, args=[Var($expr_out_var.110, driver.py:23)], kws=(), vararg=None, varkwarg=None, target=None) ['$126load_method.60.161', '$arg_out_var.159', '$expr_out_var.110']
    $expr_out_var.154 = $arg_out_var.156 * $arg_out_var.159 ['$arg_out_var.156', '$arg_out_var.159', '$expr_out_var.154']
    _subarr_58[$parfor_index_tuple_var.78] = $expr_out_var.154 ['$expr_out_var.154', '$parfor_index_tuple_var.78', '_subarr_58']
    $expr_out_var.171 = $expr_out_var.123 * $expr_out_var.77 ['$expr_out_var.123', '$expr_out_var.171', '$expr_out_var.77']
    _subarr_63[$parfor_index_tuple_var.78] = $expr_out_var.171 ['$expr_out_var.171', '$parfor_index_tuple_var.78', '_subarr_63']
    jump 7                                   []
label 7:
    $const26.9.736 = const(NoneType, None)   ['$const26.9.736']
    $28return_value.10.737 = cast(value=$const26.9.736) ['$28return_value.10.737', '$const26.9.736']
    return $28return_value.10.737            ['$28return_value.10.737']

The issue stems from incorrect index computation for the output array. The index is computed using just i (get_global_id(0)) and j (get_global_id(1)), disregarding the third index variable. Refer the following instruction:

 $parfor_index_tuple_var.78 = build_tuple(items=[Var(parfor__index_68, driver.py:21), Var(parfor__index_69, driver.py:21)]) ['$parfor_index_tuple_var.78', 'parfor__index_68', 'parfor__index_69']

@diptorupd
Copy link
Contributor

On further analysis, for a parfor the slicing operation of output is hoisted out and each slice is passed as a separate input argument to the parfor. For that reason, the indexing operation works.

 $parfor_index_tuple_var.78 = build_tuple(items=[Var(parfor__index_68, driver.py:21), Var(parfor__index_69, driver.py:21)]) ['$parfor_index_tuple_var.78', 'parfor__index_68', 'parfor__index_69']

In dpjit's case, either the slice is done incorrectly or the sliced arrays are passed in to the parfor incorrectly.

@diptorupd
Copy link
Contributor

A minimal reproducer

import dpnp
import numba as nb
from numba_dpex import dpjit
import numpy


@dpjit
def foo(a):
    return a[1:5]


a = dpnp.arange(10)
b = foo(a)
print("Dpnp input ", a)
print("Dpnp slice [1:5] ", b)


@nb.njit
def bar(a):
    return a[1:5]


na = numpy.arange(10)
nb = bar(na)
print("NumPy input ", na)
print("NumPy slice [1:5] ", nb)

Output

Dpnp input [0 1 2 3 4 5 6 7 8 9]
Dpnp slice [1:5] [1 1 1 1]
NumPy input [0 1 2 3 4 5 6 7 8 9]
NumPy slice [1:5] [1 2 3 4]

@diptorupd
Copy link
Contributor

LLVM IR generated for the foo function:

; ModuleID = 'foo.ll'
source_filename = "foo.ll"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@_ZN08NumbaEnv8__main__3fooB2v1B1cB2_8B1tB1JB1TB1CB3_2fB1WB1QB1AB1lB1bB1WB2_1B1yB1BB1CB2_0B1oB1RB2_6B1GB1EB1LB1EB1UB1MB1EB1LB1YB1SB1PB1GB1rB1IB1QB1MB1VB1jB1AB1QB1nB1iB1QB1cB1IB1XB1KB1QB1IB1MB1VB1wB1oB1OB1GB1KB1oB1QB1DB1DB1VB1QB1QB1RB2_1B1NB1HB1AB1SB2_2B1FB1QB2_9B1XB1gB1SB1sB2_8B1wB1mB2_4B1oB1gB1LB1EB2_0B1AE11DpnpNdArrayIxLi1E1C7mutable7alignedE = common local_unnamed_addr global i8* null

define i32 @_ZN8__main__3foo(
    { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }* noalias nocapture %retptr,
    { i8*, i32, i8* }** noalias nocapture readnone %excinfo,
    i8* %arg.a.0,
    i8* %arg.a.1,
    i64 %arg.a.2,
    i64 %arg.a.3,
    i64* %arg.a.4,
    i64 %arg.a.5.0,
    i64 %arg.a.6.0
) local_unnamed_addr {
entry:
  tail call void @NRT_incref(i8* %arg.a.0)
  %0 = icmp slt i64 %arg.a.5.0, 1
  %.65.sroa.0.2 = select i1 %0, i64 %arg.a.5.0, i64 1, !prof !0
  %1 = icmp slt i64 %arg.a.5.0, 5
  %.65.sroa.13.0 = select i1 %1, i64 %arg.a.5.0, i64 5, !prof !0
  %.153 = sub i64 %.65.sroa.13.0, %.65.sroa.0.2
  %.160.inv = icmp sgt i64 %.153, 0
  %.162 = select i1 %.160.inv, i64 %.153, i64 0
  %.178 = getelementptr i64, i64* %arg.a.4, i64 %.65.sroa.0.2
  tail call void @NRT_incref(i8* %arg.a.0)
  tail call void @NRT_decref(i8* null)
  tail call void @NRT_decref(i8* %arg.a.0)
  tail call void @NRT_incref(i8* %arg.a.0)
  tail call void @NRT_decref(i8* null)
  tail call void @NRT_decref(i8* %arg.a.0)
  %retptr.repack = getelementptr inbounds { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }, { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }* %retptr, i64 0, i32 0
  store i8* %arg.a.0, i8** %retptr.repack, align 8
  %retptr.repack73 = getelementptr inbounds { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }, { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }* %retptr, i64 0, i32 1
  store i8* %arg.a.1, i8** %retptr.repack73, align 8
  %retptr.repack75 = getelementptr inbounds { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }, { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }* %retptr, i64 0, i32 2
  store i64 %.162, i64* %retptr.repack75, align 8
  %retptr.repack77 = getelementptr inbounds { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }, { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }* %retptr, i64 0, i32 3
  store i64 %arg.a.3, i64* %retptr.repack77, align 8
  %retptr.repack79 = getelementptr inbounds { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }, { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }* %retptr, i64 0, i32 4
  store i64* %.178, i64** %retptr.repack79, align 8
  %2 = getelementptr inbounds { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }, { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }* %retptr, i64 0, i32 5, i64 0
  store i64 %.162, i64* %2, align 8
  %3 = getelementptr inbounds { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }, { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }* %retptr, i64 0, i32 6, i64 0
  store i64 %arg.a.6.0, i64* %3, align 8
  ret i32 0
}

declare void @NRT_incref(i8* noalias nocapture) local_unnamed_addr

declare void @NRT_decref(i8* noalias nocapture) local_unnamed_addr

!0 = !{!"branch_weights", i32 1, i32 99}

The relevant lines of code are:

  • %.178 = getelementptr i64, i64* %arg.a.4, i64 %.65.sroa.0.2 : creates a pointer equivalent to a.data+4.
  • store i64* %.178, i64** %retptr.repack79, align 8: sets the data pointer for the returned slice to a.data+4,
  • %.162 = select i1 %.160.inv, i64 %.153, i64 0: Equivalent to 4.
  • store i64 %.162, i64* %2, align 8: Sets the shape of the returned slice to 4.

All these look fine at first glace, with the only caveat that a.data is a USM pointer.

@diptorupd
Copy link
Contributor

A minimal reproducer

import dpnp
import numba as nb
from numba_dpex import dpjit
import numpy


@dpjit
def foo(a):
    return a[1:5]


a = dpnp.arange(10)
b = foo(a)
print("Dpnp input ", a)
print("Dpnp slice [1:5] ", b)


@nb.njit
def bar(a):
    return a[1:5]


na = numpy.arange(10)
nb = bar(na)
print("NumPy input ", na)
print("NumPy slice [1:5] ", nb)

The above issue was caused by a bug in unboxing of a dpnp.ndarray and has been fixed by:

c367015

@diptorupd
Copy link
Contributor

One more reproducer:

import dpnp
from numba_dpex import dpjit

def foo2(a,b):
    b[:,:,0] = a

a1 = dpnp.arange(64)
a1 = a1.reshape(4,16)
b1 = dpnp.empty((4,16,4))
dpjit(foo2)(a1, b1)

print(b1)

The above example still produces incorrect results even after c367015. The reason seems to be incorrect indexing in the kernel generated for the expression b[:,:,0] = a.

@diptorupd
Copy link
Contributor

@adarshyoga @mingjie-intel I am yet to fully confirm, but here is the most likely scenario causing the bug:

  • The slice function is handled by Numba and we get back an offset pointer into a multi-dimensional array.
  • The assignment of values into the slice based off the offset pointer is done in a kernel that is generated via a parfor.
  • However, as the slice does not necessarily represent contiguous data, the store into the slice inside the parfor-kernel should account for strides. Currently, we are not doing that and ending up with wrong results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants