Numba-dpex implementation of Rambo fails validation #978

adarshyoga · 2023-03-21T04:12:18Z

Rambo workload's numba-dpex implementation with dpnp calls fails validation. In my initial analysis, I see that the results produced are different to what the numpy/python implementation produces. The numba-dpex prange implementation passes validation. So this problem seems to be specific to dpnp calls.

How to reproduce:

Follow instructions to setup dpbench
Run Rambo - python -c "import dpbench; dpbench.run_benchmark("rambo")

diptorupd · 2023-04-19T21:23:49Z

Reproducer:

import dpnp
import numba as nb
from numba_dpex import dpjit
import numpy

dpnp.pi = numpy.pi

def rambo_ref(nevts, nout, C1, F1, Q1, output):
    C = 2.0 * C1 - 1.0
    S = numpy.sqrt(1 - numpy.square(C))
    F = 2.0 * numpy.pi * F1
    Q = -numpy.log(Q1)

    output[:, :, 0] = Q
    output[:, :, 1] = Q * S * numpy.sin(F)
    output[:, :, 2] = Q * S * numpy.cos(F)
    output[:, :, 3] = Q * C

@dpjit
def rambo_dpjit(nevts, nout, C1, F1, Q1, output):
    C = 2.0 * C1 - 1.0
    S = dpnp.sqrt(1 - dpnp.square(C))
    F = 2.0 * dpnp.pi * F1
    Q = -dpnp.log(Q1)

    output[:, :, 0] = Q
    output[:, :, 1] = Q * S * dpnp.sin(F)
    output[:, :, 2] = Q * S * dpnp.cos(F)
    output[:, :, 3] = Q * C

@nb.njit(parallel=True)
def rambo_njit(nevts, nout, C1, F1, Q1, output):
    C = 2.0 * C1 - 1.0
    S = numpy.sqrt(1 - numpy.square(C))
    F = 2.0 * numpy.pi * F1
    Q = -numpy.log(Q1)

    output[:, :, 0] = Q
    output[:, :, 1] = Q * S * numpy.sin(F)
    output[:, :, 2] = Q * S * numpy.cos(F)
    output[:, :, 3] = Q * C



def initialize(nevts, nout):

    C1 = numpy.empty((nevts, nout))
    F1 = numpy.empty((nevts, nout))
    Q1 = numpy.empty((nevts, nout))

    numpy.random.seed(777)
    for i in range(nevts):
        for j in range(nout):
            C1[i, j] = numpy.random.rand()
            F1[i, j] = numpy.random.rand()
            Q1[i, j] = numpy.random.rand() * numpy.random.rand()

    return (
        C1,
        F1,
        Q1,
        numpy.empty((nevts, nout, 4)),
    )

def test_rambo(nevts=32768, nout=4):
    C1,F1,Q1, output = initialize(nevts, nout)
    output_n = numpy.copy(output)
    # Copy C1, F1,Q1 to dpnp
    C1d = dpnp.asarray(C1)
    F1d = dpnp.asarray(F1)
    Q1d = dpnp.asarray(Q1)
    output_d = dpnp.asarray(output)

    rambo_ref(nevts, nout, C1, F1, Q1, output)
    rambo_njit(nevts, nout, C1, F1, Q1, output_n)
    rambo_dpjit(nevts, nout, C1d, F1d, Q1d, output_d)

    return output, output_n, output_d

output, output_n, output_d = test_rambo()

print(numpy.allclose(output[:, :, 0], output_d[:,:,0]))
print(numpy.allclose(output[:, :, 1], output_d[:,:,1]))
print(numpy.allclose(output[:, :, 2], output_d[:,:,2]))
print(numpy.allclose(output[:, :, 3], output_d[:,:,3]))

diptorupd · 2023-04-19T21:28:41Z

The following is an abridged version of the Numba IR generated for the dpjit implementation:

label 0:
    _40binary__multiply_17 = arg(0, name=_40binary__multiply_17) ['_40binary__multiply_17']
    C1 = arg(1, name=C1)                     ['C1']
    F1 = arg(2, name=F1)                     ['F1']
    Q1 = arg(3, name=Q1)                     ['Q1']
    _subarr_48 = arg(4, name=_subarr_48)     ['_subarr_48']
    _subarr_53 = arg(5, name=_subarr_53)     ['_subarr_53']
    _subarr_58 = arg(6, name=_subarr_58)     ['_subarr_58']
    _subarr_63 = arg(7, name=_subarr_63)     ['_subarr_63']
    $2load_global.0.730 = global(dpex: <module 'numba_dpex' from '/localdisk/work/diptorup/devel/numba-dpex/numba_dpex/__init__.py'>) ['$2load_global.0.730']
    $4load_method.1.731 = getattr(value=$2load_global.0.730, attr=get_global_id) ['$2load_global.0.730', '$4load_method.1.731']
    $const6.2.732 = const(int, 0)            ['$const6.2.732']
    parfor__index_68 = call $4load_method.1.731($const6.2.732, func=$4load_method.1.731, args=[Var($const6.2.732, <string>:2)], kws=(), vararg=None, varkwarg=None, target=None) ['$4load_method.1.731', '$const6.2.732', 'parfor__index_68']
    $12load_global.4.733 = global(dpex: <module 'numba_dpex' from '/localdisk/work/diptorup/devel/numba-dpex/numba_dpex/__init__.py'>) ['$12load_global.4.733']
    $14load_method.5.734 = getattr(value=$12load_global.4.733, attr=get_global_id) ['$12load_global.4.733', '$14load_method.5.734']
    $const16.6.735 = const(int, 1)           ['$const16.6.735']
    parfor__index_69 = call $14load_method.5.734($const16.6.735, func=$14load_method.5.734, args=[Var($const16.6.735, <string>:3)], kws=(), vararg=None, varkwarg=None, target=None) ['$14load_method.5.734', '$const16.6.735', 'parfor__index_69']
    jump 6                                   []
label 6:
    $parfor_index_tuple_var.78 = build_tuple(items=[Var(parfor__index_68, driver.py:21), Var(parfor__index_69, driver.py:21)]) ['$parfor_index_tuple_var.78', 'parfor__index_68', 'parfor__index_69']
    ...
    _subarr_48[$parfor_index_tuple_var.78] = $expr_out_var.123 ['$expr_out_var.123', '$parfor_index_tuple_var.78', '_subarr_48']
    $arg_out_var.139 = $expr_out_var.123 * $expr_out_var.92 ['$arg_out_var.139', '$expr_out_var.123', '$expr_out_var.92']
    $push_global_to_block.724 = global(dpnp: <module 'dpnp' from '/localdisk/work/diptorup/.conda/envs/dpex-devel/lib/python3.10/site-packages/dpnp-0.11.1-py3.10-linux-x86_64.egg/dpnp/__init__.py'>) ['$push_global_to_block.724']
    $90load_method.41.144 = getattr(value=$push_global_to_block.724, attr=sin) ['$90load_method.41.144', '$push_global_to_block.724']
    $arg_out_var.142 = call $90load_method.41.144($expr_out_var.110, func=$90load_method.41.144, args=[Var($expr_out_var.110, driver.py:23)], kws=(), vararg=None, varkwarg=None, target=None) ['$90load_method.41.144', '$arg_out_var.142', '$expr_out_var.110']
    $expr_out_var.137 = $arg_out_var.139 * $arg_out_var.142 ['$arg_out_var.139', '$arg_out_var.142', '$expr_out_var.137']
    _subarr_53[$parfor_index_tuple_var.78] = $expr_out_var.137 ['$expr_out_var.137', '$parfor_index_tuple_var.78', '_subarr_53']
    $arg_out_var.156 = $expr_out_var.123 * $expr_out_var.92 ['$arg_out_var.156', '$expr_out_var.123', '$expr_out_var.92']
    $push_global_to_block.725 = global(dpnp: <module 'dpnp' from '/localdisk/work/diptorup/.conda/envs/dpex-devel/lib/python3.10/site-packages/dpnp-0.11.1-py3.10-linux-x86_64.egg/dpnp/__init__.py'>) ['$push_global_to_block.725']
    $126load_method.60.161 = getattr(value=$push_global_to_block.725, attr=cos) ['$126load_method.60.161', '$push_global_to_block.725']
    $arg_out_var.159 = call $126load_method.60.161($expr_out_var.110, func=$126load_method.60.161, args=[Var($expr_out_var.110, driver.py:23)], kws=(), vararg=None, varkwarg=None, target=None) ['$126load_method.60.161', '$arg_out_var.159', '$expr_out_var.110']
    $expr_out_var.154 = $arg_out_var.156 * $arg_out_var.159 ['$arg_out_var.156', '$arg_out_var.159', '$expr_out_var.154']
    _subarr_58[$parfor_index_tuple_var.78] = $expr_out_var.154 ['$expr_out_var.154', '$parfor_index_tuple_var.78', '_subarr_58']
    $expr_out_var.171 = $expr_out_var.123 * $expr_out_var.77 ['$expr_out_var.123', '$expr_out_var.171', '$expr_out_var.77']
    _subarr_63[$parfor_index_tuple_var.78] = $expr_out_var.171 ['$expr_out_var.171', '$parfor_index_tuple_var.78', '_subarr_63']
    jump 7                                   []
label 7:
    $const26.9.736 = const(NoneType, None)   ['$const26.9.736']
    $28return_value.10.737 = cast(value=$const26.9.736) ['$28return_value.10.737', '$const26.9.736']
    return $28return_value.10.737            ['$28return_value.10.737']

The issue stems from incorrect index computation for the output array. The index is computed using just i (get_global_id(0)) and j (get_global_id(1)), disregarding the third index variable. Refer the following instruction:

 $parfor_index_tuple_var.78 = build_tuple(items=[Var(parfor__index_68, driver.py:21), Var(parfor__index_69, driver.py:21)]) ['$parfor_index_tuple_var.78', 'parfor__index_68', 'parfor__index_69']

diptorupd · 2023-04-19T21:39:36Z

On further analysis, for a parfor the slicing operation of output is hoisted out and each slice is passed as a separate input argument to the parfor. For that reason, the indexing operation works.

 $parfor_index_tuple_var.78 = build_tuple(items=[Var(parfor__index_68, driver.py:21), Var(parfor__index_69, driver.py:21)]) ['$parfor_index_tuple_var.78', 'parfor__index_68', 'parfor__index_69']

In dpjit's case, either the slice is done incorrectly or the sliced arrays are passed in to the parfor incorrectly.

diptorupd · 2023-04-20T02:14:32Z

A minimal reproducer

import dpnp
import numba as nb
from numba_dpex import dpjit
import numpy


@dpjit
def foo(a):
    return a[1:5]


a = dpnp.arange(10)
b = foo(a)
print("Dpnp input ", a)
print("Dpnp slice [1:5] ", b)


@nb.njit
def bar(a):
    return a[1:5]


na = numpy.arange(10)
nb = bar(na)
print("NumPy input ", na)
print("NumPy slice [1:5] ", nb)

Output

Dpnp input [0 1 2 3 4 5 6 7 8 9]
Dpnp slice [1:5] [1 1 1 1]
NumPy input [0 1 2 3 4 5 6 7 8 9]
NumPy slice [1:5] [1 2 3 4]

diptorupd · 2023-04-20T03:08:21Z

LLVM IR generated for the foo function:

; ModuleID = 'foo.ll'
source_filename = "foo.ll"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@_ZN08NumbaEnv8__main__3fooB2v1B1cB2_8B1tB1JB1TB1CB3_2fB1WB1QB1AB1lB1bB1WB2_1B1yB1BB1CB2_0B1oB1RB2_6B1GB1EB1LB1EB1UB1MB1EB1LB1YB1SB1PB1GB1rB1IB1QB1MB1VB1jB1AB1QB1nB1iB1QB1cB1IB1XB1KB1QB1IB1MB1VB1wB1oB1OB1GB1KB1oB1QB1DB1DB1VB1QB1QB1RB2_1B1NB1HB1AB1SB2_2B1FB1QB2_9B1XB1gB1SB1sB2_8B1wB1mB2_4B1oB1gB1LB1EB2_0B1AE11DpnpNdArrayIxLi1E1C7mutable7alignedE = common local_unnamed_addr global i8* null

define i32 @_ZN8__main__3foo(
    { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }* noalias nocapture %retptr,
    { i8*, i32, i8* }** noalias nocapture readnone %excinfo,
    i8* %arg.a.0,
    i8* %arg.a.1,
    i64 %arg.a.2,
    i64 %arg.a.3,
    i64* %arg.a.4,
    i64 %arg.a.5.0,
    i64 %arg.a.6.0
) local_unnamed_addr {
entry:
  tail call void @NRT_incref(i8* %arg.a.0)
  %0 = icmp slt i64 %arg.a.5.0, 1
  %.65.sroa.0.2 = select i1 %0, i64 %arg.a.5.0, i64 1, !prof !0
  %1 = icmp slt i64 %arg.a.5.0, 5
  %.65.sroa.13.0 = select i1 %1, i64 %arg.a.5.0, i64 5, !prof !0
  %.153 = sub i64 %.65.sroa.13.0, %.65.sroa.0.2
  %.160.inv = icmp sgt i64 %.153, 0
  %.162 = select i1 %.160.inv, i64 %.153, i64 0
  %.178 = getelementptr i64, i64* %arg.a.4, i64 %.65.sroa.0.2
  tail call void @NRT_incref(i8* %arg.a.0)
  tail call void @NRT_decref(i8* null)
  tail call void @NRT_decref(i8* %arg.a.0)
  tail call void @NRT_incref(i8* %arg.a.0)
  tail call void @NRT_decref(i8* null)
  tail call void @NRT_decref(i8* %arg.a.0)
  %retptr.repack = getelementptr inbounds { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }, { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }* %retptr, i64 0, i32 0
  store i8* %arg.a.0, i8** %retptr.repack, align 8
  %retptr.repack73 = getelementptr inbounds { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }, { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }* %retptr, i64 0, i32 1
  store i8* %arg.a.1, i8** %retptr.repack73, align 8
  %retptr.repack75 = getelementptr inbounds { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }, { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }* %retptr, i64 0, i32 2
  store i64 %.162, i64* %retptr.repack75, align 8
  %retptr.repack77 = getelementptr inbounds { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }, { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }* %retptr, i64 0, i32 3
  store i64 %arg.a.3, i64* %retptr.repack77, align 8
  %retptr.repack79 = getelementptr inbounds { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }, { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }* %retptr, i64 0, i32 4
  store i64* %.178, i64** %retptr.repack79, align 8
  %2 = getelementptr inbounds { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }, { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }* %retptr, i64 0, i32 5, i64 0
  store i64 %.162, i64* %2, align 8
  %3 = getelementptr inbounds { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }, { i8*, i8*, i64, i64, i64*, [1 x i64], [1 x i64] }* %retptr, i64 0, i32 6, i64 0
  store i64 %arg.a.6.0, i64* %3, align 8
  ret i32 0
}

declare void @NRT_incref(i8* noalias nocapture) local_unnamed_addr

declare void @NRT_decref(i8* noalias nocapture) local_unnamed_addr

!0 = !{!"branch_weights", i32 1, i32 99}

The relevant lines of code are:

%.178 = getelementptr i64, i64* %arg.a.4, i64 %.65.sroa.0.2 : creates a pointer equivalent to a.data+4.
store i64* %.178, i64** %retptr.repack79, align 8: sets the data pointer for the returned slice to a.data+4,
%.162 = select i1 %.160.inv, i64 %.153, i64 0: Equivalent to 4.
store i64 %.162, i64* %2, align 8: Sets the shape of the returned slice to 4.

All these look fine at first glace, with the only caveat that a.data is a USM pointer.

diptorupd · 2023-04-21T02:43:21Z

A minimal reproducer

import dpnp
import numba as nb
from numba_dpex import dpjit
import numpy


@dpjit
def foo(a):
    return a[1:5]


a = dpnp.arange(10)
b = foo(a)
print("Dpnp input ", a)
print("Dpnp slice [1:5] ", b)


@nb.njit
def bar(a):
    return a[1:5]


na = numpy.arange(10)
nb = bar(na)
print("NumPy input ", na)
print("NumPy slice [1:5] ", nb)

The above issue was caused by a bug in unboxing of a dpnp.ndarray and has been fixed by:

c367015

diptorupd · 2023-04-21T03:04:15Z

One more reproducer:

import dpnp
from numba_dpex import dpjit

def foo2(a,b):
    b[:,:,0] = a

a1 = dpnp.arange(64)
a1 = a1.reshape(4,16)
b1 = dpnp.empty((4,16,4))
dpjit(foo2)(a1, b1)

print(b1)

The above example still produces incorrect results even after c367015. The reason seems to be incorrect indexing in the kernel generated for the expression b[:,:,0] = a.

diptorupd · 2023-04-21T04:02:46Z

@adarshyoga @mingjie-intel I am yet to fully confirm, but here is the most likely scenario causing the bug:

The slice function is handled by Numba and we get back an offset pointer into a multi-dimensional array.
The assignment of values into the slice based off the offset pointer is done in a kernel that is generated via a parfor.
However, as the slice does not necessarily represent contiguous data, the store into the slice inside the parfor-kernel should account for strides. Currently, we are not doing that and ending up with wrong results.

This was referenced Apr 4, 2023

Adding implementations of Rambo using numpy-like vector operations IntelPython/dpbench#196

Merged

Rambo workload's numba-dpex implementation with dpnp vector operations IntelPython/dpbench#201

Merged

diptorupd mentioned this issue Apr 28, 2023

Fixes stride calculation when unboxing a dpnp ndarray #1023

Merged

4 tasks

diptorupd closed this as completed in #1023 Apr 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Numba-dpex implementation of Rambo fails validation #978

Numba-dpex implementation of Rambo fails validation #978

adarshyoga commented Mar 21, 2023

diptorupd commented Apr 19, 2023

diptorupd commented Apr 19, 2023 •

edited

Loading

diptorupd commented Apr 19, 2023

diptorupd commented Apr 20, 2023

diptorupd commented Apr 20, 2023

diptorupd commented Apr 21, 2023

diptorupd commented Apr 21, 2023

diptorupd commented Apr 21, 2023

Numba-dpex implementation of Rambo fails validation #978

Numba-dpex implementation of Rambo fails validation #978

Comments

adarshyoga commented Mar 21, 2023

diptorupd commented Apr 19, 2023

diptorupd commented Apr 19, 2023 • edited Loading

diptorupd commented Apr 19, 2023

diptorupd commented Apr 20, 2023

diptorupd commented Apr 20, 2023

diptorupd commented Apr 21, 2023

diptorupd commented Apr 21, 2023

diptorupd commented Apr 21, 2023

diptorupd commented Apr 19, 2023 •

edited

Loading