sporadic inaccurate results relative to numpy if atomic add is used #711

geexie · 2022-02-17T15:51:50Z

I'm running on Gen9 and dppy 17.4 and have sporadic inaccurate results relative to numpy for the following code

import argparse
import math
import time

import dpctl
import numba
import numpy as np
import numpy.random as rnd
import numba_dppy as dppy
import numba_dppy

from numba_dppy import kernel, atomic, DEFAULT_LOCAL_SIZE
atomic_add = atomic.add

SEED = 777777
DTYPE = np.float32

@kernel(access_types={"read_only": ["a", "b"], "write_only": ["c"]})
def l2_distance_kernel(a, b, c):
    i = numba_dppy.get_global_id(0)
    j = numba_dppy.get_global_id(1)
    sub = a[i, j] - b[i, j]
    sq = sub**2
    atomic_add(c, 0, sq)

def gen_data(nopt, dims, dtype=DTYPE):
    rnd.seed(SEED)
    return rnd.random((nopt, dims)).astype(dtype), rnd.random((nopt, dims)).astype(dtype)

def l2_distance_python(a, b):
    return np.linalg.norm(a - b)

def run(sizes=3, step=2, nopt=2**20):
    parser = argparse.ArgumentParser(description="Black-Scholes")
    parser.add_argument("--iter", dest="iter", type=int, default=10)
    args = parser.parse_args()

    dims = 1

    for _ in range(sizes):

        # Use the environment variable SYCL_DEVICE_FILTER to change the default device.
        # See https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#sycl_device_filter.
        device = dpctl.select_default_device()
        print("Using device ...", device)
        device.print_device_info()

        X, Y = gen_data(nopt, dims, np.float32)
        distance = np.asarray([0.0]).astype(np.float32)
        p_dis = l2_distance_python(X, Y)

        n_dis = 0
        with dpctl.device_context(device):
            l2_distance_kernel[(X.shape[0], X.shape[1]), DEFAULT_LOCAL_SIZE](X, Y, distance)
            if int(distance) >= 0:
                n_dis = math.sqrt(distance)

        if np.allclose(n_dis, p_dis, rtol=1e-05 * np.sqrt(nopt)):
            print("Test succeeded for size", nopt, ". Python dis: ", p_dis, " Numba dis: ", n_dis, "\n")
        else:
            print("Test failed for size", nopt, ". Python dis: ", p_dis, " Numba dis: ", n_dis, "\n")

        nopt *= step

    print("Done...")


if __name__ == "__main__":
    run()

the results is the following

(dppy_bench) geexie@geek-box:~/code/dpbench$ IGC_ShaderDumpEnable=1 IGC_DumpToCurrentDir=1 ICG_DumpCompilerStats=1 NUMBA_DPPY_OFFLOAD_DIAGNOSTICS=1 NUMBA_DPPY_SAVE_IR_FILES=1 NUMBA_DPPY_FALLBACK_ON_CPU=0 python l2_distance.py 
Using device ... <dpctl.SyclDevice [backend_type.level_zero, device_type.gpu,  Intel(R) UHD Graphics [0x9bca]] at 0x7f9484631bb0>
    Name            Intel(R) UHD Graphics [0x9bca]
    Driver version  1.2.21786
    Vendor          Intel(R) Corporation
    Profile         FULL_PROFILE
    Filter string   level_zero:gpu:0

 
================================================================================
 Parallel Accelerator Optimizing:  Function l2_distance_kernel, 
/localdisk/dpbench/l2_distance.py (31)  
================================================================================
No source available
------------------------------ After Optimisation ------------------------------
Parallel structure is already optimal.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
 
------------------------------- Auto-offloading --------------------------------
Parallel structure is already optimal.
Device - 'level_zero:gpu:0'
--------------------------------------------------------------------------------
-------------------------------Generated LLVM IR--------------------------------
generated_llvm.ir
================================================================================
-----------------------------Generated LLVM Bitcode-----------------------------
generated_llvm.bc
================================================================================
--------------------------------Generated SPIRV---------------------------------
generated_spirv.spir
================================================================================
Test succeeded for size 1048576 . Python dis:  417.9472  Numba dis:  417.8261263671768 

Using device ... <dpctl.SyclDevice [backend_type.level_zero, device_type.gpu,  Intel(R) UHD Graphics [0x9bca]] at 0x7f9484631470>
    Name            Intel(R) UHD Graphics [0x9bca]
    Driver version  1.2.21786
    Vendor          Intel(R) Corporation
    Profile         FULL_PROFILE
    Filter string   level_zero:gpu:0

Test failed for size 2097152 . Python dis:  591.58044  Numba dis:  723.4867612817804 

Using device ... <dpctl.SyclDevice [backend_type.level_zero, device_type.gpu,  Intel(R) UHD Graphics [0x9bca]] at 0x7f948459e5b0>
    Name            Intel(R) UHD Graphics [0x9bca]
    Driver version  1.2.21786
    Vendor          Intel(R) Corporation
    Profile         FULL_PROFILE
    Filter string   level_zero:gpu:0

Test failed for size 4194304 . Python dis:  835.9003  Numba dis:  1100.8355917211252 

Done...

Full code of the benchmark you can find here

The text was updated successfully, but these errors were encountered:

akharche · 2022-03-16T12:49:27Z

This problem is reproduced only on Gen9 with dppy 17.4 and 18.0.

ZzEeKkAa · 2023-08-02T17:34:34Z

After updating the script to catch up all the changes it just freezes:

import argparse
import math
import time

import dpctl
import dpnp
import numba
import numpy as np
import numpy.random as rnd
import numba_dpex as dppy
import numba_dpex as numba_dppy

from numba_dpex import kernel, atomic, DEFAULT_LOCAL_SIZE
atomic_add = atomic.add

SEED = 777777
DTYPE = np.float32

#@kernel(access_types={"read_only": ["a", "b"], "write_only": ["c"]})
@kernel
def l2_distance_kernel(a, b, c):
    i = numba_dppy.get_global_id(0)
    j = numba_dppy.get_global_id(1)
    sub = a[i, j] - b[i, j]
    sq = sub**2
    atomic_add(c, 0, sq)

def gen_data(nopt, dims, dtype=DTYPE):
    rnd.seed(SEED)
    return rnd.random((nopt, dims)).astype(dtype), rnd.random((nopt, dims)).astype(dtype)

def l2_distance_python(a, b):
    return np.linalg.norm(a - b)

def run(sizes=3, step=2, nopt=2**20):
    parser = argparse.ArgumentParser(description="Black-Scholes")
    parser.add_argument("--iter", dest="iter", type=int, default=10)
    args = parser.parse_args()

    dims = 1

    for _ in range(sizes):

        # Use the environment variable SYCL_DEVICE_FILTER to change the default device.
        # See https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#sycl_device_filter.
        device = dpctl.select_default_device()
        print("Using device ...", device)
        device.print_device_info()

        X, Y = gen_data(nopt, dims, np.float32)
        distance = np.asarray([0.0]).astype(np.float32)
        p_dis = l2_distance_python(X, Y)
        X, Y, distance = dpnp.array(X), dpnp.array(Y), dpnp.array(distance)

        n_dis = 0
        with dpctl.device_context(device):
            l2_distance_kernel[numba_dppy.Range(X.shape[0], X.shape[1])](X, Y, distance)
            if int(distance) >= 0:
                n_dis = math.sqrt(distance)

        if np.allclose(n_dis, p_dis, rtol=1e-05 * np.sqrt(nopt)):
            print("Test succeeded for size", nopt, ". Python dis: ", p_dis, " Numba dis: ", n_dis, "\n")
        else:
            print("Test failed for size", nopt, ". Python dis: ", p_dis, " Numba dis: ", n_dis, "\n")

        nopt *= step

    print("Done...")


if __name__ == "__main__":
    run()

diptorupd · 2024-03-30T04:35:25Z

Updated the reproducer to latest API and I can reproduce the freeze/deadlock reported previously:

import argparse
import math
import dpctl
import dpnp
import numpy as np
import numpy.random as rnd
from numba_dpex import kernel_api as kapi

from numba_dpex import kernel, call_kernel


SEED = 777777
DTYPE = np.float32


@kernel
def l2_distance_kernel(item, a, b, c):
    i = item.get_id(0)
    j = item.get_id(1)
    sub = a[i, j] - b[i, j]
    sq = sub**2
    sq_aref = kapi.AtomicRef(c, 0)
    sq_aref.fetch_add(sq)


def gen_data(nopt, dims, dtype=DTYPE):
    rnd.seed(SEED)
    return rnd.random((nopt, dims)).astype(dtype), rnd.random(
        (nopt, dims)
    ).astype(dtype)


def l2_distance_python(a, b):
    return np.linalg.norm(a - b)


def run(sizes=3, step=2, nopt=2**20):
    parser = argparse.ArgumentParser(description="Black-Scholes")
    parser.add_argument("--iter", dest="iter", type=int, default=10)
    args = parser.parse_args()

    dims = 1

    for _ in range(sizes):

        # Use the environment variable SYCL_DEVICE_FILTER to change the default device.
        # See https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#sycl_device_filter.
        device = dpctl.select_default_device()
        print("Using device ...", device)
        device.print_device_info()

        X, Y = gen_data(nopt, dims, np.float32)
        distance = np.asarray([0.0]).astype(np.float32)
        p_dis = l2_distance_python(X, Y)
        X, Y, distance = dpnp.array(X), dpnp.array(Y), dpnp.array(distance)

        n_dis = 0
        print(distance)
        print("0000000000000000000000000000")
        call_kernel(
            l2_distance_kernel,
            kapi.Range(X.shape[0], X.shape[1]),
            X,
            Y,
            distance,
        )
        print("10000000000000000000000000000")
        if int(distance) >= 0:
            n_dis = math.sqrt(distance)

        if np.allclose(n_dis, p_dis, rtol=1e-05 * np.sqrt(nopt)):
            print(
                "Test succeeded for size",
                nopt,
                ". Python dis: ",
                p_dis,
                " Numba dis: ",
                n_dis,
                "\n",
            )
        else:
            print(
                "Test failed for size",
                nopt,
                ". Python dis: ",
                p_dis,
                " Numba dis: ",
                n_dis,
                "\n",
            )

        nopt *= step

    print("Done...")


if __name__ == "__main__":
    run()

diptorupd · 2024-03-30T14:13:00Z

Updated the reproducer to latest API and I can reproduce the freeze/deadlock reported previously:

I experience the issue on a Gen9 integrated graphics only at problem size 2**18 and higher. One a Gen9 Xeon CPU the latest code version I added works as expected without inaccuracy or freeze.

I think the issue right now is that for 2**18 or higher problem sizes the number of work items that get launched leads to either slowness due to contention for the global atomic operation or some other problem.

I will next write a dpc++ example to verify what happens when we run a similar code in C++.

diptorupd self-assigned this Oct 5, 2022

diptorupd mentioned this issue Oct 5, 2022

Improvements to atomic operations in numba-dpex kernels #783

Closed

8 tasks

diptorupd added the user User submitted issue label Oct 18, 2022

diptorupd mentioned this issue Jul 31, 2023

Remove atomic emulation #1103

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sporadic inaccurate results relative to numpy if atomic add is used #711

sporadic inaccurate results relative to numpy if atomic add is used #711

geexie commented Feb 17, 2022

akharche commented Mar 16, 2022

ZzEeKkAa commented Aug 2, 2023

diptorupd commented Mar 30, 2024

diptorupd commented Mar 30, 2024

sporadic inaccurate results relative to numpy if atomic add is used #711

sporadic inaccurate results relative to numpy if atomic add is used #711

Comments

geexie commented Feb 17, 2022

akharche commented Mar 16, 2022

ZzEeKkAa commented Aug 2, 2023

diptorupd commented Mar 30, 2024

diptorupd commented Mar 30, 2024