Indexing performance #1249

Open
npolina4 opened this issue Jun 14, 2023 · 3 comments
Labels
performance Code performance

Comments

@npolina4
Collaborator

npolina4 commented Jun 14, 2023

```
import dpctl.tensor as dpt
a = dpt.ones((8192, 8192), device='cpu', dtype='f4')
b = dpt.ones((8192, 8192), device='cpu', dtype=bool)
%timeit a[b]
# 211 ms ± 6.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

import numpy
a_np = numpy.ones((8192, 8192), dtype='f4')
b_np = numpy.ones((8192, 8192), dtype=bool)
%timeit a_np[b_np]
# 87.1 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
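
For comparison, a possible variant of the reproducer (not part of the original report, and assuming a GPU device is available) runs the same masked extraction on the default GPU device and checks the result against NumPy via dpt.asnumpy:

```
import numpy as np
import dpctl.tensor as dpt

ag = dpt.ones((8192, 8192), device='gpu', dtype='f4')
bg = dpt.ones((8192, 8192), device='gpu', dtype=bool)
cg = ag[bg]  # every mask entry is True, so this is a flattened copy of ag

assert np.array_equal(dpt.asnumpy(cg), np.ones(8192 * 8192, dtype='f4'))
%timeit ag[bg]
```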
@oleksandr-pavlyk
Collaborator

This should be improved by the changes in gh-1300. @npolina4 could you please post timeit results on the same machine you used to obtain the numbers reported in the original comment?

@npolina4
Collaborator Author

Results with the changes from #1300:

| Size | numpy | cpu | gpu |
| --- | --- | --- | --- |
| 8192 × 8192 | 105 ms | 205 ms | 115 ms |
| 4096 × 4096 | 24.5 ms | 45–80 ms | 21.4 ms |

@oleksandr-pavlyk oleksandr-pavlyk added the performance Code performance label Aug 15, 2023
oleksandr-pavlyk added a commit that referenced this issue Dec 15, 2023
Changed hyperparameter choices to differ between CPU and GPU, resulting
in a 20% performance gain on GPU.

The non-recursive implementation avoids repeated USM allocations,
resulting in performance gains for large arrays.

Furthermore, corrected the base step kernel to accumulate in outputT rather
than in size_t, which realizes additional savings when int32 is used as the
accumulator type.
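
As a rough NumPy sketch (not the actual dpctl SYCL kernels) of why the accumulator type matters here: masked extraction can be built on a cumulative sum of the mask, which gives each selected element its destination offset in the output, so accumulating in a narrower index type such as int32 moves less data than accumulating in size_t.

```
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((8, 8)).astype('f4')
b = a > 0.5

flat_a, flat_b = a.ravel(), b.ravel()
offsets = np.cumsum(flat_b, dtype=np.int32)   # inclusive scan in a 32-bit accumulator
sel = np.flatnonzero(flat_b)                  # positions of selected elements
dest = offsets[sel] - 1                       # 0-based destination offsets from the scan
out = np.empty(int(offsets[-1]), dtype=flat_a.dtype)
out[dest] = flat_a[sel]                       # scatter selected values

assert np.array_equal(out, a[b])              # matches masked extraction
```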

Using example from gh-1249, previously, on my Iris Xe laptop:

```
In [1]: import dpctl.tensor as dpt
   ...: ag = dpt.ones((8192, 8192), device='gpu', dtype='f4')
   ...: bg = dpt.ones((8192, 8192), device='gpu', dtype=bool)

In [2]: cg = ag[bg]

In [3]: dpt.all(cg == dpt.reshape(ag, -1))
Out[3]: usm_ndarray(True)

In [4]: %timeit -n 10 -r 3 cg = ag[bg]
212 ms ± 56 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
```

while with this change:

```
In [4]: %timeit -n 10 -r 3 cg = ag[bg]
178 ms ± 24.2 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
```
oleksandr-pavlyk added a commit that referenced this issue Dec 19, 2023
oleksandr-pavlyk added a commit that referenced this issue Jan 8, 2024
oleksandr-pavlyk added a commit that referenced this issue Dec 9, 2024
The chunk update kernels processed consecutive elements in contiguous
memory, hence the sub-group memory access pattern was sub-optimal (no
coalescing).

This PR changes these kernels to process n_wi elements that are a
sub-group size apart, improving the memory access pattern.
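
A hypothetical pure-Python illustration (not the actual kernel code) of the change in per-lane index pattern; lid is the lane id within a sub-group of size sg_size, and each lane handles n_wi elements:

```
def consecutive_indices(base, lid, sg_size, n_wi):
    # before: each lane walks its own contiguous chunk, so at any given step
    # the lanes of a sub-group touch addresses n_wi apart (no coalescing)
    return [base + lid * n_wi + k for k in range(n_wi)]

def strided_indices(base, lid, sg_size, n_wi):
    # after: lanes step through elements a sub-group size apart, so at each
    # step the sub-group touches sg_size consecutive addresses (coalesced)
    return [base + lid + k * sg_size for k in range(n_wi)]

sg_size, n_wi = 8, 4
print([consecutive_indices(0, lid, sg_size, n_wi)[0] for lid in range(sg_size)])
# [0, 4, 8, 12, 16, 20, 24, 28]
print([strided_indices(0, lid, sg_size, n_wi)[0] for lid in range(sg_size)])
# [0, 1, 2, 3, 4, 5, 6, 7]
```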

Running a micro-benchmark based on code from gh-1249 (for
shape = (n, n) where n = 4096) with this change:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.010703916665753004
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.01079747307597211
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.010864820314088353
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index.py
0.023878061203975922
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index.py
0.023666468500677083
```

while before:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.011415911812542213
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.011722088705196424
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu python index.py
0.030126182353813893
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu python index.py
0.030459783371986338
```

Running the same code using NumPy (same size):

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index_np.py
0.01416253090698134
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index_np.py
0.014979530811413296
```

The reason the Level-Zero device is slower has to do with a slow allocation/deallocation bug.

The OpenCL device has better timing. With this change:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.015038836885381627
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.01527448468496678
```

before:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.01758851639115838
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.017089676241286926
```
oleksandr-pavlyk added a commit that referenced this issue Dec 10, 2024
@oleksandr-pavlyk
Collaborator

Using a Core Ultra 7 155U laptop with an integrated Arc GPU and a discrete NVIDIA GPU, running on WSL (Ubuntu):

SYCL platform listings:

```
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Graphics [0x7d45] 12.70.4 [1.6.31294.120000]
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 7 155U OpenCL 3.0 (Build 0) [2024.18.12.0.05_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Graphics [0x7d45] OpenCL 3.0 NEO  [24.39.31294.12]
[cuda:gpu][cuda:0] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 3050 4GB Laptop GPU 8.6 [CUDA 12.6]
```
Scripts used for timing:

```
# index.py
import dpctl.tensor as dpt
import time

n = 4096
shape = (n, n,)

a = dpt.ones(shape, dtype='f4')
m = dpt.ones(shape, dtype='b1')

reps = 54
t0 = time.perf_counter()
for _ in range(reps):
    r = a[m]
a.sycl_queue.wait()
t1 = time.perf_counter()

print((t1-t0)/reps)
```

```
# index_np.py
import numpy as np
import time

n = 4096
shape = (n, n,)

a = np.ones(shape, dtype='f4')
m = np.ones(shape, dtype='b1')

reps = 54
t0 = time.perf_counter()
for _ in range(reps):
    r = a[m]
t1 = time.perf_counter()

print((t1-t0)/reps)
```

I am getting about a 16 ms run-time with stock NumPy (version 2.2.1):

```
$ python -c "import numpy as np; print(np.__version__)"
2.2.1
$ python index_np.py
0.017485273816553806
$ for i in `seq 0 5`; do python index_np.py; done
0.022518442094291526
0.017075111110763694
0.015882405314456532
0.01638638164796349
0.015575324814697658
0.01593349300135203
```

Here is a summary of the script runtimes per device:

| Device | opencl:cpu | opencl:gpu | level_zero:gpu | cuda:gpu |
| --- | --- | --- | --- | --- |
| Time, ms | 38 | 14 | 25 | 9.5 |

Details:

```
$ ONEAPI_DEVICE_SELECTOR=opencl:cpu python index.py
0.038533492259577744
$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.014323775669456355
$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu python index.py
0.02539329398078499
$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu python index.py
0.02539329398078499
```

As such, I consider this issue resolved on the dpctl side.
