Indexing performance #1249

Open
npolina4 opened this issue Jun 14, 2023 · 3 comments
Labels
performance Code performance

Comments

@npolina4
Collaborator

npolina4 commented Jun 14, 2023

```
import dpctl.tensor as dpt
a = dpt.ones((8192, 8192), device='cpu', dtype='f4')
b = dpt.ones((8192, 8192), device='cpu', dtype=bool)
%timeit a[b]
# 211 ms ± 6.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

import numpy
a_np = numpy.ones((8192, 8192), dtype='f4')
b_np = numpy.ones((8192, 8192), dtype=bool)
%timeit a_np[b_np]
# 87.1 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
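
For comparison, a possible variant of the reproducer (not part of the original report, and assuming a GPU device is available) runs the same masked extraction on the default GPU device and checks the result against NumPy via dpt.asnumpy:

```
import numpy as np
import dpctl.tensor as dpt

ag = dpt.ones((8192, 8192), device='gpu', dtype='f4')
bg = dpt.ones((8192, 8192), device='gpu', dtype=bool)
cg = ag[bg]  # every mask entry is True, so this is a flattened copy of ag

assert np.array_equal(dpt.asnumpy(cg), np.ones(8192 * 8192, dtype='f4'))
%timeit ag[bg]
```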
@oleksandr-pavlyk
Collaborator

This should be improved by the changes in gh-1300. @npolina4 could you please post timeit results on the same machine you used to obtain the numbers reported in the original comment?

@npolina4
Collaborator Author

Results with the changes from #1300:

| Size | numpy | cpu | gpu |
| --- | --- | --- | --- |
| 8192 × 8192 | 105 ms | 205 ms | 115 ms |
| 4096 × 4096 | 24.5 ms | 45–80 ms | 21.4 ms |

@oleksandr-pavlyk oleksandr-pavlyk added the performance Code performance label Aug 15, 2023
oleksandr-pavlyk added a commit that referenced this issue Dec 15, 2023
Changed hyperparameter choices to differ between CPU and GPU, resulting
in a 20% performance gain on GPU.

The non-recursive implementation avoids repeated USM allocations,
resulting in performance gains for large arrays.

Furthermore, corrected the base step kernel to accumulate in outputT rather
than in size_t, which realizes additional savings when int32 is used as the
accumulator type.
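
As a rough NumPy sketch (not the actual dpctl SYCL kernels) of why the accumulator type matters here: masked extraction can be built on a cumulative sum of the mask, which gives each selected element its destination offset in the output, so accumulating in a narrower index type such as int32 moves less data than accumulating in size_t.

```
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((8, 8)).astype('f4')
b = a > 0.5

flat_a, flat_b = a.ravel(), b.ravel()
offsets = np.cumsum(flat_b, dtype=np.int32)   # inclusive scan in a 32-bit accumulator
sel = np.flatnonzero(flat_b)                  # positions of selected elements
dest = offsets[sel] - 1                       # 0-based destination offsets from the scan
out = np.empty(int(offsets[-1]), dtype=flat_a.dtype)
out[dest] = flat_a[sel]                       # scatter selected values

assert np.array_equal(out, a[b])              # matches masked extraction
```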

Using example from gh-1249, previously, on my Iris Xe laptop:

```
In [1]: import dpctl.tensor as dpt
   ...: ag = dpt.ones((8192, 8192), device='gpu', dtype='f4')
   ...: bg = dpt.ones((8192, 8192), device='gpu', dtype=bool)

In [2]: cg = ag[bg]

In [3]: dpt.all(cg == dpt.reshape(ag, -1))
Out[3]: usm_ndarray(True)

In [4]: %timeit -n 10 -r 3 cg = ag[bg]
212 ms ± 56 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
```

while with this change:

```
In [4]: %timeit -n 10 -r 3 cg = ag[bg]
178 ms ± 24.2 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
```
oleksandr-pavlyk added a commit that referenced this issue Dec 19, 2023
oleksandr-pavlyk added a commit that referenced this issue Jan 8, 2024
oleksandr-pavlyk added a commit that referenced this issue Dec 9, 2024
The chunk update kernels processed consecutive elements in contiguous
memory, hence the sub-group memory access pattern was sub-optimal (no
coalescing).

This PR changes these kernels to process n_wi elements that are a
sub-group size apart, improving the memory access pattern.
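
A hypothetical pure-Python illustration (not the actual kernel code) of the change in per-lane index pattern; lid is the lane id within a sub-group of size sg_size, and each lane handles n_wi elements:

```
def consecutive_indices(base, lid, sg_size, n_wi):
    # before: each lane walks its own contiguous chunk, so at any given step
    # the lanes of a sub-group touch addresses n_wi apart (no coalescing)
    return [base + lid * n_wi + k for k in range(n_wi)]

def strided_indices(base, lid, sg_size, n_wi):
    # after: lanes step through elements a sub-group size apart, so at each
    # step the sub-group touches sg_size consecutive addresses (coalesced)
    return [base + lid + k * sg_size for k in range(n_wi)]

sg_size, n_wi = 8, 4
print([consecutive_indices(0, lid, sg_size, n_wi)[0] for lid in range(sg_size)])
# [0, 4, 8, 12, 16, 20, 24, 28]
print([strided_indices(0, lid, sg_size, n_wi)[0] for lid in range(sg_size)])
# [0, 1, 2, 3, 4, 5, 6, 7]
```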

Running a micro-benchmark based on code from gh-1249 (for
shape = (n, n) where n = 4096) with this change:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.010703916665753004
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.01079747307597211
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.010864820314088353
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index.py
0.023878061203975922
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index.py
0.023666468500677083
```

while before:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.011415911812542213
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.011722088705196424
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu python index.py
0.030126182353813893
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu python index.py
0.030459783371986338
```

Running the same code using NumPy (same size):

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index_np.py
0.01416253090698134
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index_np.py
0.014979530811413296
```

The reason the Level-Zero device is slower has to do with a slow allocation/deallocation bug.

The OpenCL device has better timing. With this change:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.015038836885381627
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.01527448468496678
```

before:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.01758851639115838
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.017089676241286926
```
oleksandr-pavlyk added a commit that referenced this issue Dec 10, 2024
@oleksandr-pavlyk
Collaborator

Using a Core Ultra 7 155U laptop with an integrated Arc GPU and a discrete NVIDIA GPU, running on WSL (Ubuntu):

SYCL platform listings:

```
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Graphics [0x7d45] 12.70.4 [1.6.31294.120000]
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 7 155U OpenCL 3.0 (Build 0) [2024.18.12.0.05_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Graphics [0x7d45] OpenCL 3.0 NEO  [24.39.31294.12]
[cuda:gpu][cuda:0] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 3050 4GB Laptop GPU 8.6 [CUDA 12.6]
```
Scripts used for timing:

```
# index.py
import dpctl.tensor as dpt
import time

n = 4096
shape = (n, n,)

a = dpt.ones(shape, dtype='f4')
m = dpt.ones(shape, dtype='b1')

reps = 54
t0 = time.perf_counter()
for _ in range(reps):
    r = a[m]
a.sycl_queue.wait()
t1 = time.perf_counter()

print((t1-t0)/reps)
```

```
# index_np.py
import numpy as np
import time

n = 4096
shape = (n, n,)

a = np.ones(shape, dtype='f4')
m = np.ones(shape, dtype='b1')

reps = 54
t0 = time.perf_counter()
for _ in range(reps):
    r = a[m]
t1 = time.perf_counter()

print((t1-t0)/reps)
```

I am getting about a 16 ms run-time with stock NumPy (version 2.2.1):

```
$ python -c "import numpy as np; print(np.__version__)"
2.2.1
$ python index_np.py
0.017485273816553806
$ for i in `seq 0 5`; do python index_np.py; done
0.022518442094291526
0.017075111110763694
0.015882405314456532
0.01638638164796349
0.015575324814697658
0.01593349300135203
```

Here is a summary of the script runtimes per device:

| Device | opencl:cpu | opencl:gpu | level_zero:gpu | cuda:gpu |
| --- | --- | --- | --- | --- |
| Time, ms | 38 | 14 | 25 | 9.5 |

Details:

```
$ ONEAPI_DEVICE_SELECTOR=opencl:cpu python index.py
0.038533492259577744
$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.014323775669456355
$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu python index.py
0.02539329398078499
$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu python index.py
0.02539329398078499
```

As such, I consider this issue resolved on the dpctl side.
