Indexing performance #1249
Result with changes in #1300, size 4096 × 4096:
Changed hyperparameter choices to be different for CPU and GPU, resulting in a 20% performance gain on GPU. The non-recursive implementation avoids repeated USM allocations, resulting in performance gains for large arrays. Furthermore, corrected the base step kernel to accumulate in `outputT` rather than in `size_t`, which realizes additional savings when `int32` is used as the accumulator type.

Using the example from gh-1249, previously, on my Iris Xe laptop:

```
In [1]: import dpctl.tensor as dpt
   ...: ag = dpt.ones((8192, 8192), device='gpu', dtype='f4')
   ...: bg = dpt.ones((8192, 8192), device='gpu', dtype=bool)

In [2]: cg = ag[bg]

In [3]: dpt.all(cg == dpt.reshape(ag, -1))
Out[3]: usm_ndarray(True)

In [4]: %timeit -n 10 -r 3 cg = ag[bg]
212 ms ± 56 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
```

while with this change:

```
In [4]: %timeit -n 10 -r 3 cg = ag[bg]
178 ms ± 24.2 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
```
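A rough NumPy sketch of the idea behind mask extraction via an inclusive scan (only an illustration of the accumulator's role, not the dpctl kernel itself; all names here are made up for the example):

```
# Each selected element's destination offset is the inclusive scan of the
# boolean mask at its position, minus one.  Accumulating in a 32-bit type
# (when the selected count fits) moves less data per element than a
# 64-bit size_t-like counter.
import numpy as np

a = np.arange(10, dtype=np.float32)
mask = a % 3 == 0

pos = np.cumsum(mask, dtype=np.int32)   # inclusive scan with int32 accumulator
out = np.empty(int(pos[-1]), dtype=a.dtype)
for i in range(a.size):                 # scatter the selected elements
    if mask[i]:
        out[pos[i] - 1] = a[i]

assert np.array_equal(out, a[mask])
```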
The chunk update kernels processed consecutive elements in contiguous memory, hence the sub-group memory access pattern was sub-optimal (no coalescing). This PR changes these kernels to process n_wi elements which are a sub-group size apart, improving the memory access pattern.

Running a micro-benchmark based on code from gh-1249 (for shape = (n, n) where n = 4096) with this change:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.010703916665753004
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.01079747307597211
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.010864820314088353
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index.py
0.023878061203975922
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index.py
0.023666468500677083
```

while before:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.011415911812542213
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.011722088705196424
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu python index.py
0.030126182353813893
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu python index.py
0.030459783371986338
```

Running the same code using NumPy (same size):

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index_np.py
0.01416253090698134
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index_np.py
0.014979530811413296
```

The reason the Level-Zero device is slower has to do with a slow allocation/deallocation bug. The OpenCL device has better timing. With this change:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.015038836885381627
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.01527448468496678
```

before:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.01758851639115838
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.017089676241286926
```
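To illustrate the access-pattern change, here is a standalone Python sketch (not the actual SYCL kernel; the sub-group size and n_wi values are only example numbers):

```
# Indices touched by each work-item ("lane") of one sub-group when each
# work-item handles n_wi elements.
sg_size, n_wi = 8, 4

def consecutive(lane):
    # before: each work-item walks its own contiguous chunk, so on any
    # given step the lanes of a sub-group touch addresses n_wi apart
    return [lane * n_wi + k for k in range(n_wi)]

def strided(lane):
    # after: a work-item's elements are a sub-group size apart, so on each
    # step the whole sub-group reads one contiguous span of memory
    return [lane + k * sg_size for k in range(n_wi)]

print([consecutive(lane)[0] for lane in range(sg_size)])  # [0, 4, 8, ..., 28] scattered
print([strided(lane)[0] for lane in range(sg_size)])      # [0, 1, 2, ..., 7] contiguous
```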
Using a Core Ultra 7 155U laptop with an integrated Arc GPU and a discrete NVIDIA GPU, running on WSL (Ubuntu).

SYCL platform listings (collapsed)
Scripts used for timing:

```
# index.py
import dpctl.tensor as dpt
import time
n = 4096
shape = (n, n,)
a = dpt.ones(shape, dtype='f4')
m = dpt.ones(shape, dtype='b1')
reps = 54
t0 = time.perf_counter()
for _ in range(reps):
    r = a[m]
a.sycl_queue.wait()  # wait for queued kernels to finish before taking the end timestamp
t1 = time.perf_counter()
print((t1-t0)/reps)
```

```
# index_np.py
import numpy as np
import time
n = 4096
shape = (n, n,)
a = np.ones(shape, dtype='f4')
m = np.ones(shape, dtype='b1')
reps = 54
t0 = time.perf_counter()
for _ in range(reps):
    r = a[m]
t1 = time.perf_counter()
print((t1-t0)/reps)
```

I am getting about 16 ms run-time with stock NumPy (version 2.2.1).
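For completeness, a small cross-check sketch (not part of the original scripts) that confirms the dpctl result matches NumPy on the same data before timing; it reuses the shapes and dtypes from the scripts above:

```
import numpy as np
import dpctl.tensor as dpt

n = 4096
a = dpt.ones((n, n), dtype='f4')
m = dpt.ones((n, n), dtype='b1')

r = a[m]              # also serves as a warm-up call before timing
a.sycl_queue.wait()
assert np.array_equal(dpt.asnumpy(r), np.ones(n * n, dtype='f4'))
```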
Here is the summary of script runtimes per device:
As such, I consider this issue resolved.