Performance: in-place dpctl.tensor.add with strides #1278
Comments
This was addressed and should be closed.
Checked timings again.

For Xeon CPU:

In [2]: import dpctl.tensor as dpt
   ...: a = dpt.ones((8192, 8192), dtype='i4', device='cpu')
   ...: b = dpt.ones((8192 + 2, 8192 + 2), dtype='i4', device='cpu')
In [3]: q = a.sycl_queue
In [4]: %timeit b[2:, 2:] += a; q.wait()
6.59 ms ± 748 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

For i7-1185G7:

In [2]: import dpctl.tensor as dpt
   ...: a = dpt.ones((8192, 8192), dtype='i4', device='cpu')
   ...: b = dpt.ones((8192 + 2, 8192 + 2), dtype='i4', device='cpu')
In [3]: q = a.sycl_queue
In [4]: %timeit b[2:, 2:] += a; q.wait()
72.2 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@oleksandr-pavlyk should this be closed?
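For anyone reading the timings above, here is a rough sketch of how they could be put in context. It uses only the standard library (no dpctl needed) and rests on two assumptions that are not stated in the thread: that `b` is C-contiguous, so the view `b[2:, 2:]` is non-contiguous with a row stride of 8194 elements, and that the in-place add moves about three arrays' worth of data (read `a`, read the view of `b`, write the view of `b`). The helper names `view_layout` and `effective_gb_per_s` are made up for illustration.

```python
# Back-of-the-envelope numbers for the benchmark above (standard library only).
ITEMSIZE = 4  # bytes per 'i4' element
N = 8192      # a has shape (N, N)
PAD = 2       # b has shape (N + PAD, N + PAD)


def view_layout(n=N, pad=PAD):
    """Shape, element strides and element offset of b[pad:, pad:].

    Assumes b is C-contiguous with shape (n + pad, n + pad).
    """
    row_stride = n + pad  # elements between consecutive rows of b
    shape = (n, n)
    strides = (row_stride, 1)
    offset = pad * row_stride + pad  # skip `pad` rows, then `pad` columns
    return shape, strides, offset


def effective_gb_per_s(seconds, n=N, itemsize=ITEMSIZE, streams=3):
    """Effective throughput in GB/s (1 GB = 1e9 bytes).

    Assumed traffic model: `streams` passes over n*n elements
    (read a, read the b view, write the b view). Cache effects and
    write-allocate traffic are ignored.
    """
    return streams * n * n * itemsize / seconds / 1e9


if __name__ == "__main__":
    print(view_layout())
    # Mean timings as reported in the comment above.
    for label, seconds in (("Xeon", 6.59e-3), ("i7-1185G7", 72.2e-3)):
        print(f"{label}: {effective_gb_per_s(seconds):.1f} GB/s")
```

Under that traffic model the reported means correspond to roughly 122 GB/s on the Xeon and roughly 11 GB/s on the i7-1185G7. Comparing such figures against each machine's measured memory bandwidth is one possible way to judge how much headroom is left.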
I agree, the performance has improved. I think a systematic way to decide whether there are any improvements to be had is to collect the