Performance: in-place dpctl.tensor.add with strides #1278

Open · npolina4 opened this issue Jul 12, 2023 · 3 comments

@npolina4 (Collaborator)

import dpctl.tensor as dpt
a = dpt.ones((8192, 8192), dtype='i4', device='cpu')
b = dpt.ones((8192 + 2, 8192 + 2), dtype='i4', device='cpu')
%timeit b[2:, 2:]+=a
#209 ms ± 36.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

import numpy
a_np = numpy.ones((8192, 8192), dtype='i4')
b_np = numpy.ones((8192 + 2, 8192 + 2), dtype='i4')
%timeit b_np[2:, 2:]+=a_np
#75.7 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
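
For context, the left-hand side b[2:, 2:] is a non-contiguous view, so the in-place add goes through the strided code path. A quick check of that (shown with NumPy, whose strides are reported in bytes; dpctl.tensor.usm_ndarray exposes analogous shape and strides information):

import numpy

b_np = numpy.ones((8192 + 2, 8192 + 2), dtype='i4')
view = b_np[2:, 2:]
print(view.shape)                   # (8192, 8192)
print(view.strides)                 # (32776, 4): rows are still 8194 elements apart
print(view.flags['C_CONTIGUOUS'])   # False, so the in-place add cannot use the contiguous path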
npolina4 changed the title from "Performance: in-place dpctl.tensor.add with strides performance" to "Performance: in-place dpctl.tensor.add with strides" on Jul 12, 2023
@oleksandr-pavlyk (Collaborator)

This was addressed and should be closed.

oleksandr-pavlyk added a commit that referenced this issue on Aug 15, 2023:
Provides an alternative implementation of std::abs for complex types
via std::hypot which is used on Windows.
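
For reference, the idea behind that commit is that the magnitude of a complex number a + bi equals hypot(a, b), which avoids overflow in the intermediate a*a + b*b. A small illustration in Python, not the dpctl implementation itself:

import math

z = 3e200 + 4e200j
print(math.sqrt(z.real * z.real + z.imag * z.imag))   # inf: the squares overflow
print(math.hypot(z.real, z.imag))                     # 5e+200
print(abs(z))                                         # 5e+200, CPython also uses a hypot-style algorithm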
@ndgrigorian (Collaborator)

Checked the timings again. For a Xeon CPU:

In [2]: import dpctl.tensor as dpt
   ...: a = dpt.ones((8192, 8192), dtype='i4', device='cpu')
   ...: b = dpt.ones((8192 + 2, 8192 + 2), dtype='i4', device='cpu')

In [3]: q = a.sycl_queue

In [4]: %timeit b[2:, 2:] += a; q.wait()
6.59 ms ± 748 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

For an i7-1185G7:

In [2]: import dpctl.tensor as dpt
   ...: a = dpt.ones((8192, 8192), dtype='i4', device='cpu')
   ...: b = dpt.ones((8192 + 2, 8192 + 2), dtype='i4', device='cpu')

In [3]: q = a.sycl_queue

In [4]: %timeit b[2:, 2:] += a; q.wait()
72.2 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@oleksandr-pavlyk should this be closed?

@oleksandr-pavlyk (Collaborator)

I agree, the performance has improved.

I'd think a systematic way to decide whether there are any further improvements to be had is to collect the NumPy timing on the same machine, and then run the dpctl timing under taskset -c 0 ipython, taskset -c 0-1 ipython, taskset -c 0-3 ipython, taskset -c 0-7 ipython, and plain ipython, to see whether performance on the CPU device scales with the number of cores.
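
A minimal sketch of such a run (the script name bench_inplace_add.py and the repeat/number parameters are just illustrative; it would be launched as taskset -c 0 python bench_inplace_add.py, taskset -c 0-1 python bench_inplace_add.py, and so on, plus a plain python run):

# bench_inplace_add.py -- hypothetical helper for the taskset scaling experiment
import timeit

import numpy
import dpctl.tensor as dpt

n = 8192

# NumPy baseline on the same machine
a_np = numpy.ones((n, n), dtype='i4')
b_np = numpy.ones((n + 2, n + 2), dtype='i4')
t_np = min(timeit.repeat("b_np[2:, 2:] += a_np", globals=globals(), number=10, repeat=3)) / 10
print(f"numpy: {t_np * 1e3:.1f} ms")

# dpctl on the CPU device; wait on the queue so kernel execution is included in the timing
a = dpt.ones((n, n), dtype='i4', device='cpu')
b = dpt.ones((n + 2, n + 2), dtype='i4', device='cpu')
q = a.sycl_queue
t_dp = min(timeit.repeat("b[2:, 2:] += a; q.wait()", globals=globals(), number=10, repeat=3)) / 10
print(f"dpctl: {t_dp * 1e3:.1f} ms")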
