Dedicated code to copy array to C-contig/F-contig destinations #1850

oleksandr-pavlyk · 2024-09-23T19:19:11Z

This PR adds specialized kernels to copy a usm_ndarray to a C-/F-contiguous destination of the same shape and the same dtype.

It also adds dedicated kernels to copy batches of square matrices that are views of F-contig matrices to C-contiguous destinations, and batches of square matrices that are views of C-contig matrices to F-contiguous destinations. The intended use is to speed up conversion of a C-contig batch of square matrices to an F-contig batch of square matrices.

Tests are added.

Have you provided a meaningful PR description?
Have you added a test, reproducer or referred to an issue with a reproducer?
Have you tested your changes locally for CPU and GPU devices?
Have you made sure that new changes do not introduce compiler warnings?
Have you checked performance impact of proposed changes?
Have you added documentation for your changes, if necessary?
Have you added your changes to the changelog?
If this PR is a work in progress, are you opening the PR as a draft?

github-actions · 2024-09-23T19:59:39Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_72 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

github-actions · 2024-09-23T20:00:35Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_73 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

coveralls · 2024-09-23T20:02:43Z

coverage: 87.907%. remained the same
when pulling d088227 on add-as-contig-specialization
into 4d3ddf9 on master.

oleksandr-pavlyk · 2024-09-24T14:06:32Z

Examples:

import dpctl.tensor as dpt
x = dpt.ones((3, 10, 10), order='F')
y = dpt.empty_like(x, order='C')
# now uses generic kernel to copy to contiguous destination
y[:] = x

x2 = dpt.moveaxis(dpt.ones((10, 10, 3), order='F'), 2, 0)
# Because x2 has shape (3, 10, 10) and strides (100, 1, 10),
# it is a batch of F-contig square matrices, so the following code
# uses the faster kernel for copying
y2 = dpt.asarray(x2, order='C')
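
To make the stride claim in the comments concrete, here is a minimal pure-Python sketch that reproduces the shape and strides of x2 without needing a device. The helpers `f_contig_strides` and `moveaxis_meta` are hypothetical names introduced for illustration, not dpctl functions, and strides are counted in elements, as in the comment above.

```python
def f_contig_strides(shape):
    """Element strides of an F-contiguous array with the given shape."""
    strides, step = [], 1
    for extent in shape:
        strides.append(step)
        step *= extent
    return tuple(strides)


def moveaxis_meta(shape, strides, source, destination):
    """Apply a single-axis moveaxis to (shape, strides) metadata only.

    Hypothetical helper; handles non-negative axis indices only.
    """
    order = [ax for ax in range(len(shape)) if ax != source]
    order.insert(destination, source)
    new_shape = tuple(shape[ax] for ax in order)
    new_strides = tuple(strides[ax] for ax in order)
    return new_shape, new_strides


base_shape = (10, 10, 3)
base_strides = f_contig_strides(base_shape)  # (1, 10, 100)
x2_shape, x2_strides = moveaxis_meta(base_shape, base_strides, 2, 0)
print(x2_shape, x2_strides)  # (3, 10, 10) (100, 1, 10)
```

The unit stride sits on the second-to-last axis and the last axis has stride 10, so each of the three 10-by-10 slices is laid out column-major.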

Here is a demonstration on a laptop with an Iris Xe integrated GPU:

Python 3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:38:13) [GCC 12.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.24.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.ones((3, 1000, 1000), order='F');

In [3]: y = dpt.empty_like(x, order='C');

In [4]: %timeit y[:] = x; y.sycl_queue.wait()
2.23 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit y[:] = x; y.sycl_queue.wait()
2.24 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: x2 = dpt.moveaxis(dpt.ones((1000, 1000, 3), order='F'), 2, 0)

In [7]: y2 = dpt.empty_like(x2, order='C')

In [8]: %timeit y2[:] = x2; y2.sycl_queue.wait()
1.32 ms ± 31.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [9]: %timeit y2[:] = x2; y2.sycl_queue.wait()
1.3 ms ± 58 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [10]: x3 = dpt.ones((3, 1000, 1000), order='F', dtype="i4")

In [11]: y3 = dpt.empty_like(x3, order='C', dtype="u4")

In [12]: %timeit y3[:] = x3; y3.sycl_queue.wait()
2.31 ms ± 21 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [13]: %timeit y3[:] = x3; y3.sycl_queue.wait()
2.33 ms ± 59.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
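
For reference, the relative differences implied by the laptop timings above can be worked out directly. This is only arithmetic on the reported means; averaging the two repeated runs of each case is my choice, not part of the measurement, and the dictionary keys are informal labels for the three cases.

```python
# Mean timings in milliseconds, copied from the session above.
timings_ms = {
    "contig_dest_same_dtype": (2.23, 2.24),  # In [4], In [5]
    "square_batch": (1.32, 1.30),            # In [8], In [9]
    "copy_and_cast": (2.31, 2.33),           # In [12], In [13]
}

# Average the two repeated runs of each case.
avg = {name: sum(runs) / len(runs) for name, runs in timings_ms.items()}


def ratio(slower, faster):
    """How many times longer the `slower` case takes than the `faster` one."""
    return avg[slower] / avg[faster]


print(f"{ratio('copy_and_cast', 'contig_dest_same_dtype'):.2f}")  # 1.04
print(f"{ratio('copy_and_cast', 'square_batch'):.2f}")            # 1.77
print(f"{ratio('contig_dest_same_dtype', 'square_batch'):.2f}")   # 1.71
```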

On GPU Max, the difference between the timings in In[12]/In[13] (about the same as the legacy timing before this PR) and In[4]/In[5] is more pronounced (25%), as is the difference between In[12]/In[13] and In[8]/In[9].

Copying to a contiguous destination of the same shape and dtype is done more efficiently than with the generic copy-and-cast kernel. It is done more efficiently still for a batch of square matrices:

Copy from a batch of views into C-contig matrices to an F-contig array of the same shape:
src.shape = (n, n, ...), src.strides = (ld_src, 1, ...)

Copy from a batch of views into F-contig matrices to a C-contig array of the same shape:
src.shape = (..., n, n), src.strides = (..., 1, ld_src)
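
The two stride patterns can be recognized from shape and strides alone. Below is a rough sketch of such a check in pure Python; `classify_square_batch` and its return labels are hypothetical, the requirement `ld_src >= n` is my assumption (so that a view does not overlap itself), and none of this is meant to mirror the actual dispatch logic in copy_as_contig.cpp. Strides are in elements.

```python
def classify_square_batch(shape, strides):
    """Return which of the two source patterns (if any) is matched.

    Hypothetical helper for illustration only; strides are in elements.
    """
    if len(shape) < 2 or len(shape) != len(strides):
        return None

    # Views into C-contig matrices:
    # shape (n, n, ...), strides (ld_src, 1, ...)
    if shape[0] == shape[1] and strides[1] == 1 and strides[0] >= shape[0]:
        return "c_contig_views"

    # Views into F-contig matrices:
    # shape (..., n, n), strides (..., 1, ld_src)
    if shape[-2] == shape[-1] and strides[-2] == 1 and strides[-1] >= shape[-1]:
        return "f_contig_views"

    return None


# x2 from the example: shape (3, 10, 10), strides (100, 1, 10)
print(classify_square_batch((3, 10, 10), (100, 1, 10)))  # f_contig_views
# x from the example (plain F-contig): strides (1, 3, 30), no square pattern
print(classify_square_batch((3, 10, 10), (1, 3, 30)))    # None
```

An array that matches neither pattern, such as the plain F-contiguous x above, would presumably still take the contiguous-destination path when the dtypes agree.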

github-actions · 2024-09-24T17:16:17Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_74 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

dpctl/tensor/libtensor/source/copy_as_contig.cpp

github-actions · 2024-09-26T22:58:37Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_75 ran successfully.
Passed: 895
Failed: 0
Skipped: 119

github-actions · 2024-09-27T02:03:27Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_76 ran successfully.
Passed: 895
Failed: 0
Skipped: 119

github-actions · 2024-09-27T02:56:22Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_75 ran successfully.
Passed: 895
Failed: 0
Skipped: 119

github-actions · 2024-09-27T02:56:43Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_76 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

ndgrigorian

I've tested the branch out and haven't run into any issues. I also ran the copy tests in libtensor, with no failures.

LGTM

github-actions · 2024-09-27T18:15:55Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_77 ran successfully.
Passed: 895
Failed: 0
Skipped: 119

vtavana · 2024-09-30T20:12:26Z

All dpnp tests passed using this branch.

oleksandr-pavlyk requested a review from ndgrigorian as a code owner September 23, 2024 19:19

oleksandr-pavlyk added a commit that referenced this pull request Sep 23, 2024

Add gh-1850 to change-log

005eaf7

oleksandr-pavlyk requested review from antonwolfy and vlad-perevezentsev September 23, 2024 19:21

oleksandr-pavlyk added 4 commits September 24, 2024 11:32

Small tweaks to copy_numpy_ndarray_into_usm_ndarray validation code

2ba9829

Tests to exercise specialized code-paths for as_c_contig/as_f_contig

69e17be

Add gh-1850 to change-log

f4705c0

oleksandr-pavlyk force-pushed the add-as-contig-specialization branch from 005eaf7 to f4705c0 Compare September 24, 2024 16:33