Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dedicated code to copy array to C-contig/F-contig destinations #1850

Merged
merged 7 commits into from
Sep 30, 2024

Conversation

oleksandr-pavlyk
Copy link
Collaborator

This PR adds specialized kernels to copy usm_ndarray to C-/F-contiguous destinations of the same shape and the same dtype.

It also adds dedicated kernels to copy batches of square matrices (which are views of F-contig matrices) to C-contiguous destinations, and batches of square matrices which are views of C-contig matrices to F-contiguous destinations. The intended usage is to speed-up conversion from C-contig batch of square matrices to F-contig batch of square matrices.

Tests are added.

  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • Have you added documentation for your changes, if necessary?
  • Have you added your changes to the changelog?
  • If this PR is a work in progress, are you opening the PR as a draft?

Copy link

github-actions bot commented Sep 23, 2024

Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞

Copy link

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_72 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

Copy link

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_73 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

@coveralls
Copy link
Collaborator

coveralls commented Sep 23, 2024

Coverage Status

coverage: 87.907%. remained the same
when pulling d088227 on add-as-contig-specialization
into 4d3ddf9 on master.

@oleksandr-pavlyk
Copy link
Collaborator Author

Examples:

import dpctl.tensor as dpt
x = dpt.ones((3, 10, 10), order='F');
y = dpt.empty_like(x, order='C'); 
# now uses generic kernel to copy to contiguous destination
y[:] = x  

x2 = dpt.moveaxis(dpt.ones((10, 10, 3), order='F'), 2, 0)
# Because x2 has shape (3, 10, 10), and strides (100, 1, 10)
# x2 is a batch of F-contig square matrices, and the following code uses
# faster kernel for copying
y2 = dpt.asarray(x2, order='C')

Here is demonstration on laptop with Iris Xe integrated GPU:

Python 3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:38:13) [GCC 12.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.24.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.ones((3, 1000, 1000), order='F');

In [3]: y = dpt.empty_like(x, order='C');

In [4]: %timeit y[:] = x; y.sycl_queue.wait()
2.23 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit y[:] = x; y.sycl_queue.wait()
2.24 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: x2 = dpt.moveaxis(dpt.ones((1000, 1000, 3), order='F'), 2, 0)

In [7]: y2 = dpt.empty_like(x2, order='C')

In [8]: %timeit y2[:] = x2; y2.sycl_queue.wait()
1.32 ms ± 31.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [9]: %timeit y2[:] = x2; y2.sycl_queue.wait()
1.3 ms ± 58 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [10]: x3 = dpt.ones((3, 1000, 1000), order='F', dtype="i4")

In [11]: y3 = dpt.empty_like(x3, order='C', dtype="u4")

In [12]: %timeit y3[:] = x3; y3.sycl_queue.wait()
2.31 ms ± 21 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [13]: %timeit y3[:] = x3; y3.sycl_queue.wait()
2.33 ms ± 59.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

On GPU Max the difference between timing in In[12]/In[13] (about the same as legacy timing before this PR) and In[4]/In[5] is more pronounced (25%), as well as difference between In[12]/In[13] and In[8]/In[9].

This is done more efficiently than generic copy-and-cast kernel.
It is also done yet more efficiently for the batch of square matrices.

Copy from (batch of views into C-contig matrices) to
    F-contig array of the same shape.

    src.shape = (n, n, ....)
    src.strides = (ld_src, 1, ...)

Copy from (batch of views into F-contig matrices) to
    C-contig array of the same shape

    src.shape = (..., n, n)
    src.strides = (..., 1, ld_src)
@oleksandr-pavlyk oleksandr-pavlyk force-pushed the add-as-contig-specialization branch from 005eaf7 to f4705c0 Compare September 24, 2024 16:33
Copy link

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_74 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

Copy link

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_75 ran successfully.
Passed: 895
Failed: 0
Skipped: 119

Copy link

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_76 ran successfully.
Passed: 895
Failed: 0
Skipped: 119

@oleksandr-pavlyk oleksandr-pavlyk force-pushed the add-as-contig-specialization branch from 779083c to 610d88d Compare September 27, 2024 02:17
Copy link

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_75 ran successfully.
Passed: 895
Failed: 0
Skipped: 119

Copy link

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_76 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

Copy link
Collaborator

@ndgrigorian ndgrigorian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested the branch out, I haven't run into any issues, including after running the copy tests in libtensor, no failures.

LGTM

Copy link

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_77 ran successfully.
Passed: 895
Failed: 0
Skipped: 119

@vtavana
Copy link
Collaborator

vtavana commented Sep 30, 2024

All tests for dpnp were passed using this branch

@oleksandr-pavlyk oleksandr-pavlyk merged commit 9085854 into master Sep 30, 2024
50 of 51 checks passed
@oleksandr-pavlyk oleksandr-pavlyk deleted the add-as-contig-specialization branch September 30, 2024 20:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants