ENH: Vectorize np.partition and np.argpartition using AVX-512 #24201

Merged
merged 23 commits into from
Oct 1, 2023

r-devulap
Member

Provides a speedup of up to about 25x for 16-bit, 17x for 32-bit, and 8x for 64-bit dtypes. The benchmarks below were run on an array of size 100000 for varying values of k (10, 100 and 1000).

Benchmark for random data on Intel TGL:

       before           after         ratio
     [7d18effb]       [0c8fed1f]
     <main>           <np-partition>
-        1.15±0ms          193±1μs     0.17  bench_function_base.Partition.time_partition('float64', ('random',), 10)
-        1.15±0ms        192±0.9μs     0.17  bench_function_base.Partition.time_partition('float64', ('random',), 1000)
-     1.15±0.01ms          192±2μs     0.17  bench_function_base.Partition.time_partition('float64', ('random',), 100)
-        1.13±0ms          135±2μs     0.12  bench_function_base.Partition.time_partition('float32', ('random',), 100)
-        1.13±0ms          135±2μs     0.12  bench_function_base.Partition.time_partition('float32', ('random',), 10)
-         965±1μs          116±2μs     0.12  bench_function_base.Partition.time_partition('int64', ('random',), 100)
-     1.13±0.01ms          135±2μs     0.12  bench_function_base.Partition.time_partition('float32', ('random',), 1000)
-         970±2μs          115±1μs     0.12  bench_function_base.Partition.time_partition('int64', ('random',), 1000)
-        973±10μs        114±0.9μs     0.12  bench_function_base.Partition.time_partition('int64', ('random',), 10)
-         898±2μs       52.2±0.9μs     0.06  bench_function_base.Partition.time_partition('uint32', ('random',), 100)
-         902±6μs       52.2±0.9μs     0.06  bench_function_base.Partition.time_partition('int32', ('random',), 100)
-         904±3μs         52.3±2μs     0.06  bench_function_base.Partition.time_partition('int32', ('random',), 1000)
-         902±6μs         51.8±1μs     0.06  bench_function_base.Partition.time_partition('uint32', ('random',), 10)
-         900±4μs       51.6±0.9μs     0.06  bench_function_base.Partition.time_partition('int32', ('random',), 10)
-         906±4μs         51.9±2μs     0.06  bench_function_base.Partition.time_partition('uint32', ('random',), 1000)
-        1.02±0ms       43.5±0.2μs     0.04  bench_function_base.Partition.time_partition('float16', ('random',), 100)
-        1.02±0ms       43.3±0.2μs     0.04  bench_function_base.Partition.time_partition('float16', ('random',), 10)
-        1.04±0ms       43.4±0.2μs     0.04  bench_function_base.Partition.time_partition('float16', ('random',), 1000)
-        1.14±0ms      45.1±0.09μs     0.04  bench_function_base.Partition.time_partition('int16', ('random',), 1000)
-        1.14±0ms       45.2±0.1μs     0.04  bench_function_base.Partition.time_partition('int16', ('random',), 100)
-        1.14±0ms       45.1±0.3μs     0.04  bench_function_base.Partition.time_partition('int16', ('random',), 10)
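
As a reading aid for the table above: the ratio column is the "after" time divided by the "before" time, so the speedup is its reciprocal. The short sketch below only redoes that arithmetic with ratio values copied from the table; the dictionary labels are chosen here for illustration.

```python
# Ratio values (after / before) copied from the benchmark table above.
ratios = {
    "int16/float16": 0.04,
    "int32/uint32": 0.06,
    "float32/int64": 0.12,
    "float64": 0.17,
}

# Speedup is the reciprocal of the ratio.
speedups = {name: 1.0 / ratio for name, ratio in ratios.items()}

for name, speedup in speedups.items():
    print(f"{name}: about {speedup:.0f}x")
```

The rounded results (25x, 17x, 8x and 6x) suggest the headline figures are best read as upper bounds for each element width.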

Benchmark numbers on other kinds of array can be seen here.

@r-devulap
Member Author

Looks like all the CI failures are related to 2 tests:

  1. numpy/lib/tests/test_shape_base.py::test_argequivalent.py
  2. numpy/core/tests/test_multiarray.py::test_partition.py

where both tests expect the output of np.partition(arr, k) to match arr[np.argpartition(arr, k)]. The output of np.partition and np.argpartition is not unique, so I assume this isn't a requirement, right?
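
To illustrate why element-wise equality is stricter than the partition guarantee, here is a minimal pure-Python sketch. The helper `is_partitioned` is hypothetical (it is not part of NumPy or its test suite) and assumes the data contains no NaN values. It checks only what a partition promises: the element at index k is where a full sort would put it, nothing before it is larger, and nothing after it is smaller.

```python
def is_partitioned(values, k):
    """Return True if values satisfies the partition guarantee at index k."""
    pivot = values[k]
    return (
        pivot == sorted(values)[k]                      # k-th element is in sorted position
        and all(v <= pivot for v in values[:k])         # nothing larger before it
        and all(v >= pivot for v in values[k + 1:])     # nothing smaller after it
    )

# Two different arrangements of the same data, both valid for k = 2.
a = [1, 2, 3, 5, 4]
b = [2, 1, 3, 4, 5]

print(is_partitioned(a, 2), is_partitioned(b, 2))  # True True
print(a == b)                                      # False
```

Applying a check of this kind to both np.partition(arr, k) and arr[np.argpartition(arr, k)] would test the guarantee itself rather than requiring the two outputs to agree element by element.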

@r-devulap r-devulap changed the title ENH: Use avx512_qselect for 16, 32 and 64-bit dtype np.partition ENH: Vectorize np.partition and np.argpartition using AVX-512 Jul 24, 2023
@r-devulap
Member Author

I guess that problem solved itself once both partition and argpartition were vectorized :)

@r-devulap
Member Author

r-devulap commented Jul 24, 2023

Benchmarks for np.argpartition on a TGL laptop: About a 6x speed up for 32-bit and 64-bit dtypes.

       before           after         ratio
     [7d18b532]       [88696f5a]
     <main>           <np-partition>
-     1.10±0.01ms          208±1μs     0.19  bench_function_base.Partition.time_argpartition('int32', ('random',), 100)
-     1.11±0.01ms          207±1μs     0.19  bench_function_base.Partition.time_argpartition('int32', ('random',), 1000)
-     1.11±0.01ms          207±1μs     0.19  bench_function_base.Partition.time_argpartition('int32', ('random',), 10)
-        1.16±0ms          209±2μs     0.18  bench_function_base.Partition.time_argpartition('int64', ('random',), 100)
-        1.16±0ms          207±1μs     0.18  bench_function_base.Partition.time_argpartition('int64', ('random',), 10)
-        1.17±0ms          206±1μs     0.18  bench_function_base.Partition.time_argpartition('int64', ('random',), 1000)
-        1.34±0ms        224±0.9μs     0.17  bench_function_base.Partition.time_argpartition('float64', ('random',), 100)
-        1.34±0ms          223±2μs     0.17  bench_function_base.Partition.time_argpartition('float64', ('random',), 10)
-     1.35±0.01ms        223±0.9μs     0.16  bench_function_base.Partition.time_argpartition('float64', ('random',), 1000)
-     1.34±0.01ms          208±1μs     0.16  bench_function_base.Partition.time_argpartition('float32', ('random',), 100)
-     1.34±0.01ms          208±1μs     0.16  bench_function_base.Partition.time_argpartition('float32', ('random',), 10)
-     1.35±0.01ms          207±1μs     0.15  bench_function_base.Partition.time_argpartition('float32', ('random',), 1000)

inline bool quickselect_dispatch(T* v, npy_intp num, npy_intp kth)
{
    return false;
}
Member

Windows will never get the faster version?

Member Author

The quicksort patch ran into some problems with WIN32 builds and I never quite figured them out. Let me try running the partition patch through the CI; maybe this one has better luck on Windows.

Member Author

Looks like this one had no trouble. Windows will get the faster version too :) Might be worth checking if quicksort can be enabled, but will do it in a separate PR.

Member

It seems to work. Should we circle back to enabling AVX512 quicksort for Windows?

Member Author

Yup, separate PR?

Member

Yes, sorry, separate PR would be good.

Member Author

@r-devulap r-devulap Sep 8, 2023

AVX512 quicksort, argsort, partition and argpartition are all enabled on Windows (64-bit only, though).

@mattip
Member

mattip commented Jul 26, 2023

LGTM. I did not review the subrepo. @seiko2plus any thoughts?

@r-devulap
Member Author

Rebasing with main.

@r-devulap
Member Author

Something went wrong with the Travis CI, I think unrelated to this patch.

@charris
Member

charris commented Aug 2, 2023

I think unrelated to this patch.

Yes, unrelated. That failure happens now and then; the giveaway is "and 1 action required checks".

@r-devulap
Member Author

bah, that Travis CI failed again.

@r-devulap
Member Author

friendly ping :)

@seiko2plus seiko2plus self-requested a review September 6, 2023 16:55
@r-devulap r-devulap force-pushed the np-partition branch 2 times, most recently from 29cd42b to 97e6714 Compare September 8, 2023 20:06
@r-devulap
Member Author

Phew! Ready for review now.

Member

@seiko2plus seiko2plus left a comment

Great performance improvement, thank you! We just need to fix the float16 dispatching. Our current tests should catch it, but unfortunately none of our CI workers support AVX512 at the ICL level or above, and Intel SDE is still broken.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

kth = np.float16(0.1558), k = np.int64(29)
arr_part = array([ 0.001953,  0.007324,  0.01172 ,  0.012695,  0.0249  ,  0.0381  ,
        0.03857 ,  0.07666 ,  0.07764 ,  0.07... , -0.2622  , -0.2915  , -0.3003  ,
       -0.3047  , -0.3323  , -0.3682  , -0.4297  , -0.4526  ],
      dtype=float16)

    def assert_arr_partitioned(kth, k, arr_part):
>       assert_equal(arr_part[k], kth)
E       AssertionError: 
E       Items are not equal:
E        ACTUAL: np.float16(-0.2046)
E        DESIRED: np.float16(0.1558)

arr_part   = array([ 0.001953,  0.007324,  0.01172 ,  0.012695,  0.0249  ,  0.0381  ,
        0.03857 ,  0.07666 ,  0.07764 ,  0.07... , -0.2622  , -0.2915  , -0.3003  ,
       -0.3047  , -0.3323  , -0.3682  , -0.4297  , -0.4526  ],
      dtype=float16)
k          = np.int64(29)
kth        = np.float16(0.1558)

numpy/core/tests/test_multiarray.py:51: AssertionError
================ short test summary info ================
FAILED numpy/core/tests/test_multiarray.py::test_partition_fp[float16-N0]
FAILED numpy/core/tests/test_multiarray.py::test_partition_fp[float16-N1]
....
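
The thread does not state what the dispatch bug was. One observation from the output above is that arr_part lists positive values in ascending order followed by negative values of growing magnitude. That is the order one would get if float16 data were compared by raw 16-bit pattern as unsigned integers; this reading is an assumption made here for illustration, not something the reviewers confirm. The sketch below uses only the Python standard library, and the helper name `half_bits` is hypothetical.

```python
import struct

def half_bits(x: float) -> int:
    """Return the IEEE 754 binary16 bit pattern of x as an unsigned integer."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

values = [0.25, -0.5, 0.125, -0.25, 0.5]

by_value = sorted(values)                # correct floating-point order
by_bits = sorted(values, key=half_bits)  # order when compared as unsigned 16-bit integers

print(by_value)  # [-0.5, -0.25, 0.125, 0.25, 0.5]
print(by_bits)   # [0.125, 0.25, 0.5, -0.25, -0.5]
```

Because the sign bit is the most significant bit, every negative value has a larger bit pattern than every positive one, which would push negatives to the end of the array as in the failing test.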

@r-devulap
Member Author

@seiko2plus Fixed the bug; it now passes all the tests on a Tiger Lake machine. We should have an SDE version with the bug fix out soon, and we should run CI on TGL and SPR.

Member

@seiko2plus seiko2plus left a comment

LGTM, Thank you!

@mattip mattip merged commit ac5c664 into numpy:main Oct 1, 2023
@mattip
Member

mattip commented Oct 1, 2023

Thanks @r-devulap


5 participants