ENH: Vectorize np.partition and np.argpartition using AVX-512 #24201

r-devulap · 2023-07-17T17:48:37Z

Provides a speed-up of about 25x for 16-bit dtypes, up to 17x for 32-bit dtypes, and roughly 6x to 8x for 64-bit dtypes. The benchmark numbers below are for an array of size 100000 and k values of 10, 100 and 1000.

Benchmark for random data on Intel TGL:

       before           after         ratio
     [7d18effb]       [0c8fed1f]
     <main>           <np-partition>
-        1.15±0ms          193±1μs     0.17  bench_function_base.Partition.time_partition('float64', ('random',), 10)
-        1.15±0ms        192±0.9μs     0.17  bench_function_base.Partition.time_partition('float64', ('random',), 1000)
-     1.15±0.01ms          192±2μs     0.17  bench_function_base.Partition.time_partition('float64', ('random',), 100)
-        1.13±0ms          135±2μs     0.12  bench_function_base.Partition.time_partition('float32', ('random',), 100)
-        1.13±0ms          135±2μs     0.12  bench_function_base.Partition.time_partition('float32', ('random',), 10)
-         965±1μs          116±2μs     0.12  bench_function_base.Partition.time_partition('int64', ('random',), 100)
-     1.13±0.01ms          135±2μs     0.12  bench_function_base.Partition.time_partition('float32', ('random',), 1000)
-         970±2μs          115±1μs     0.12  bench_function_base.Partition.time_partition('int64', ('random',), 1000)
-        973±10μs        114±0.9μs     0.12  bench_function_base.Partition.time_partition('int64', ('random',), 10)
-         898±2μs       52.2±0.9μs     0.06  bench_function_base.Partition.time_partition('uint32', ('random',), 100)
-         902±6μs       52.2±0.9μs     0.06  bench_function_base.Partition.time_partition('int32', ('random',), 100)
-         904±3μs         52.3±2μs     0.06  bench_function_base.Partition.time_partition('int32', ('random',), 1000)
-         902±6μs         51.8±1μs     0.06  bench_function_base.Partition.time_partition('uint32', ('random',), 10)
-         900±4μs       51.6±0.9μs     0.06  bench_function_base.Partition.time_partition('int32', ('random',), 10)
-         906±4μs         51.9±2μs     0.06  bench_function_base.Partition.time_partition('uint32', ('random',), 1000)
-        1.02±0ms       43.5±0.2μs     0.04  bench_function_base.Partition.time_partition('float16', ('random',), 100)
-        1.02±0ms       43.3±0.2μs     0.04  bench_function_base.Partition.time_partition('float16', ('random',), 10)
-        1.04±0ms       43.4±0.2μs     0.04  bench_function_base.Partition.time_partition('float16', ('random',), 1000)
-        1.14±0ms      45.1±0.09μs     0.04  bench_function_base.Partition.time_partition('int16', ('random',), 1000)
-        1.14±0ms       45.2±0.1μs     0.04  bench_function_base.Partition.time_partition('int16', ('random',), 100)
-        1.14±0ms       45.1±0.3μs     0.04  bench_function_base.Partition.time_partition('int16', ('random',), 10)
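The ratio column is the "after" time divided by the "before" time, so the speed-up factor is its reciprocal. A minimal sketch of that arithmetic, using one row per dtype copied from the table above (the helper name `speedup` is only for illustration):

```python
# (before, after) timings in microseconds, one row per dtype from the table above.
timings_us = {
    "float64": (1150.0, 193.0),
    "float32": (1130.0, 135.0),
    "int64": (965.0, 116.0),
    "int32": (902.0, 52.2),
    "float16": (1020.0, 43.5),
    "int16": (1140.0, 45.2),
}


def speedup(before_us, after_us):
    """How many times faster the 'after' run is than the 'before' run."""
    return before_us / after_us


speedups = {dtype: speedup(*pair) for dtype, pair in timings_us.items()}
for dtype, factor in speedups.items():
    # 1 / factor reproduces the ratio column (after / before).
    print(f"{dtype:>8}: ratio {1 / factor:.2f}, about {factor:.1f}x faster")
```

The unrounded timings give slightly different factors than the reciprocal of the two-digit ratio column, which is why the headline numbers are approximate.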

Benchmark numbers on other kinds of array can be seen here.

r-devulap · 2023-07-17T22:01:28Z

Looks like all the CI failures are related to 2 tests:

numpy/lib/tests/test_shape_base.py::test_argequivalent.py
numpy/core/tests/test_multiarray.py::test_partition.py

where both tests expect the output of np.partition(arr, k) to match arr[np.argpartition(arr, k)]. The outputs of np.partition and np.argpartition are not unique, so I assume this isn't a requirement, right?
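To make the non-uniqueness concrete, here is a small sketch. It checks only the documented partition property (the element at index k is in its sorted position, nothing larger comes before it, nothing smaller after it) rather than element-by-element equality. The helper `is_partitioned` is hypothetical and not part of NumPy or its test suite.

```python
import numpy as np


def is_partitioned(values, k):
    """True if values[k] is in sorted position: nothing larger before it, nothing smaller after."""
    values = np.asarray(values)
    pivot = values[k]
    return bool(np.all(values[:k] <= pivot) and np.all(values[k:] >= pivot))


rng = np.random.default_rng(0)
arr = rng.integers(0, 50, size=1000)  # small value range, so ties are common
k = 100

direct = np.partition(arr, k)
via_indices = arr[np.argpartition(arr, k)]

# Both results are valid partitions of the same data around the same k ...
assert is_partitioned(direct, k)
assert is_partitioned(via_indices, k)
assert direct[k] == via_indices[k]
# ... but the order of the elements on either side of index k is unspecified,
# so the two arrays are not guaranteed to match element by element.
```

A test written this way stays valid no matter which selection algorithm produced the output.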

r-devulap · 2023-07-24T19:44:21Z

I guess that problem solved itself when we vectorize both partition and argpartition :)

r-devulap · 2023-07-24T19:46:36Z

Benchmarks for np.argpartition on a TGL laptop: About a 6x speed up for 32-bit and 64-bit dtypes.

       before           after         ratio
     [7d18b532]       [88696f5a]
     <main>           <np-partition>
-     1.10±0.01ms          208±1μs     0.19  bench_function_base.Partition.time_argpartition('int32', ('random',), 100)
-     1.11±0.01ms          207±1μs     0.19  bench_function_base.Partition.time_argpartition('int32', ('random',), 1000)
-     1.11±0.01ms          207±1μs     0.19  bench_function_base.Partition.time_argpartition('int32', ('random',), 10)
-        1.16±0ms          209±2μs     0.18  bench_function_base.Partition.time_argpartition('int64', ('random',), 100)
-        1.16±0ms          207±1μs     0.18  bench_function_base.Partition.time_argpartition('int64', ('random',), 10)
-        1.17±0ms          206±1μs     0.18  bench_function_base.Partition.time_argpartition('int64', ('random',), 1000)
-        1.34±0ms        224±0.9μs     0.17  bench_function_base.Partition.time_argpartition('float64', ('random',), 100)
-        1.34±0ms          223±2μs     0.17  bench_function_base.Partition.time_argpartition('float64', ('random',), 10)
-     1.35±0.01ms        223±0.9μs     0.16  bench_function_base.Partition.time_argpartition('float64', ('random',), 1000)
-     1.34±0.01ms          208±1μs     0.16  bench_function_base.Partition.time_argpartition('float32', ('random',), 100)
-     1.34±0.01ms          208±1μs     0.16  bench_function_base.Partition.time_argpartition('float32', ('random',), 10)
-     1.35±0.01ms          207±1μs     0.15  bench_function_base.Partition.time_argpartition('float32', ('random',), 1000)
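For a rough local comparison without asv, something like the following can be used. It is a sketch, not the `bench_function_base.Partition` benchmark itself: the array size and k values follow the description above, while the random data, repeat counts, and the `results` dictionary are assumptions made here.

```python
import timeit

import numpy as np

rng = np.random.default_rng(42)
results = {}  # (function name, dtype, k) -> best time per call in microseconds

for dtype in ("int32", "int64", "float32", "float64"):
    # Scale before casting so the integer dtypes do not collapse to all zeros.
    arr = (rng.random(100_000) * 1e6).astype(dtype)
    for k in (10, 100, 1000):
        for func in (np.partition, np.argpartition):
            # Neither function modifies arr, so the same input is reused for every call.
            best = min(timeit.repeat(lambda: func(arr, k), repeat=3, number=10)) / 10
            results[(func.__name__, dtype, k)] = best * 1e6

for (name, dtype, k), usec in sorted(results.items()):
    print(f"{name:>12} {dtype:>8} k={k:<5} {usec:8.1f} us")
```

Absolute numbers depend on the CPU and on whether the installed NumPy build dispatches to the AVX-512 code path, so only before/after runs on the same machine are comparable.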

mattip · 2023-07-25T08:44:01Z

numpy/core/src/npysort/selection.cpp

+inline bool quickselect_dispatch(T* v, npy_intp num, npy_intp kth)
+{
+    return false;
+}


Windows will never get the faster version?

The quicksort patch ran into some problems with WIN32 builds and I never quite figured them out. Let me try running the partition patch through the CI; maybe this one has better luck on Windows.

Looks like this one had no trouble. Windows will get the faster version too :) Might be worth checking if quicksort can be enabled, but will do it in a separate PR.

It seems to work. Should we circle back to enabling AVX512 quicksort for Windows?

Yup, separate PR?

Yes, sorry, separate PR would be good.

AVX512 quicksort, argsort, partition and argpartition are all enabled on Windows (64-bit only, though).
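The stub quoted at the start of this thread returns false unconditionally. The calling code is not shown here, so the following is only a sketch of how a boolean-returning dispatch function is typically consumed: the caller tries the fast path and falls back to the portable code when it declines. It is written in Python for brevity, and all three function names are hypothetical.

```python
def quickselect_dispatch_sketch(values, kth):
    """Stand-in for a platform-specific fast path.

    Returns True if it partitioned `values` in place, or False when no
    accelerated implementation is available (like the stub above).
    """
    return False


def scalar_partition_sketch(values, kth):
    """Portable fallback; a full sort trivially satisfies the partition contract."""
    values.sort()


def partition_sketch(values, kth):
    """Try the fast path first and fall back to the scalar code when it declines."""
    if quickselect_dispatch_sketch(values, kth):
        return "simd"
    scalar_partition_sketch(values, kth)
    return "scalar"


data = [7, 2, 9, 4, 1, 8]
path = partition_sketch(data, 2)
print(path, data)
```

With this shape, a platform that only has the stub still produces correct results; it just never takes the faster route.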

mattip · 2023-07-26T17:00:18Z

LGTM. I did not review the subrepo. @seiko2plus any thoughts?

r-devulap · 2023-08-01T05:22:40Z

Rebasing with main.

r-devulap · 2023-08-02T20:14:04Z

Something went wrong with the Travis CI, I think unrelated to this patch.

charris · 2023-08-02T20:38:43Z

> I think unrelated to this patch.

Yes, unrelated. That failure happens now and then; the giveaway is "and 1 action required checks".

r-devulap · 2023-08-08T22:51:52Z

bah, that Travis CI failed again.

r-devulap · 2023-09-05T17:56:32Z

friendly ping :)

r-devulap · 2023-09-08T21:49:19Z

Phew! Ready for review now.

seiko2plus

Great performance improvement, thank you! We just need to fix the float16 dispatching. Our current tests should catch it, but unfortunately none of our CI workers supports at least AVX512/ICL, and Intel SDE is still broken.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

kth = np.float16(0.1558), k = np.int64(29)
arr_part = array([ 0.001953,  0.007324,  0.01172 ,  0.012695,  0.0249  ,  0.0381  ,
        0.03857 ,  0.07666 ,  0.07764 ,  0.07... , -0.2622  , -0.2915  , -0.3003  ,
       -0.3047  , -0.3323  , -0.3682  , -0.4297  , -0.4526  ],
      dtype=float16)

    def assert_arr_partitioned(kth, k, arr_part):
>       assert_equal(arr_part[k], kth)
E       AssertionError: 
E       Items are not equal:
E        ACTUAL: np.float16(-0.2046)
E        DESIRED: np.float16(0.1558)

arr_part   = array([ 0.001953,  0.007324,  0.01172 ,  0.012695,  0.0249  ,  0.0381  ,
        0.03857 ,  0.07666 ,  0.07764 ,  0.07... , -0.2622  , -0.2915  , -0.3003  ,
       -0.3047  , -0.3323  , -0.3682  , -0.4297  , -0.4526  ],
      dtype=float16)
k          = np.int64(29)
kth        = np.float16(0.1558)

numpy/core/tests/test_multiarray.py:51: AssertionError
================ short test summary info ================
FAILED numpy/core/tests/test_multiarray.py::test_partition_fp[float16-N0]
FAILED numpy/core/tests/test_multiarray.py::test_partition_fp[float16-N1]
....
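The log shows only the first assertion of `assert_arr_partitioned`. A fuller check of the partition invariant might look like the sketch below. The helper `check_partitioned` is a hypothetical stand-in, not the function in `test_multiarray.py`, and the random input is an assumption.

```python
import numpy as np


def check_partitioned(expected_kth, k, arr_part):
    """Raise AssertionError unless arr_part is a valid partition around index k."""
    arr_part = np.asarray(arr_part)
    # The element at index k must be the one a full sort would put there.
    assert arr_part[k] == expected_kth, (
        f"arr_part[{k}] = {arr_part[k]!r}, expected {expected_kth!r}"
    )
    # Nothing larger before index k, nothing smaller from index k onwards.
    assert np.all(arr_part[:k] <= expected_kth)
    assert np.all(arr_part[k:] >= expected_kth)


rng = np.random.default_rng(1)
arr = rng.uniform(-1.0, 1.0, size=64).astype(np.float16)
k = 29
expected = np.sort(arr)[k]  # reference value from a full sort
check_partitioned(expected, k, np.partition(arr, k))
```

Because the check compares against a full sort, it catches failures like the one above, where the value at index 29 is not the 30th smallest element.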

r-devulap · 2023-09-19T21:32:37Z

@seiko2plus Fixed the bug, and it now passes all the tests on a Tiger Lake machine. We should have an SDE version with the bug fix out soon, and we should run CI on TGL and SPR.

seiko2plus

LGTM, Thank you!

mattip · 2023-10-01T13:23:50Z

Thanks @r-devulap

github-actions bot added the 01 - Enhancement label Jul 17, 2023

r-devulap changed the title ~~ENH: Use avx512_qselect for 16, 32 and 64-bit dtype np.partition~~ ENH: Vectorize np.partition and np.argpartition using AVX-512 Jul 24, 2023

mattip reviewed Jul 25, 2023

numpy/core/include/numpy/ndarraytypes.h (outdated)

mattip reviewed Jul 25, 2023

r-devulap force-pushed the np-partition branch from f51cb77 to c976da6 Compare August 1, 2023 05:21

seiko2plus reviewed Aug 2, 2023

numpy/core/src/npysort/selection.cpp (outdated)

r-devulap force-pushed the np-partition branch from 0029799 to 1cd3d8b Compare September 5, 2023 17:56

r-devulap force-pushed the np-partition branch from 89f58fc to eefc2b6 Compare September 6, 2023 05:35

seiko2plus self-requested a review September 6, 2023 16:55

r-devulap force-pushed the np-partition branch 2 times, most recently from 29cd42b to 97e6714 Compare September 8, 2023 20:06

seiko2plus requested changes Sep 14, 2023

numpy/core/src/npysort/selection.cpp (outdated)

numpy/core/src/npysort/selection.cpp

numpy/core/src/npysort/simd_qsort.dispatch.cpp

r-devulap added 8 commits September 18, 2023 10:35

Update submodule to latest

dddc046

Add benchmarks for np.partition

489acbb

ENH: Use avx512_qselect for np.partition

4a9ac25

TST: Add tests for np.partition

12ae67d

BENCH: Add benchmarks for argpartition

234754d

Update x86-simd-sort submodule to latest

f3ac24b

get_partition_func does not need an extra argument

d8643a4

ENH: Use avx512_argselect for np.argpartition

0c86047

r-devulap added 15 commits September 18, 2023 10:35

TST: add tests for np.argpartition

1a3caed

Enable AVX-512 partition and argpartition on Windows

8e58998

Move PyArray_PartitionFunc and PyArray_ArgPartitionFunc out of ndarraytypes.h

1999828

MAINT: Prevent re-definition of quickselect dispatch on ARM 32-bit

c0bdd18

Avoid using template specializations for quickselect and argquickselect dispatch

ae33785

Update submodule to latest

cd9df94

Update x86-simd-sort submodule to latest

25a2efb

Update x86-simd-sort submodule to latest

29a9267

ifdef remove avx512 instantiations for CYGWIN

f3f6111

Enable quicksort on WIN32

fc5b215

BUG: Fix compile error for avx512_qselect

9d986f7

BUG: Fix qselect function declaration argument

fcb6249

Disable avx512_qselect on 32-bit systems

6c1b8c7

Add may vary to output of np.partition and np.argpartition in docs

cb176c1

Use np::Half instead of np_tag::npy_half::type

6cd06c0

r-devulap force-pushed the np-partition branch from 1087f45 to 6cd06c0 Compare September 19, 2023 21:25

seiko2plus approved these changes Sep 21, 2023

mattip merged commit ac5c664 into numpy:main Oct 1, 2023

Mousius mentioned this pull request Oct 2, 2023

ENH: Use Highway's VQSort on AArch64 #24018

Merged

glemaitre mentioned this pull request Oct 11, 2023

TST make sure to not have ties in sparse callable NN test scikit-learn/scikit-learn#27567

Merged
