Updating Vector<T> to support opt-in 512-bit widths #97460

tannergooding · 2024-01-24T17:54:53Z

No description provided.

ghost · 2024-01-24T17:55:06Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author:	tannergooding
Assignees:	tannergooding
Labels:	`area-CodeGen-coreclr`
Milestone:	-

stephentoub · 2024-01-24T18:03:28Z

Do we have any places in our own use of Vector<T> where we take a dependency on it not being larger than 256? I know we have one here:

runtime/src/libraries/System.Linq/src/System/Linq/Range.SpeedOpt.cs

Lines 45 to 48 in 95bf3d9

    
        Vector<int>.Count <= 8 &&
        destination.Length >= Vector<int>.Count)
    {
        Vector<int> init = new Vector<int>((ReadOnlySpan<int>)[0, 1, 2, 3, 4, 5, 6, 7]);

but we explicitly guard it on Count because we wrote it knowing this might be coming (though we could also update that now to construct the Vector with a larger sequence so it lights up on 512.) I'm wondering if we might have any others that would need to be tweaked.

tannergooding · 2024-01-24T18:08:50Z

> but we explicitly guard it on Count because we wrote it knowing this might be coming (though we could also update that now to construct the Vector with a larger sequence so it lights up on 512.) I'm wondering if we might have any others that would need to be tweaked.

I know we have a few places throughout the BCL which guard against it being larger than 256 bits and will use an alternative path if it is. Those will definitely need to be found and updated where relevant, but shouldn't be blocked on this PR (especially since it's opt-in).

Some of the patterns like new Vector<int>((ReadOnlySpan<int>)[0, 1, 2, 3, 4, 5, 6, 7]); are ones I'd like to "solve" by exposing new Create APIs. Something like CreateSequence(int start, int step) (but with better naming). That should help avoid creating unnecessarily large RVA statics and make the code more portable.
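The idea behind a `CreateSequence(int start, int step)`-style API can be modelled independently of the vector width. The sketch below is written in C++ (the language of the JIT sources touched by this PR) purely as an illustration; `CreateSequenceModel` and its lane-count template parameter are hypothetical names, not the proposed .NET API. It shows that computing `start + i * step` per lane works for any lane count, whereas a hard-coded eight-element table only covers widths up to 256 bits for `int`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical model of a width-agnostic sequence constructor.
// N plays the role of Vector<T>.Count: 4, 8, or 16 for 32-bit lanes.
template <typename T, std::size_t N>
std::array<T, N> CreateSequenceModel(T start, T step)
{
    std::array<T, N> lanes{};
    for (std::size_t i = 0; i < N; i++)
    {
        // Each lane is derived from its index, so no fixed-size
        // constant table is needed for any particular width.
        lanes[i] = start + static_cast<T>(i) * step;
    }
    return lanes;
}
```

With this shape, a caller that previously guarded on a lane count of at most 8 would not need a width guard for the initial vector, since the same call covers the 16-lane case.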

kunalspathak

LGTM. Thanks. The arm changes will come in handy for SVE work, right?

kunalspathak · 2024-01-24T19:02:17Z

src/coreclr/jit/simdashwintrinsiclistarm64.h

@@ -202,99 +202,99 @@ SIMD_AS_HWINTRINSIC_ID(Vector4,     WithElement,
 //                                                                                                     {TYP_BYTE,                                      TYP_UBYTE,                                      TYP_SHORT,                                      TYP_USHORT,                                     TYP_INT,                                        TYP_UINT,                                       TYP_LONG,                                       TYP_ULONG,                                      TYP_FLOAT,                                      TYP_DOUBLE}
 // *************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
 //  Vector<T> Intrinsics
-SIMD_AS_HWINTRINSIC_ID(VectorT128,  Abs,                                                    1,         {NI_VectorT128_Abs,                             NI_VectorT128_Abs,                              NI_VectorT128_Abs,                              NI_VectorT128_Abs,                              NI_VectorT128_Abs,                              NI_VectorT128_Abs,                              NI_VectorT128_Abs,                              NI_VectorT128_Abs,                              NI_VectorT128_Abs,                              NI_VectorT128_Abs},                             SimdAsHWIntrinsicFlag::None)


The changes in this file and simdashwintrinsiclistxarch.h are mainly renaming VectorT128 to VectorT?

tannergooding · 2024-01-24

Right, that's the only change here, which should simplify adding future sizes to Vector<T> as we no longer have to duplicate table entries per size.
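To illustrate why a size-agnostic table is easier to extend, here is a minimal sketch under stated assumptions. Every name in it (`VectorTOp`, `VectorTEntry`, `kVectorTTable`, `IsSupportedWidth`, `LookupVectorT`) is hypothetical and much simpler than the real `SIMD_AS_HWINTRINSIC_ID` tables; it only shows the shape of the idea, where each operation appears once and the width is resolved at lookup time.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified model of a size-agnostic intrinsic table.
// None of these names come from the JIT sources.
enum class VectorTOp : std::size_t { Abs, Sum, Count };

struct VectorTEntry
{
    const char* name;    // shared by every supported width
    int         numArgs; // shared by every supported width
};

// One row per operation. A per-size table would instead need one row
// per (operation, width) pair.
static const VectorTEntry kVectorTTable[] = {
    {"Abs", 1},
    {"Sum", 1},
};

static_assert(sizeof(kVectorTTable) / sizeof(kVectorTTable[0]) ==
                  static_cast<std::size_t>(VectorTOp::Count),
              "table must have one row per operation");

// The width is checked at lookup time instead of being baked into the
// table key, so supporting another width only touches this predicate.
inline bool IsSupportedWidth(unsigned simdSizeBytes)
{
    return simdSizeBytes == 16 || simdSizeBytes == 32 || simdSizeBytes == 64;
}

// Returns the shared entry for a supported width, or nullptr otherwise.
inline const VectorTEntry* LookupVectorT(VectorTOp op, unsigned simdSizeBytes)
{
    if (op == VectorTOp::Count || !IsSupportedWidth(simdSizeBytes))
    {
        return nullptr;
    }
    return &kVectorTTable[static_cast<std::size_t>(op)];
}
```

In this model, a 16-byte and a 64-byte lookup for the same operation return the same row, which is the duplication the rename avoids.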

kunalspathak · 2024-01-24T19:08:53Z

src/coreclr/jit/simdashwintrinsic.cpp

@@ -682,30 +641,11 @@ GenTree* Compiler::impSimdAsHWIntrinsicSpecial(NamedIntrinsic       intrinsic,
            break;
        }

-        case NI_VectorT128_Sum:
-        {
-            if (varTypeIsFloating(simdBaseType))


I see that the case for VectorT_Sum is moved below. Wondering: do we no longer need the InstructionSet_* checks?

tannergooding · 2024-01-24

The implementation was updated to no longer use the horizontal operations; the replacement has better performance on modern hardware and was needed for Vector512_Sum anyway.
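One common way to sum a vector without horizontal-add instructions is to repeatedly fold the upper half of the lanes onto the lower half with an ordinary element-wise add. The scalar model below sketches that technique; it is an assumption about the general approach, not the JIT's actual code, and `SumByHalving` is a hypothetical name. The same loop works unchanged for 4, 8, or 16 lanes, which hints at why a width-independent formulation is convenient when a 512-bit case is added.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Scalar model of a halving reduction: log2(N) steps, each of which is a
// plain element-wise add of the upper half onto the lower half.
// It emits no SIMD instructions; it only demonstrates the data flow.
template <typename T, std::size_t N>
T SumByHalving(std::array<T, N> lanes)
{
    static_assert(N != 0 && (N & (N - 1)) == 0,
                  "lane count must be a power of two");

    for (std::size_t width = N / 2; width >= 1; width /= 2)
    {
        for (std::size_t i = 0; i < width; i++)
        {
            lanes[i] += lanes[i + width];
        }
    }
    return lanes[0];
}
```

For floating-point lanes, this adds elements in a different order than a left-to-right scalar loop, so results can differ in the last bits.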

ryujit-bot · 2024-01-24T19:36:47Z

Diff results for #97460

Throughput diffs

Throughput diffs for linux/x64 ran on linux/x64

Overall (-0.05% to -0.00%)

Collection	PDIFF
realworld.run.linux.x64.checked.mch	-0.05%

FullOpts (-0.05% to -0.00%)

Collection	PDIFF
realworld.run.linux.x64.checked.mch	-0.05%

Throughput diffs for linux/x64 ran on windows/x64

Overall (-0.05% to +0.02%)

Collection	PDIFF
benchmarks.run.linux.x64.checked.mch	+0.02%
benchmarks.run_pgo.linux.x64.checked.mch	+0.02%
benchmarks.run_tiered.linux.x64.checked.mch	+0.01%
libraries.crossgen2.linux.x64.checked.mch	+0.02%
libraries.pmi.linux.x64.checked.mch	+0.02%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	+0.02%
realworld.run.linux.x64.checked.mch	-0.05%
smoke_tests.nativeaot.linux.x64.checked.mch	+0.02%

MinOpts (+0.01% to +0.03%)

Collection	PDIFF
benchmarks.run.linux.x64.checked.mch	+0.01%
benchmarks.run_pgo.linux.x64.checked.mch	+0.01%
benchmarks.run_tiered.linux.x64.checked.mch	+0.01%
libraries.crossgen2.linux.x64.checked.mch	+0.03%
libraries.pmi.linux.x64.checked.mch	+0.01%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	+0.01%
realworld.run.linux.x64.checked.mch	+0.02%
smoke_tests.nativeaot.linux.x64.checked.mch	+0.02%

FullOpts (-0.05% to +0.02%)

Collection	PDIFF
benchmarks.run.linux.x64.checked.mch	+0.02%
benchmarks.run_pgo.linux.x64.checked.mch	+0.02%
benchmarks.run_tiered.linux.x64.checked.mch	+0.02%
libraries.crossgen2.linux.x64.checked.mch	+0.02%
libraries.pmi.linux.x64.checked.mch	+0.02%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	+0.02%
realworld.run.linux.x64.checked.mch	-0.05%
smoke_tests.nativeaot.linux.x64.checked.mch	+0.02%

ryujit-bot · 2024-01-24T21:10:18Z

Diff results for #97460

Assembly diffs

Assembly diffs for linux/x64 ran on windows/x64

Diffs are based on 1,915,317 contexts (623,081 MinOpts, 1,292,236 FullOpts).

MISSED contexts: base: 0 (0.00%), diff: 174 (0.01%)

Overall (-166 bytes)

Collection	Base size (bytes)	Diff size (bytes)
realworld.run.linux.x64.checked.mch	13,145,429	-166

FullOpts (-166 bytes)

Collection	Base size (bytes)	Diff size (bytes)
realworld.run.linux.x64.checked.mch	12,758,519	-166

Throughput diffs

Throughput diffs for linux/x64 ran on linux/x64

Overall (-0.05% to -0.00%)

Collection	PDIFF
realworld.run.linux.x64.checked.mch	-0.05%

FullOpts (-0.05% to +0.00%)

Collection	PDIFF
realworld.run.linux.x64.checked.mch	-0.05%

ryujit-bot · 2024-01-25T00:16:13Z

Diff results for #97460

Assembly diffs

Assembly diffs for linux/x64 ran on windows/x64

Diffs are based on 2,512,082 contexts (977,766 MinOpts, 1,534,316 FullOpts).

MISSED contexts: base: 0 (0.00%), diff: 180 (0.01%)

Overall (-166 bytes)

Collection	Base size (bytes)	Diff size (bytes)
realworld.run.linux.x64.checked.mch	13,145,429	-166

FullOpts (-166 bytes)

Collection	Base size (bytes)	Diff size (bytes)
realworld.run.linux.x64.checked.mch	12,758,519	-166

Assembly diffs for windows/x64 ran on windows/x64

Diffs are based on 2,373,018 contexts (928,740 MinOpts, 1,444,278 FullOpts).

MISSED contexts: base: 0 (0.00%), diff: 183 (0.01%)

Overall (+407 bytes)

Collection	Base size (bytes)	Diff size (bytes)
realworld.run.windows.x64.checked.mch	14,193,402	+407

FullOpts (+407 bytes)

Collection	Base size (bytes)	Diff size (bytes)
realworld.run.windows.x64.checked.mch	13,803,697	+407

Assembly diffs for windows/x86 ran on windows/x86

Diffs are based on 2,298,941 contexts (840,452 MinOpts, 1,458,489 FullOpts).

MISSED contexts: base: 7 (0.00%), diff: 187 (0.01%)

Overall (-50 bytes)

Collection	Base size (bytes)	Diff size (bytes)
realworld.run.windows.x86.checked.mch	11,368,246	-50

FullOpts (-50 bytes)

Collection	Base size (bytes)	Diff size (bytes)
realworld.run.windows.x86.checked.mch	11,072,546	-50

Throughput diffs

Throughput diffs for linux/x64 ran on windows/x64

Overall (-0.05% to +0.02%)

Collection	PDIFF
benchmarks.run.linux.x64.checked.mch	+0.02%
benchmarks.run_pgo.linux.x64.checked.mch	+0.02%
benchmarks.run_tiered.linux.x64.checked.mch	+0.01%
coreclr_tests.run.linux.x64.checked.mch	+0.02%
libraries.crossgen2.linux.x64.checked.mch	+0.02%
libraries.pmi.linux.x64.checked.mch	+0.02%
libraries_tests.run.linux.x64.Release.mch	+0.02%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	+0.02%
realworld.run.linux.x64.checked.mch	-0.05%
smoke_tests.nativeaot.linux.x64.checked.mch	+0.02%

MinOpts (+0.01% to +0.03%)

Collection	PDIFF
benchmarks.run.linux.x64.checked.mch	+0.01%
benchmarks.run_pgo.linux.x64.checked.mch	+0.01%
benchmarks.run_tiered.linux.x64.checked.mch	+0.01%
coreclr_tests.run.linux.x64.checked.mch	+0.01%
libraries.crossgen2.linux.x64.checked.mch	+0.03%
libraries.pmi.linux.x64.checked.mch	+0.01%
libraries_tests.run.linux.x64.Release.mch	+0.01%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	+0.01%
realworld.run.linux.x64.checked.mch	+0.02%
smoke_tests.nativeaot.linux.x64.checked.mch	+0.02%

FullOpts (-0.05% to +0.02%)

Collection	PDIFF
benchmarks.run.linux.x64.checked.mch	+0.02%
benchmarks.run_pgo.linux.x64.checked.mch	+0.02%
benchmarks.run_tiered.linux.x64.checked.mch	+0.02%
coreclr_tests.run.linux.x64.checked.mch	+0.02%
libraries.crossgen2.linux.x64.checked.mch	+0.02%
libraries.pmi.linux.x64.checked.mch	+0.02%
libraries_tests.run.linux.x64.Release.mch	+0.02%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	+0.02%
realworld.run.linux.x64.checked.mch	-0.05%
smoke_tests.nativeaot.linux.x64.checked.mch	+0.02%

Throughput diffs for windows/x64 ran on windows/x64

Overall (-0.05% to +0.02%)

Collection	PDIFF
benchmarks.run.windows.x64.checked.mch	+0.02%
benchmarks.run_pgo.windows.x64.checked.mch	+0.02%
benchmarks.run_tiered.windows.x64.checked.mch	+0.02%
coreclr_tests.run.windows.x64.checked.mch	+0.02%
libraries.crossgen2.windows.x64.checked.mch	+0.02%
libraries.pmi.windows.x64.checked.mch	+0.02%
libraries_tests.run.windows.x64.Release.mch	+0.02%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch	+0.02%
realworld.run.windows.x64.checked.mch	-0.05%
smoke_tests.nativeaot.windows.x64.checked.mch	+0.02%

MinOpts (+0.01% to +0.03%)

Collection	PDIFF
benchmarks.run.windows.x64.checked.mch	+0.01%
benchmarks.run_pgo.windows.x64.checked.mch	+0.01%
benchmarks.run_tiered.windows.x64.checked.mch	+0.01%
coreclr_tests.run.windows.x64.checked.mch	+0.01%
libraries.crossgen2.windows.x64.checked.mch	+0.03%
libraries.pmi.windows.x64.checked.mch	+0.01%
libraries_tests.run.windows.x64.Release.mch	+0.01%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch	+0.01%
realworld.run.windows.x64.checked.mch	+0.01%
smoke_tests.nativeaot.windows.x64.checked.mch	+0.02%

FullOpts (-0.05% to +0.02%)

Collection	PDIFF
benchmarks.run.windows.x64.checked.mch	+0.02%
benchmarks.run_pgo.windows.x64.checked.mch	+0.02%
benchmarks.run_tiered.windows.x64.checked.mch	+0.02%
coreclr_tests.run.windows.x64.checked.mch	+0.02%
libraries.crossgen2.windows.x64.checked.mch	+0.02%
libraries.pmi.windows.x64.checked.mch	+0.02%
libraries_tests.run.windows.x64.Release.mch	+0.02%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch	+0.02%
realworld.run.windows.x64.checked.mch	-0.05%
smoke_tests.nativeaot.windows.x64.checked.mch	+0.02%

Throughput diffs for windows/x86 ran on windows/x86

Overall (-0.05% to +0.00%)

Collection	PDIFF
realworld.run.windows.x86.checked.mch	-0.05%

FullOpts (-0.05% to +0.00%)

Collection	PDIFF
realworld.run.windows.x86.checked.mch	-0.05%

Updating Vector<T> to support opt-in 512-bit widths

a5737fa

dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Jan 24, 2024

ghost assigned tannergooding Jan 24, 2024

Apply formatting patch

5ad47f0

stephentoub mentioned this pull request Jan 24, 2024

Update Enumerable.Range to support Vector<int>.Count <= 16 #97464

Merged

kunalspathak approved these changes Jan 24, 2024

tannergooding merged commit be6c9f6 into dotnet:main Jan 24, 2024
139 checks passed

tannergooding deleted the vectort-512 branch January 24, 2024 23:28

EgorBo mentioned this pull request Jan 30, 2024

[Perf] Windows/x64: 13 Regressions on 1/25/2024 1:30:33 AM #97714

Closed

github-actions bot locked and limited conversation to collaborators Feb 24, 2024
