Updating Vector<T> to support opt-in 512-bit widths #97460

Merged
merged 2 commits into from
Jan 24, 2024

Conversation

tannergooding
Member

No description provided.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 24, 2024
@ghost ghost assigned tannergooding Jan 24, 2024
@ghost

ghost commented Jan 24, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author: tannergooding
Assignees: tannergooding
Labels:

area-CodeGen-coreclr

Milestone: -

@stephentoub
Member

Do we have any places in our own use of Vector<T> where we take a dependency on it not being larger than 256? I know we have one here:

Vector<int>.Count <= 8 &&
destination.Length >= Vector<int>.Count)
{
Vector<int> init = new Vector<int>((ReadOnlySpan<int>)[0, 1, 2, 3, 4, 5, 6, 7]);

but we explicitly guard it on Count because we wrote it knowing this might be coming (though we could also update that now to construct the Vector with a larger sequence so it lights up on 512.) I'm wondering if we might have any others that would need to be tweaked.
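A minimal sketch of the "larger sequence" idea, assuming a hypothetical `RangeFill.Fill` helper (the names and surrounding loop are illustrative, not the actual BCL code). The lane table grows to sixteen entries and the guard compares `Vector<int>.Count` against the table length, so the path stays safe at any width the table does not cover:

```csharp
using System;
using System.Numerics;

static class RangeFill
{
    // Sixteen entries cover Vector<int> at 128, 256, and 512 bits (4, 8, or 16 lanes).
    private static ReadOnlySpan<int> LaneIndices =>
        new int[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };

    // Hypothetical helper: writes start, start + 1, ... into destination.
    public static void Fill(Span<int> destination, int start)
    {
        int i = 0;

        // The guard still protects against a width wider than the table.
        if (Vector.IsHardwareAccelerated &&
            Vector<int>.Count <= LaneIndices.Length &&
            destination.Length >= Vector<int>.Count)
        {
            // The span constructor reads only the first Vector<int>.Count elements.
            Vector<int> current = new Vector<int>(LaneIndices) + new Vector<int>(start);
            Vector<int> step = new Vector<int>(Vector<int>.Count);

            for (; i <= destination.Length - Vector<int>.Count; i += Vector<int>.Count)
            {
                current.CopyTo(destination.Slice(i));
                current += step;
            }
        }

        // Scalar tail, or the whole span when the vector path is skipped.
        for (; i < destination.Length; i++)
        {
            destination[i] = start + i;
        }
    }
}
```

The trade-off is that the table has to grow again for every wider vector size, and the data is stored even on hardware that only ever reads the first four entries.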

@tannergooding
Member Author

> but we explicitly guard it on Count because we wrote it knowing this might be coming (though we could also update that now to construct the Vector with a larger sequence so it lights up on 512.) I'm wondering if we might have any others that would need to be tweaked.

I know we have a few places throughout the BCL which guard against it being larger than 256 bits and will use an alternative path if it is. Those will definitely need to be found and updated where relevant, but that work shouldn't block this PR (especially since it's opt-in).

Some of the patterns like new Vector<int>((ReadOnlySpan<int>)[0, 1, 2, 3, 4, 5, 6, 7]); are ones I'd like to "solve" by exposing new Create APIs. Something like CreateSequence(int start, int step) (but with better naming). That should help avoid creating unnecessarily large RVA statics and make the code more portable.
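For illustration only, here is what such a helper could look like if written in user code today. The name `CreateSequence` and its signature come from the suggestion above; the class name and the body are assumptions, and a real API would presumably be handled by the JIT rather than going through a stack buffer:

```csharp
using System;
using System.Numerics;

static class VectorSequence
{
    // Hypothetical helper: lane i of the result holds start + i * step.
    // Sizing the buffer from Vector<int>.Count keeps it width-agnostic,
    // so no fixed-length literal (and no RVA static) is needed.
    public static Vector<int> CreateSequence(int start, int step)
    {
        Span<int> lanes = stackalloc int[Vector<int>.Count];
        for (int i = 0; i < lanes.Length; i++)
        {
            lanes[i] = start + i * step;
        }
        return new Vector<int>(lanes);
    }
}
```

With a helper of this shape, the earlier pattern becomes `VectorSequence.CreateSequence(0, 1)` and no longer needs a guard on `Vector<int>.Count`.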

Member

@kunalspathak kunalspathak left a comment


LGTM. Thanks. The arm changes will come in handy for SVE work, right?

@@ -202,99 +202,99 @@ SIMD_AS_HWINTRINSIC_ID(Vector4, WithElement,
// {TYP_BYTE, TYP_UBYTE, TYP_SHORT, TYP_USHORT, TYP_INT, TYP_UINT, TYP_LONG, TYP_ULONG, TYP_FLOAT, TYP_DOUBLE}
// *************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
// Vector<T> Intrinsics
SIMD_AS_HWINTRINSIC_ID(VectorT128, Abs, 1, {NI_VectorT128_Abs, NI_VectorT128_Abs, NI_VectorT128_Abs, NI_VectorT128_Abs, NI_VectorT128_Abs, NI_VectorT128_Abs, NI_VectorT128_Abs, NI_VectorT128_Abs, NI_VectorT128_Abs, NI_VectorT128_Abs}, SimdAsHWIntrinsicFlag::None)
Member


The changes in this file and simdashwintrinsiclistxarch.h are mainly renaming VectorT128 to VectorT?

Member Author


Right, that's the only change here. It should simplify adding future sizes to Vector<T>, as we no longer have to duplicate table entries per size.

@@ -682,30 +641,11 @@ GenTree* Compiler::impSimdAsHWIntrinsicSpecial(NamedIntrinsic intrinsic,
break;
}

case NI_VectorT128_Sum:
{
if (varTypeIsFloating(simdBaseType))
Member


I see that the case for VectorT_Sum is moved below. I'm wondering: do we no longer need the InstructionSet_* checks?

Member Author


The implementation was updated to no longer use the horizontal operations; the new approach performs better on modern hardware and was needed for Vector512_Sum anyway.
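As a rough picture of a sum that avoids horizontal-add instructions, the reduction can be written as a series of halving steps: add the upper half onto the lower half, then shuffle and add until every lane holds the total. The C# below is a sketch of that general technique using the public `Vector128`/`Vector256` APIs; the `PairwiseSum` name is illustrative, and it is an assumption that the JIT's expansion has this shape:

```csharp
using System;
using System.Runtime.Intrinsics;

static class PairwiseSum
{
    // Sums the eight lanes of v without a horizontal-add operation.
    public static float Sum(Vector256<float> v)
    {
        // 8 lanes -> 4 lanes: add the upper half onto the lower half.
        Vector128<float> s = v.GetLower() + v.GetUpper();

        // 4 -> 2: swap the 64-bit halves and add.
        s += Vector128.Shuffle(s, Vector128.Create(2, 3, 0, 1));

        // 2 -> 1: swap adjacent lanes and add; every lane now holds the total.
        s += Vector128.Shuffle(s, Vector128.Create(1, 0, 3, 2));

        return s.ToScalar();
    }
}
```

The same first step extends to 512 bits: split a `Vector512<float>` into two 256-bit halves, add them, and continue as above. Because each step depends only on vector width, not on a specific instruction, this shape needs no per-ISA special cases at the source level. Note that the lanes are added in a different order than a left-to-right scalar loop, so floating-point results can differ in the last bits.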

@ryujit-bot

Diff results for #97460

Throughput diffs

Throughput diffs for linux/x64 ran on linux/x64

Overall (-0.05% to -0.00%)
Collection PDIFF
realworld.run.linux.x64.checked.mch -0.05%
FullOpts (-0.05% to -0.00%)
Collection PDIFF
realworld.run.linux.x64.checked.mch -0.05%



Throughput diffs for linux/x64 ran on windows/x64

Overall (-0.05% to +0.02%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.02%
benchmarks.run_pgo.linux.x64.checked.mch +0.02%
benchmarks.run_tiered.linux.x64.checked.mch +0.01%
libraries.crossgen2.linux.x64.checked.mch +0.02%
libraries.pmi.linux.x64.checked.mch +0.02%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch +0.02%
realworld.run.linux.x64.checked.mch -0.05%
smoke_tests.nativeaot.linux.x64.checked.mch +0.02%
MinOpts (+0.01% to +0.03%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.01%
benchmarks.run_pgo.linux.x64.checked.mch +0.01%
benchmarks.run_tiered.linux.x64.checked.mch +0.01%
libraries.crossgen2.linux.x64.checked.mch +0.03%
libraries.pmi.linux.x64.checked.mch +0.01%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch +0.01%
realworld.run.linux.x64.checked.mch +0.02%
smoke_tests.nativeaot.linux.x64.checked.mch +0.02%
FullOpts (-0.05% to +0.02%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.02%
benchmarks.run_pgo.linux.x64.checked.mch +0.02%
benchmarks.run_tiered.linux.x64.checked.mch +0.02%
libraries.crossgen2.linux.x64.checked.mch +0.02%
libraries.pmi.linux.x64.checked.mch +0.02%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch +0.02%
realworld.run.linux.x64.checked.mch -0.05%
smoke_tests.nativeaot.linux.x64.checked.mch +0.02%



@ryujit-bot

Diff results for #97460

Assembly diffs

Assembly diffs for linux/x64 ran on windows/x64

Diffs are based on 1,915,317 contexts (623,081 MinOpts, 1,292,236 FullOpts).

MISSED contexts: base: 0 (0.00%), diff: 174 (0.01%)

Overall (-166 bytes)
Collection Base size (bytes) Diff size (bytes)
realworld.run.linux.x64.checked.mch 13,145,429 -166
FullOpts (-166 bytes)
Collection Base size (bytes) Diff size (bytes)
realworld.run.linux.x64.checked.mch 12,758,519 -166



Throughput diffs

Throughput diffs for linux/x64 ran on linux/x64

Overall (-0.05% to -0.00%)
Collection PDIFF
realworld.run.linux.x64.checked.mch -0.05%
FullOpts (-0.05% to +0.00%)
Collection PDIFF
realworld.run.linux.x64.checked.mch -0.05%



@tannergooding tannergooding merged commit be6c9f6 into dotnet:main Jan 24, 2024
139 checks passed
@tannergooding tannergooding deleted the vectort-512 branch January 24, 2024 23:28
@ryujit-bot

Diff results for #97460

Assembly diffs

Assembly diffs for linux/x64 ran on windows/x64

Diffs are based on 2,512,082 contexts (977,766 MinOpts, 1,534,316 FullOpts).

MISSED contexts: base: 0 (0.00%), diff: 180 (0.01%)

Overall (-166 bytes)
Collection Base size (bytes) Diff size (bytes)
realworld.run.linux.x64.checked.mch 13,145,429 -166
FullOpts (-166 bytes)
Collection Base size (bytes) Diff size (bytes)
realworld.run.linux.x64.checked.mch 12,758,519 -166

Assembly diffs for windows/x64 ran on windows/x64

Diffs are based on 2,373,018 contexts (928,740 MinOpts, 1,444,278 FullOpts).

MISSED contexts: base: 0 (0.00%), diff: 183 (0.01%)

Overall (+407 bytes)
Collection Base size (bytes) Diff size (bytes)
realworld.run.windows.x64.checked.mch 14,193,402 +407
FullOpts (+407 bytes)
Collection Base size (bytes) Diff size (bytes)
realworld.run.windows.x64.checked.mch 13,803,697 +407



Assembly diffs for windows/x86 ran on windows/x86

Diffs are based on 2,298,941 contexts (840,452 MinOpts, 1,458,489 FullOpts).

MISSED contexts: base: 7 (0.00%), diff: 187 (0.01%)

Overall (-50 bytes)
Collection Base size (bytes) Diff size (bytes)
realworld.run.windows.x86.checked.mch 11,368,246 -50
FullOpts (-50 bytes)
Collection Base size (bytes) Diff size (bytes)
realworld.run.windows.x86.checked.mch 11,072,546 -50



Throughput diffs

Throughput diffs for linux/x64 ran on windows/x64

Overall (-0.05% to +0.02%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.02%
benchmarks.run_pgo.linux.x64.checked.mch +0.02%
benchmarks.run_tiered.linux.x64.checked.mch +0.01%
coreclr_tests.run.linux.x64.checked.mch +0.02%
libraries.crossgen2.linux.x64.checked.mch +0.02%
libraries.pmi.linux.x64.checked.mch +0.02%
libraries_tests.run.linux.x64.Release.mch +0.02%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch +0.02%
realworld.run.linux.x64.checked.mch -0.05%
smoke_tests.nativeaot.linux.x64.checked.mch +0.02%
MinOpts (+0.01% to +0.03%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.01%
benchmarks.run_pgo.linux.x64.checked.mch +0.01%
benchmarks.run_tiered.linux.x64.checked.mch +0.01%
coreclr_tests.run.linux.x64.checked.mch +0.01%
libraries.crossgen2.linux.x64.checked.mch +0.03%
libraries.pmi.linux.x64.checked.mch +0.01%
libraries_tests.run.linux.x64.Release.mch +0.01%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch +0.01%
realworld.run.linux.x64.checked.mch +0.02%
smoke_tests.nativeaot.linux.x64.checked.mch +0.02%
FullOpts (-0.05% to +0.02%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.02%
benchmarks.run_pgo.linux.x64.checked.mch +0.02%
benchmarks.run_tiered.linux.x64.checked.mch +0.02%
coreclr_tests.run.linux.x64.checked.mch +0.02%
libraries.crossgen2.linux.x64.checked.mch +0.02%
libraries.pmi.linux.x64.checked.mch +0.02%
libraries_tests.run.linux.x64.Release.mch +0.02%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch +0.02%
realworld.run.linux.x64.checked.mch -0.05%
smoke_tests.nativeaot.linux.x64.checked.mch +0.02%

Throughput diffs for windows/x64 ran on windows/x64

Overall (-0.05% to +0.02%)
Collection PDIFF
benchmarks.run.windows.x64.checked.mch +0.02%
benchmarks.run_pgo.windows.x64.checked.mch +0.02%
benchmarks.run_tiered.windows.x64.checked.mch +0.02%
coreclr_tests.run.windows.x64.checked.mch +0.02%
libraries.crossgen2.windows.x64.checked.mch +0.02%
libraries.pmi.windows.x64.checked.mch +0.02%
libraries_tests.run.windows.x64.Release.mch +0.02%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch +0.02%
realworld.run.windows.x64.checked.mch -0.05%
smoke_tests.nativeaot.windows.x64.checked.mch +0.02%
MinOpts (+0.01% to +0.03%)
Collection PDIFF
benchmarks.run.windows.x64.checked.mch +0.01%
benchmarks.run_pgo.windows.x64.checked.mch +0.01%
benchmarks.run_tiered.windows.x64.checked.mch +0.01%
coreclr_tests.run.windows.x64.checked.mch +0.01%
libraries.crossgen2.windows.x64.checked.mch +0.03%
libraries.pmi.windows.x64.checked.mch +0.01%
libraries_tests.run.windows.x64.Release.mch +0.01%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch +0.01%
realworld.run.windows.x64.checked.mch +0.01%
smoke_tests.nativeaot.windows.x64.checked.mch +0.02%
FullOpts (-0.05% to +0.02%)
Collection PDIFF
benchmarks.run.windows.x64.checked.mch +0.02%
benchmarks.run_pgo.windows.x64.checked.mch +0.02%
benchmarks.run_tiered.windows.x64.checked.mch +0.02%
coreclr_tests.run.windows.x64.checked.mch +0.02%
libraries.crossgen2.windows.x64.checked.mch +0.02%
libraries.pmi.windows.x64.checked.mch +0.02%
libraries_tests.run.windows.x64.Release.mch +0.02%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch +0.02%
realworld.run.windows.x64.checked.mch -0.05%
smoke_tests.nativeaot.windows.x64.checked.mch +0.02%



Throughput diffs for windows/x86 ran on windows/x86

Overall (-0.05% to +0.00%)
Collection PDIFF
realworld.run.windows.x86.checked.mch -0.05%
FullOpts (-0.05% to +0.00%)
Collection PDIFF
realworld.run.windows.x86.checked.mch -0.05%



@github-actions github-actions bot locked and limited conversation to collaborators Feb 24, 2024