Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace use of target dependent TestZ intrinsic #104488

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

xtqqczze
Copy link
Contributor

@xtqqczze xtqqczze commented Jul 5, 2024

Contributes to #101251.

@xtqqczze xtqqczze changed the title Replace uses of TestZ intrinsic Replace use of target dependent TestZ intrinsic Jul 6, 2024
@jkotas jkotas added area-System.Runtime.Intrinsics and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Jul 6, 2024
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

@xtqqczze xtqqczze marked this pull request as ready for review July 6, 2024 14:35
@@ -1521,22 +1521,15 @@ private static bool VectorContainsNonAsciiChar(Vector128<byte> asciiVector)
internal static bool VectorContainsNonAsciiChar(Vector128<ushort> utf16Vector)
{
// prefer architecture specific intrinsic as they offer better perf
if (Sse2.IsSupported)
#pragma warning disable IntrinsicsInSystemPrivateCoreLibConditionParsing // A negated IsSupported condition isn't parseable by the intrinsics analyzer
if (Sse2.IsSupported && !Sse41.IsSupported)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The warning exists because this is going to flag the code as depending on SSE4.1

We should likely not consider the SSE2 path and just have SSE4.1, as such hardware is increasingly irrelevant (more than 17 years old at this point) and even crossgen is defaulting to an opportunistic SSE4.1 target today

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is, remove this path entirely and just have if (AdvSimd.Arm64.IsSupported) { } else { }

This will do the right thing for SSE4.1 already and while it might slightly pessimize an SSE2 only machine, such machines are old/rare enough that's acceptable IMO. We're talking about what ends up being an instruction or so difference in practice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes can be made in a follow-up PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xtqqczze, I thought I had responded to this, but apparently missed it.

This isn't "just" about the analyzer, but also pertains to how the trimmer and other tooling works. We're going to need to fix this before the PR can be merged or it will cause regressions and the PR will be reverted by default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tannergooding Is d21415f still problematic?

@xtqqczze
Copy link
Contributor Author

Depends on #102705.

@xtqqczze
Copy link
Contributor Author

@MihuBot

@xtqqczze
Copy link
Contributor Author

xtqqczze commented Jul 11, 2024

@tannergooding There are unexpected regressions, see MihuBot/runtime-utils#501

System.Text.Ascii:IsValidCore[ubyte](byref,int):ubyte

 G_M58774_IG07:
        cmp      esi, 64
        jg       SHORT G_M58774_IG08
        vmovups  ymm0, ymmword ptr [rdi]
-       vpor     ymm0, ymm0, ymmword ptr [rax-0x20]
-       vmovups  ymm1, ymmword ptr [reloc @RWD32]
-       vptest   ymm0, ymm1
+       vmovups  ymm1, ymmword ptr [rax-0x20]
+       vmovups  ymm2, ymmword ptr [reloc @RWD32]
+       vpternlogd ymm0, ymm1, ymm2, -88
+       vptest   ymm0, ymm0
        sete     cl
        movzx    rcx, cl
        jmp      SHORT G_M58774_IG04
-						;; size=35 bbWeight=0.50 PerfScore 10.75
+						;; size=42 bbWeight=0.50 PerfScore 12.00

@tannergooding
Copy link
Member

That particular case should be mostly resolved with #104517, vpternlog was missing general support for ensuring we select optimal containment

There's more improvements that could be had as well, but its a step in the right direction overall

@xtqqczze
Copy link
Contributor Author

xtqqczze commented Jul 11, 2024

Depends on #104517.

@xtqqczze
Copy link
Contributor Author

xtqqczze commented Jul 12, 2024

I'm trying a refactoring using ISimdVector in xtqqczze@a410294.

@xtqqczze
Copy link
Contributor Author

@MihuBot

@xtqqczze
Copy link
Contributor Author

xtqqczze commented Jul 13, 2024

That particular case should be mostly resolved with #104517, vpternlog was missing general support for ensuring we select optimal containment

There's more improvements that could be had as well, but its a step in the right direction overall

@tannergooding Yes that case is mostly resolved, see MihuBot/runtime-utils#519

System.Text.Ascii:IsValidCore[ubyte](byref,int):ubyte
 G_M58774_IG07:
        cmp      esi, 64
        jg       SHORT G_M58774_IG08
        vmovups  ymm0, ymmword ptr [rdi]
-       vpor     ymm0, ymm0, ymmword ptr [rax-0x20]
        vmovups  ymm1, ymmword ptr [reloc @RWD32]
-       vptest   ymm0, ymm1
+       vpternlogd ymm0, ymm1, ymmword ptr [rax-0x20], -56
+       vptest   ymm0, ymm0
        sete     cl
        movzx    rcx, cl
        jmp      SHORT G_M58774_IG04
-						;; size=35 bbWeight=0.50 PerfScore 10.75
+						;; size=38 bbWeight=0.50 PerfScore 10.75

Analysis with llvm-mca shows 22 vs 15 cycles of latency, an increase of 7, see https://analysis.godbolt.org/z/Yj3Ydoees

@xtqqczze
Copy link
Contributor Author

xtqqczze commented Jul 13, 2024

@tannergooding After #104517, still seeing some (not new) regressions, see MihuBot/runtime-utils#519

System.Text.Ascii:GetIndexOfFirstNonAsciiChar_Intrinsified(ulong,ulong):ulong
 G_M29265_IG06:
        vmovups  xmm0, xmmword ptr [rdi]
        vmovups  xmm2, xmmword ptr [rdi+0x10]
-       vpor     xmm3, xmm0, xmm2
-       vptest   xmm3, xmm1
+       vmovaps  xmm3, xmm0
+       vpternlogd xmm3, xmm1, xmm2, -56
+       vptest   xmm3, xmm3
        jne      G_M29265_IG16
        add      rdi, 32
        cmp      rdi, rcx
        jbe      SHORT G_M29265_IG06
-						;; size=33 bbWeight=4 PerfScore 55.33
+						;; size=40 bbWeight=4 PerfScore 57.00

Analysis with llvm-mca shows 14 vs 14 cycles of latency, no increase, see https://analysis.godbolt.org/z/se74h8nvq

@tannergooding
Copy link
Member

It's not really a regression per say and in other cases could be an optimization.

Either way, it shouldn't be impactful to this PR, I'll just finish the remaining work here and not fold in the case the AND is part of an op_Equality check

@xtqqczze
Copy link
Contributor Author

xtqqczze commented Jul 16, 2024

@tannergooding Depends on #104944.

@tannergooding
Copy link
Member

I wouldn't say this depends on #10944, I think it's fine to take as is. If you could mark it ready for review then I can give it a final pass and merge

@xtqqczze
Copy link
Contributor Author

@tannergooding I've suggested we make the suggested changes in a follow-up PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Runtime.Intrinsics community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants