-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace use of target dependent TestZ
intrinsic
#104488
base: main
Are you sure you want to change the base?
Conversation
d58b185
to
05211ad
Compare
TestZ
intrinsicTestZ
intrinsic
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics |
@@ -1521,22 +1521,15 @@ private static bool VectorContainsNonAsciiChar(Vector128<byte> asciiVector) | |||
internal static bool VectorContainsNonAsciiChar(Vector128<ushort> utf16Vector) | |||
{ | |||
// prefer architecture specific intrinsic as they offer better perf | |||
if (Sse2.IsSupported) | |||
#pragma warning disable IntrinsicsInSystemPrivateCoreLibConditionParsing // A negated IsSupported condition isn't parseable by the intrinsics analyzer | |||
if (Sse2.IsSupported && !Sse41.IsSupported) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The warning exists because this is going to flag the code as depending on SSE4.1
We should likely not consider the SSE2 path and just have SSE4.1, as such hardware is increasingly irrelevant (more than 17 years old at this point) and even crossgen is defaulting to an opportunistic SSE4.1 target today
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is, remove this path entirely and just have if (AdvSimd.Arm64.IsSupported) { } else { }
This will do the right thing for SSE4.1 already and while it might slightly pessimize an SSE2 only machine, such machines are old/rare enough that's acceptable IMO. We're talking about what ends up being an instruction or so difference in practice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes can be made in a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xtqqczze, I thought I had responded to this, but apparently missed it.
This isn't "just" about the analyzer, but also pertains to how the trimmer and other tooling works. We're going to need to fix this before the PR can be merged or it will cause regressions and the PR will be reverted by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tannergooding Is d21415f still problematic?
Depends on #102705. |
@tannergooding There are unexpected regressions, see MihuBot/runtime-utils#501
|
That particular case should be mostly resolved with #104517, There's more improvements that could be had as well, but its a step in the right direction overall |
|
I'm trying a refactoring using |
System.Text.Ascii:IsValidCore[ubyte](byref,int):ubyte G_M58774_IG07:
cmp esi, 64
jg SHORT G_M58774_IG08
vmovups ymm0, ymmword ptr [rdi]
- vpor ymm0, ymm0, ymmword ptr [rax-0x20]
vmovups ymm1, ymmword ptr [reloc @RWD32]
- vptest ymm0, ymm1
+ vpternlogd ymm0, ymm1, ymmword ptr [rax-0x20], -56
+ vptest ymm0, ymm0
sete cl
movzx rcx, cl
jmp SHORT G_M58774_IG04
- ;; size=35 bbWeight=0.50 PerfScore 10.75
+ ;; size=38 bbWeight=0.50 PerfScore 10.75 Analysis with llvm-mca shows 22 vs 15 cycles of latency, an increase of 7, see https://analysis.godbolt.org/z/Yj3Ydoees |
@tannergooding After #104517, still seeing some (not new) regressions, see MihuBot/runtime-utils#519 System.Text.Ascii:GetIndexOfFirstNonAsciiChar_Intrinsified(ulong,ulong):ulong G_M29265_IG06:
vmovups xmm0, xmmword ptr [rdi]
vmovups xmm2, xmmword ptr [rdi+0x10]
- vpor xmm3, xmm0, xmm2
- vptest xmm3, xmm1
+ vmovaps xmm3, xmm0
+ vpternlogd xmm3, xmm1, xmm2, -56
+ vptest xmm3, xmm3
jne G_M29265_IG16
add rdi, 32
cmp rdi, rcx
jbe SHORT G_M29265_IG06
- ;; size=33 bbWeight=4 PerfScore 55.33
+ ;; size=40 bbWeight=4 PerfScore 57.00 Analysis with llvm-mca shows 14 vs 14 cycles of latency, no increase, see https://analysis.godbolt.org/z/se74h8nvq |
It's not really a regression per say and in other cases could be an optimization. Either way, it shouldn't be impactful to this PR, I'll just finish the remaining work here and not fold in the case the |
|
I wouldn't say this depends on #10944, I think it's fine to take as is. If you could mark it |
@tannergooding I've suggested we make the suggested changes in a follow-up PR. |
Contributes to #101251.