Suboptimal x64 codegen for signed Math.BigMul #75594

Open
tevador opened this issue Sep 14, 2022 · 8 comments
Labels
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) · tenet-performance (Performance related issue)
Milestone
Future

Comments

@tevador

tevador commented Sep 14, 2022

Description

The JIT generates suboptimal x64 code for Math.BigMul(long, long, out long).

Configuration

  • .NET 6
  • Windows 10
  • AMD Ryzen CPU (x64)

Data

Source code:

[MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
public static void TestBigMul2(ref ulong x, ref ulong y)
{
	x = Math.BigMul(x, y, out y);
}

[MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
public static void TestBigMul1(ref long x, ref long y)
{
	x = Math.BigMul(x, y, out y);
}

Results in the following machine code:

TestBigMul2(ref ulong, ref ulong):
 push        rax  
 mov         rax,qword ptr [rcx]  
 mov         qword ptr [rsp+18h],rdx  
 mov         r8,qword ptr [rdx]  
 lea         r9,[rsp]  
 mov         rdx,rax  
 mulx        rax,r10,r8  
 mov         qword ptr [r9],r10  
 mov         rdx,qword ptr [rsp]  
 mov         r8,qword ptr [rsp+18h]  
 mov         qword ptr [r8],rdx  
 mov         qword ptr [rcx],rax  
 add         rsp,8  
 ret  

TestBigMul1(ref long, ref long):
 push        rax  
 mov         rax,qword ptr [rcx]  
 mov         qword ptr [rsp+18h],rdx  
 mov         r8,qword ptr [rdx]  
 lea         r9,[rsp]  
 mov         rdx,rax  
 mulx        rdx,r10,r8  
 mov         qword ptr [r9],r10  
 mov         r9,qword ptr [rsp]  
 mov         r10,qword ptr [rsp+18h]  
 mov         qword ptr [r10],r9  
 mov         r9,rax  
 sar         r9,3Fh  
 and         r9,r8  
 sub         rdx,r9  
 sar         r8,3Fh  
 and         rax,r8  
 sub         rdx,rax  
 mov         qword ptr [rcx],rdx  
 add         rsp,8  
 ret

Analysis

The unsigned overload uses a single mulx instruction as expected.

The signed overload also uses mulx, followed by six additional instructions (sar, and, sub, twice over) to adjust the upper half of the result. This increases the latency from 4 cycles to at least 8 cycles in fully inlined code. The adjustment is entirely unnecessary, as the x64 architecture has a dedicated instruction for signed widening multiplication: the one-operand imul. The whole mulx, sar, and, sub, sar, and, sub sequence could thus be replaced by a single imul instruction.
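For illustration, a hedged sketch of what the signed helper could compile to if the JIT emitted the one-operand imul (register choices here are hypothetical; imul r/m64 leaves the full signed 128-bit product in RDX:RAX):

TestBigMul1(ref long, ref long):      ; hypothetical imul-based codegen
 mov         r8,rdx                   ; save ref y (imul clobbers RDX)
 mov         rax,qword ptr [rcx]      ; RAX = x
 imul        qword ptr [r8]           ; RDX:RAX = x * y (signed)
 mov         qword ptr [r8],rax       ; low half  -> y
 mov         qword ptr [rcx],rdx      ; high half -> x
 ret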

Also in this particular case, both methods use the stack unnecessarily, but that's probably a separate issue.

category:cq
theme:floating-point
skill-level:intermediate
cost:medium
impact:small

@tevador added the tenet-performance (Performance related issue) label Sep 14, 2022
@dotnet-issue-labeler bot added the area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) label Sep 14, 2022
@ghost added the untriaged (New issue has not been triaged by the area owner) label Sep 14, 2022
@ghost

ghost commented Sep 14, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@JulieLeeMSFT removed the untriaged (New issue has not been triaged by the area owner) label Sep 14, 2022
@JulieLeeMSFT JulieLeeMSFT added this to the Future milestone Sep 14, 2022
@JulieLeeMSFT
Member

cc @dotnet/jit-contrib @tannergooding.

@tannergooding
Member

This is a similar issue to #5213, #27292, #68207, #58263, etc.

The root issue is that the JIT doesn't have any code to optimize/special-case the 32×32→64 or 64×64→128 multiplication scenarios. This is partly because they require specialized logic: each represents an instruction with "multiple register returns".

Carol Eidt added some support for multiple register returns a while back, but it's not been picked up in many places in the JIT yet. I think @kunalspathak and @EgorBo are currently the two with the most context on what would be required to make this work here.

@kunalspathak
Member

Taking a quick look, we do have imul instructions that are generated for GT_MUL, but it seems that for the BigMul case we end up using the hardware-intrinsics variant, MultiplyNoFlags, and generate mulx.

if (Bmi2.X64.IsSupported)
{
    ulong tmp;
    ulong high = Bmi2.X64.MultiplyNoFlags(a, b, &tmp);
    low = tmp;
    return high;
}

The whole sequence of mulx, sar, and, sub, sar, and, sub could thus be replaced by a single imul instruction.

Yeah, currently it is getting generated because that's how we implemented it.

ulong high = BigMul((ulong)a, (ulong)b, out ulong ulow);
low = (long)ulow;
return (long)high - ((a >> 63) & b) - ((b >> 63) & a);
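
The adjustment works because reinterpreting a negative signed operand as unsigned adds 2^64 to its value, which inflates the unsigned high half by exactly the other operand; the (x >> 63) & y terms use an arithmetic shift to build an all-ones mask when x is negative and subtract that inflation back out. A minimal check of the identity (illustrative names, not the runtime code):

using System;

class BigMulIdentityCheck
{
    static void Main()
    {
        long a = -3, b = 5;

        // Unsigned 128-bit product of the raw bit patterns.
        ulong uhigh = Math.BigMul((ulong)a, (ulong)b, out ulong ulow);

        // (a >> 63) is all ones when a < 0 and zero otherwise, so each
        // term subtracts the other operand exactly when it has to.
        long high = (long)uhigh - ((a >> 63) & b) - ((b >> 63) & a);

        long expected = Math.BigMul(a, b, out long low);
        Console.WriteLine(high == expected && (long)ulow == low); // True
    }
}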

@tannergooding
Member

Taking a quick look, we do have imul instructions that are generated for GT_MUL

Right, but much like with div and idiv, we only handle "one register" of the output. We don't have the handling required to get out both EAX and EDX and put them in the respective return slots for the result.

This is why we do the mulx approach instead, as that's the "next closest thing". Ideally we'd mark Math.BigMul as intrinsic and have the JIT treat it like we do for mulx, but emit imul instead so it computes the correct sign for the upper half.

@SingleAccretion
Contributor

I think this would be straightforward to fix by implementing #58263, which should be very simple after #66551 is merged.

@tevador
Author

tevador commented Sep 14, 2022

So the main issue is the lack of x86 intrinsics for widening multiplication other than Bmi2.X64.MultiplyNoFlags, which is not part of the base x64 instruction set. Older x64 CPUs such as Intel Ivy Bridge will get the software fallback even though they are capable of performing the calculation with one instruction (mul or imul).
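
For context, without BMI2 the 64×64→128 multiply has to be synthesized from 32-bit halves, roughly along these lines (a hedged sketch of the shape of such a fallback, not the exact runtime code):

static ulong BigMulSoftware(ulong a, ulong b, out ulong low)
{
    // Split each operand into 32-bit halves and combine the partial
    // products; mid fits in 64 bits since it sums three values below 2^32.
    ulong aLo = (uint)a, aHi = a >> 32;
    ulong bLo = (uint)b, bHi = b >> 32;

    ulong ll = aLo * bLo;
    ulong lh = aLo * bHi;
    ulong hl = aHi * bLo;

    ulong mid = (ll >> 32) + (uint)lh + (uint)hl;
    low = (uint)ll | (mid << 32);
    return aHi * bHi + (lh >> 32) + (hl >> 32) + (mid >> 32);
}

That is roughly a dozen instructions for something a single mul (or imul) computes directly.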

@EgorBo
Copy link
Member

EgorBo commented Oct 5, 2022

A proper fix for this problem will speed up the XxHash3 benchmarks (#76641) on x64.
