Suboptimal x64 codegen for signed Math.BigMul #75594

Open
tevador opened this issue Sep 14, 2022 · 8 comments
Labels
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) · tenet-performance (Performance related issue)
Milestone
Future

Comments

@tevador

tevador commented Sep 14, 2022

Description

The JIT generates suboptimal x64 code for Math.BigMul(long, long, out long).

Configuration

  • .NET 6
  • Windows 10
  • AMD Ryzen CPU (x64)

Data

Source code:

[MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
public static void TestBigMul2(ref ulong x, ref ulong y)
{
	x = Math.BigMul(x, y, out y);
}

[MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
public static void TestBigMul1(ref long x, ref long y)
{
	x = Math.BigMul(x, y, out y);
}

Results in the following machine code:

TestBigMul2(ref ulong, ref ulong):
 push        rax  
 mov         rax,qword ptr [rcx]  
 mov         qword ptr [rsp+18h],rdx  
 mov         r8,qword ptr [rdx]  
 lea         r9,[rsp]  
 mov         rdx,rax  
 mulx        rax,r10,r8  
 mov         qword ptr [r9],r10  
 mov         rdx,qword ptr [rsp]  
 mov         r8,qword ptr [rsp+18h]  
 mov         qword ptr [r8],rdx  
 mov         qword ptr [rcx],rax  
 add         rsp,8  
 ret  

TestBigMul1(ref long, ref long):
 push        rax  
 mov         rax,qword ptr [rcx]  
 mov         qword ptr [rsp+18h],rdx  
 mov         r8,qword ptr [rdx]  
 lea         r9,[rsp]  
 mov         rdx,rax  
 mulx        rdx,r10,r8  
 mov         qword ptr [r9],r10  
 mov         r9,qword ptr [rsp]  
 mov         r10,qword ptr [rsp+18h]  
 mov         qword ptr [r10],r9  
 mov         r9,rax  
 sar         r9,3Fh  
 and         r9,r8  
 sub         rdx,r9  
 sar         r8,3Fh  
 and         rax,r8  
 sub         rdx,rax  
 mov         qword ptr [rcx],rdx  
 add         rsp,8  
 ret

Analysis

The unsigned overload uses a single mulx instruction as expected.

The signed overload also uses mulx, followed by six additional instructions (sar, and, sub, twice over) to adjust the upper half of the result. This increases the latency from 4 cycles to at least 8 cycles in fully inlined code. The adjustment is entirely unnecessary, as the x64 architecture has a dedicated instruction for signed widening multiplication: the one-operand imul. The whole mulx, sar, and, sub, sar, and, sub sequence could thus be replaced by a single imul instruction.
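For illustration, a hedged sketch of what the signed helper could compile to if the JIT emitted the one-operand imul (register choices here are hypothetical; imul r/m64 leaves the full signed 128-bit product in RDX:RAX):

TestBigMul1(ref long, ref long):      ; hypothetical imul-based codegen
 mov         r8,rdx                   ; save ref y (imul clobbers RDX)
 mov         rax,qword ptr [rcx]      ; RAX = x
 imul        qword ptr [r8]           ; RDX:RAX = x * y (signed)
 mov         qword ptr [r8],rax       ; low half  -> y
 mov         qword ptr [rcx],rdx      ; high half -> x
 ret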

Also in this particular case, both methods use the stack unnecessarily, but that's probably a separate issue.

category:cq
theme:floating-point
skill-level:intermediate
cost:medium
impact:small

@tevador added the tenet-performance (Performance related issue) label Sep 14, 2022
@dotnet-issue-labeler bot added the area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) label Sep 14, 2022
@ghost added the untriaged (New issue has not been triaged by the area owner) label Sep 14, 2022
@ghost

ghost commented Sep 14, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@JulieLeeMSFT removed the untriaged (New issue has not been triaged by the area owner) label Sep 14, 2022
@JulieLeeMSFT JulieLeeMSFT added this to the Future milestone Sep 14, 2022
@JulieLeeMSFT
Member

cc @dotnet/jit-contrib @tannergooding.

@tannergooding
Member

This is a similar issue to #5213, #27292, #68207, #58263, etc.

The root issue is that the JIT doesn't have any code to optimize/special-case the 32×32→64 or 64×64→128 multiplication scenarios. This is partly because they require specialized logic: each represents an instruction with "multiple register returns".

Carol Eidt added some support for multiple register returns a while back, but it's not been picked up in many places in the JIT yet. I think @kunalspathak and @EgorBo are currently the two with the most context on what would be required to make this work here.

@kunalspathak
Member

Taking a quick look, we do have imul instructions that are generated for GT_MUL, but it seems that for the BigMul case we end up using the hardware-intrinsics variant, MultiplyNoFlags, and generate mulx.

if (Bmi2.X64.IsSupported)
{
    ulong tmp;
    ulong high = Bmi2.X64.MultiplyNoFlags(a, b, &tmp);
    low = tmp;
    return high;
}

The whole sequence of mulx, sar, and, sub, sar, and, sub could thus be replaced by a single imul instruction.

Yeah, currently it is getting generated because that's how we implemented it.

ulong high = BigMul((ulong)a, (ulong)b, out ulong ulow);
low = (long)ulow;
return (long)high - ((a >> 63) & b) - ((b >> 63) & a);
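
The adjustment works because reinterpreting a negative signed operand as unsigned adds 2^64 to its value, which inflates the unsigned high half by exactly the other operand; the (x >> 63) & y terms use an arithmetic shift to build an all-ones mask when x is negative and subtract that inflation back out. A minimal check of the identity (illustrative names, not the runtime code):

using System;

class BigMulIdentityCheck
{
    static void Main()
    {
        long a = -3, b = 5;

        // Unsigned 128-bit product of the raw bit patterns.
        ulong uhigh = Math.BigMul((ulong)a, (ulong)b, out ulong ulow);

        // (a >> 63) is all ones when a < 0 and zero otherwise, so each
        // term subtracts the other operand exactly when it has to.
        long high = (long)uhigh - ((a >> 63) & b) - ((b >> 63) & a);

        long expected = Math.BigMul(a, b, out long low);
        Console.WriteLine(high == expected && (long)ulow == low); // True
    }
}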

@tannergooding
Member

Taking a quick look, we do have imul instructions that are generated for GT_MUL

Right, but much like with div and idiv, we only handle "one register" of the output. We don't have the handling required to get out both EAX and EDX and put them in the respective return slots for the result.

This is why we do the mulx approach instead, as that's the "next closest thing". Ideally we'd mark Math.BigMul as intrinsic and have the JIT treat it like we do for mulx, but emit imul instead so it computes the correct sign for the upper half.

@SingleAccretion
Contributor

I think this would be straightforward to fix by implementing #58263, which should be very simple after #66551 is merged.

@tevador
Author

tevador commented Sep 14, 2022

So the main issue is the lack of x86 intrinsics for widening multiplication other than Bmi2.X64.MultiplyNoFlags, which is not part of the base x64 instruction set. Older x64 CPUs such as Intel Ivy Bridge will get the software fallback even though they are capable of performing the calculation with one instruction (mul or imul).
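
For context, without BMI2 the 64×64→128 multiply has to be synthesized from 32-bit halves, roughly along these lines (a hedged sketch of the shape of such a fallback, not the exact runtime code):

static ulong BigMulSoftware(ulong a, ulong b, out ulong low)
{
    // Split each operand into 32-bit halves and combine the partial
    // products; mid fits in 64 bits since it sums three values below 2^32.
    ulong aLo = (uint)a, aHi = a >> 32;
    ulong bLo = (uint)b, bHi = b >> 32;

    ulong ll = aLo * bLo;
    ulong lh = aLo * bHi;
    ulong hl = aHi * bLo;

    ulong mid = (ll >> 32) + (uint)lh + (uint)hl;
    low = (uint)ll | (mid << 32);
    return aHi * bHi + (lh >> 32) + (hl >> 32) + (mid >> 32);
}

That is roughly a dozen instructions for something a single mul (or imul) computes directly.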

@EgorBo
Copy link
Member

EgorBo commented Oct 5, 2022

A proper fix for this problem will speed up the XxHash3 benchmarks (#76641) on x64.
