Suboptimal x64 codegen for signed Math.BigMul #75594
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Issue Details
Description
The JIT generates suboptimal x64 code for Math.BigMul(long, long, out long).
Configuration
Data
Source code:
[MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
public static void TestBigMul2(ref ulong x, ref ulong y)
{
    x = Math.BigMul(x, y, out y);
}
[MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
public static void TestBigMul1(ref long x, ref long y)
{
    x = Math.BigMul(x, y, out y);
}
Results in the following machine code:
TestBigMul2(ref ulong, ref ulong):
push rax
mov rax,qword ptr [rcx]
mov qword ptr [rsp+18h],rdx
mov r8,qword ptr [rdx]
lea r9,[rsp]
mov rdx,rax
mulx rax,r10,r8
mov qword ptr [r9],r10
mov rdx,qword ptr [rsp]
mov r8,qword ptr [rsp+18h]
mov qword ptr [r8],rdx
mov qword ptr [rcx],rax
add rsp,8
ret
TestBigMul1(ref long, ref long):
push rax
mov rax,qword ptr [rcx]
mov qword ptr [rsp+18h],rdx
mov r8,qword ptr [rdx]
lea r9,[rsp]
mov rdx,rax
mulx rdx,r10,r8
mov qword ptr [r9],r10
mov r9,qword ptr [rsp]
mov r10,qword ptr [rsp+18h]
mov qword ptr [r10],r9
mov r9,rax
sar r9,3Fh
and r9,r8
sub rdx,r9
sar r8,3Fh
and rax,r8
sub rdx,rax
mov qword ptr [rcx],rdx
add rsp,8
ret
cc @dotnet/jit-contrib @tannergooding.
This is a similar issue to #5213, #27292, #68207, #58263, etc.
The root issue is that the JIT doesn't have any code to optimize/special-case
Carol Eidt added some support for multiple register returns a while back, but it's not been picked up in many places in the JIT yet. I think @kunalspathak and @EgorBo are currently the two with the most context as to what would be required to make this work here.
Taking a quick look, we do have
runtime/src/libraries/System.Private.CoreLib/src/System/Math.cs
Lines 177 to 183 in 522f808
Yeah, currently it is getting generated because that's how we implemented it.
runtime/src/libraries/System.Private.CoreLib/src/System/Math.cs
Lines 229 to 231 in 522f808
Right, but much like with This is why we do the
So the main issue is the lack of x86 intrinsics for wide multiplication except for
A proper fix for this problem will speed up benchmarks for XxHash3: #76641 on x64.
Description
The JIT generates suboptimal x64 code for Math.BigMul(long, long, out long).
Configuration
Data
Source code:
Results in the following machine code:
Analysis
The unsigned overload uses a single mulx instruction as expected.
The signed overload also uses mulx, followed by 6 additional instructions (2x sar, and, sub) to adjust the upper half of the result. This increases the latency from 4 cycles to at least 8 cycles in fully inlined code. This is completely unnecessary as the x64 architecture has a dedicated instruction for signed multiplication: the one-operand imul. The whole sequence of mulx, sar, and, sub, sar, and, sub could thus be replaced by a single imul instruction.
Also in this particular case, both methods use the stack unnecessarily, but that's probably a separate issue.
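For readers who want to see where the sar, and, sub pairs come from, here is a minimal sketch of the arithmetic they perform, written against the public Math.BigMul overloads. The class name SignedBigMulSketch and the helper BigMulViaUnsigned are hypothetical names used only for illustration, and this is not presented as the runtime's actual source. The idea is that reinterpreting a negative operand as unsigned adds 2^64 to it, which adds the other operand to the upper half of the product, so the signed upper half is recovered by subtracting those two terms.

```csharp
using System;

public static class SignedBigMulSketch
{
    // Hypothetical helper: signed 64x64 -> 128-bit multiply built on the
    // unsigned overload. Returns the upper 64 bits; 'low' gets the lower 64.
    public static long BigMulViaUnsigned(long a, long b, out long low)
    {
        unchecked
        {
            ulong high = Math.BigMul((ulong)a, (ulong)b, out ulong ulow);
            low = (long)ulow;
            // (a >> 63) is all ones when a is negative and zero otherwise,
            // so each term subtracts the other operand only when needed.
            return (long)high - ((a >> 63) & b) - ((b >> 63) & a);
        }
    }

    public static void Main()
    {
        long[] samples = { 0, 1, -1, 3, -7, long.MaxValue, long.MinValue };
        foreach (long a in samples)
        {
            foreach (long b in samples)
            {
                long hi = BigMulViaUnsigned(a, b, out long lo);
                long expectedHi = Math.BigMul(a, b, out long expectedLo);
                if (hi != expectedHi || lo != expectedLo)
                {
                    throw new Exception($"Mismatch for {a} * {b}");
                }
            }
        }
        Console.WriteLine("all samples match");
    }
}
```

The two subtractions in the return expression correspond to the two sar, and, sub groups in the signed listing, which is the work a single one-operand imul would make unnecessary.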
category:cq
theme:floating-point
skill-level:intermediate
cost:medium
impact:small