Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Cast UInt64 to Single directly during const folding #106419

Merged
merged 14 commits into from
Aug 20, 2024

Conversation

amanasifkhalid
Copy link
Member

Fixes #106338 (comment).

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Aug 14, 2024
@amanasifkhalid amanasifkhalid added this to the 9.0.0 milestone Aug 14, 2024
@amanasifkhalid
Copy link
Member Author

amanasifkhalid commented Aug 14, 2024

@dotnet/jit-contrib PTAL. The new test is failing on mono:

Xunit.Sdk.EqualException: Assert.Equal() Failure: Values differ
Expected: 1600094603
Actual:   1600094604

So this might need to be fixed there, as well? I can make this test CoreCLR-only for now.

SPMI is failing because it doesn't have any collections to run.

@tannergooding
Copy link
Member

I can make this test CoreCLR-only for now.

This is the right thing for .NET 9; Mono has a tracking issue to make the behavior changes and standardize to the same implementation as RyuJIT: #100368

@amanasifkhalid
Copy link
Member Author

This is the right thing for .NET 9; Mono has a tracking issue to make the behavior changes and standardize to the same implementation as RyuJIT: #100368

Got it. I'll let this CI run finish, and then push a change to disable it on Mono.

@amanasifkhalid
Copy link
Member Author

Looks like the test is running on non-ARM64 legs, too -- probably because I forgot <RequiresProcessIsolation>, which is needed for <CLRTestTargetUnsupported>.

@amanasifkhalid
Copy link
Member Author

@dotnet/jit-contrib PTAL, the new test is passing.

Comment on lines 2275 to 2281
#ifdef TARGET_ARM64
// ARM64 supports casting directly to float
return (float)u64;
#else // !TARGET_ARM64
double d = convertUInt64ToDouble(u64);
return (float)d;
#endif // !TARGET_ARM64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure this bug exists on all platforms, not just Arm64.

The code in the test is

ulong vr10 = 16105307123914158031UL;
float vr11 = 4294967295U | vr10;

This boils down to 4294967295U | 16105307123914158031UL which is 16105307124325679103

The correct float result is then 16105306574569865216.0f, or rather the raw bits 1600094603 (the DEBUG output).

The result produced by the two step conversion, however, is 16105307674081492992.0f, or rather the raw bits 1600094604 (the RELEASE output).

I would expect that all our current compilers (MSVC, Clang, and GCC) are producing correct results for return (float)u64 and we no longer need to do a two step conversion. If any are still producing incorrect results, we should log bugs against them and we'd need to hand roll an implementation that does the conversion instead (I can give reference to one if we need it, but I don't think we do).

Copy link
Member Author

@amanasifkhalid amanasifkhalid Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out. I tried both the one- and two-step conversion on x64, and in Debug and Release, I get 1600094604. So it looks like the MSVC/Clang/GCC codegen on the const-folding path is matching what RyuJIT currently emits in Debug for x64. For this test case, is the definition of "correct" architecture-dependent? I'm guessing there's some subtle difference in rounding behavior between ARM64 and x64?

Regardless of the answer to that, I think you're right that we can update the cast logic for all platforms. I can do that in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried both the one- and two-step conversion on x64, and in Debug and Release, I get 1600094604

Are you sure? I don't see the same behavior in godbolt: https://godbolt.org/z/s5MrTWaYM
-- note that m1 and m2 return the same result (raw bits 0x5f5f818b) while m3 returns one higher (raw bits 0x5f5f818c)

For this test case, is the definition of "correct" architecture-dependent?

No, we should be deterministic across all platforms for such a case; and correspondingly the added test should be enabled for all platforms as well, not just be Arm64 specific

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out the ulong -> float cast on x64 is represented in IR as ulong -> double -> float, while on ARM64, we combine it into one cast during morph. I'm guessing we have to enable that path for all platforms, too.

Copy link
Member

@jakobbotsch jakobbotsch Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the importer is creating ulong -> double -> float for ulong -> float conversions in IL? That sounds wrong given that those are not equivalent (as the example shows). If morph is making that transformation in the opposite direction it likewise sounds wrong.

Copy link
Member

@tannergooding tannergooding Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roslyn is likely relying on the JIT implementation for the constant folding here, not realizing it's incorrect.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree the helper has downsides too. But Roslyn could avoid using it when not targeting something with it available. For example, I imagine it could emit the equivalent of

ulong x = ...;
float f = x > long.MaxValue ? (float)(-x) + (float)0x8000000000000000 : (float)(long)x;

in those cases (or whatever the right way is to do the conversion manually). Signed long -> float conversion is representable with just conv.r4, I believe.

I don't think we need to try to fix this in .NET 9, but we may want to unify the debug/release behavior on something. We should probably open an issue for more discussion about the problem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to try to fix this in .NET 9, but we may want to unify the debug/release behavior on something. We should probably open an issue for more discussion about the problem.

I've opened #106646 to track this. I think ARM64 was the only platform where Debug/Release behavior could diverge, due to const-folding always doing a two-step cast and ucvtf/scvtf being able to encode ulong/long -> float casts. Should we take this PR as-is for .NET 9?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect this also repros on AVX512 capable hardware for x64.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I can repro it on my AVX512 machine, though this PR's changes seem to fix it. Here's the Debug codegen:

G_M27646_IG01:  ;; offset=0x0000
       push     rbp
       sub      rsp, 48
       lea      rbp, [rsp+0x30]
       xor      eax, eax
       mov      qword ptr [rbp-0x08], rax
       mov      dword ptr [rbp-0x0C], eax
                                                ;; size=19 bbWeight=1 PerfScore 4.00
G_M27646_IG02:  ;; offset=0x0013
       cmp      dword ptr [(reloc 0x7ffeb59917c0)], 0
       je       SHORT G_M27646_IG04
                                                ;; size=9 bbWeight=1 PerfScore 4.00
G_M27646_IG03:  ;; offset=0x001C
       call     CORINFO_HELP_DBG_IS_JUST_MY_CODE
                                                ;; size=5 bbWeight=0.50 PerfScore 0.50
G_M27646_IG04:  ;; offset=0x0021
       nop
       mov      rax, 0xDF818B7FE778AFCF
       mov      qword ptr [rbp-0x08], rax
       mov      eax, -1
       mov      eax, eax
       or       rax, qword ptr [rbp-0x08]
       vcvtusi2ss xmm0, rax
       vmovss   dword ptr [rbp-0x0C], xmm0
       vmovss   xmm0, dword ptr [rbp-0x0C]
       call     [System.BitConverter:SingleToUInt32Bits(float):uint]
       mov      dword ptr [rbp-0x10], eax
       mov      ecx, dword ptr [rbp-0x10]
       call     [System.Console:WriteLine(uint)]
       nop
       nop
                                                ;; size=62 bbWeight=1 PerfScore 22.50
G_M27646_IG05:  ;; offset=0x005F
       add      rsp, 48
       pop      rbp
       ret
                                                ;; size=6 bbWeight=1 PerfScore 1.75

And here's Release:

G_M27646_IG01:  ;; offset=0x0000
       sub      rsp, 40
                                                ;; size=4 bbWeight=1 PerfScore 0.25
G_M27646_IG02:  ;; offset=0x0004
       mov      ecx, 0x5F5F818B
       call     [System.Console:WriteLine(uint)]
       nop
                                                ;; size=12 bbWeight=1 PerfScore 3.50
G_M27646_IG03:  ;; offset=0x0010
       add      rsp, 40
       ret
                                                ;; size=5 bbWeight=1 PerfScore 1.25

Both now output 1600094603. I can try tweaking the test to only run on ARM64, or on x64 if Avx512.IsSupported is true.

@amanasifkhalid amanasifkhalid changed the title JIT: Cast UInt64 to Single directly on ARM64 during const folding JIT: Cast UInt64 to Single directly during const folding Aug 15, 2024
Comment on lines 20 to 27
bool runTest = (RuntimeInformation.ProcessArchitecture == Architecture.Arm64) || Avx512F.IsSupported;

if (runTest)
{
ulong vr10 = 16105307123914158031UL;
float vr11 = 4294967295U | vr10;
Assert.Equal(1600094603U, BitConverter.SingleToUInt32Bits(vr11));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the test rather assert that its producing 1600094604u if (RuntimeInformation.ProcessArchitecture == Architecture.Arm64) || Avx512F.IsSupported is false?

That way we can detect any changes for other platforms or scenarios?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea; I'll update it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just remembered that we don't build this test if the target arch isn't arm64/x64 (or if the runtime isn't CoreCLR). @tannergooding are you ok with only testing those platforms for now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that’s fine, but ideally we’d ensure this runs everywhere long term

Comment on lines 6 to 7
<CLRTestTargetUnsupported Condition="'$(TargetArchitecture)' != 'arm64' AND '$(TargetArchitecture)' != 'x64'">true</CLRTestTargetUnsupported>
<CLRTestTargetUnsupported Condition="'$(RuntimeFlavor)' != 'coreclr'">true</CLRTestTargetUnsupported>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only need is we skip this on Mono today, right?
Is that because they're always doing the ulong -> double -> float 2-step behavior?

Can we not instead use [SkipOnMono("https://github.com/dotnet/runtime/issues/#######", TestPlatforms.Any)] and ensure that the test is compiled for all platforms?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Mono seems to also do the two-step cast, so the test was failing across all Mono legs.

Can we not instead use [SkipOnMono("https://github.com/dotnet/runtime/issues/#######", TestPlatforms.Any)] and ensure that the test is compiled for all platforms?

Sure, I'll update it.

@amanasifkhalid
Copy link
Member Author

On x86, we're pretty explicit about keeping the cast two steps in morph: We represent the cast with a helper call to convert the ulong to double, followed by a cast to float. The test's condition needs to explicitly check if the target is an x64 machine with AVX-512 so we don't accidentally go down this path on x86.

@amanasifkhalid
Copy link
Member Author

amanasifkhalid commented Aug 20, 2024

Test is now passing on all CoreCLR Pri0 legs.

@amanasifkhalid amanasifkhalid merged commit 7a50b43 into dotnet:main Aug 20, 2024
105 of 114 checks passed
@amanasifkhalid amanasifkhalid deleted the fp-cast-arm64 branch August 20, 2024 18:51
@amanasifkhalid
Copy link
Member Author

/backport to release/9.0

Copy link
Contributor

Started backporting to release/9.0: https://github.com/dotnet/runtime/actions/runs/10477490137

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JIT: Bad codegen with ulong -> float conversion on arm64
3 participants