Getting peak Binary32 flop/s on CDNA2 using float2 #3447

etiennemlb · 2024-04-12T18:21:02Z

The CDNA2 whitepaper mentions using packed floats to fill a whole "lane" instead of wasting half of the compute capability. In fact, an MI250X is capable of 23.9 TFlop/s in both double and single precision per GCD, and 47 single-precision TFlop/s per GCD when using packed float2.

Using OpenCL and the -cl-mad-enable flag, one can indeed reach 40+ single-precision TFlop/s per GCD. I just can't get anywhere near this kind of performance out of HIP. In fact, when I use amdclang++ and float2, I end up with a bunch of v_pk_add and v_pk_mul instructions and nothing like v_pk_fma, which drops the throughput to 20 TFlop/s (indeed, we issue twice as many instructions, each doing half the flops of a packed FMA, so we get half of the 40 TFlop/s I should get). I didn't dump the machine code generated by OpenCL.

Any idea how to get packed FMA instructions generated?
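The throughput argument above can be checked with a little arithmetic. The sketch below is plain host C++, not HIP. It assumes figures that are not stated in this thread (110 compute units per GCD, 64 lanes per compute unit, a 1.7 GHz peak clock) and assumes each lane retires one instruction per clock. The names `peak_tflops` and the `k...` constants are made up for illustration.

```cpp
// Assumed figures for one MI250X GCD (not taken from this thread).
constexpr double kComputeUnits = 110.0;
constexpr double kLanesPerCu   = 64.0;
constexpr double kClockGHz     = 1.7;

// Flop performed per lane by one instruction of each kind.
constexpr double kFlopScalarFma      = 2.0;  // v_fma_f32: 1 mul + 1 add
constexpr double kFlopPackedFma      = 4.0;  // v_pk_fma_f32: 2 x (mul + add)
constexpr double kFlopPackedMulOrAdd = 2.0;  // v_pk_mul_f32 or v_pk_add_f32: 2 x 1 op

// Peak TFlop/s if every lane retires one such instruction per clock.
constexpr double peak_tflops(double flop_per_instruction) {
    return kComputeUnits * kLanesPerCu * kClockGHz * flop_per_instruction / 1000.0;
}

// A stream of packed mul/add instructions tops out at the scalar FMA rate,
// which is half of what packed FMA can reach.
static_assert(peak_tflops(kFlopPackedMulOrAdd) == peak_tflops(kFlopScalarFma),
              "packed mul/add matches scalar FMA throughput");
```

Under these assumptions, `peak_tflops(kFlopScalarFma)` comes out just under 24 and `peak_tflops(kFlopPackedFma)` just under 48, which is consistent with the per-GCD figures quoted above and with the observed drop to roughly 20 TFlop/s when only v_pk_mul and v_pk_add are emitted.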

b-sumner · 2024-04-12T19:19:15Z

The compiler will try to form packed operations from arbitrary code and will attempt to form fma when contractions are enabled. But you can raise the likelihood of packed fma by directly calling fma(float2, float2, float2), and of other supported packed operations by using float2 type variables.

etiennemlb · 2024-04-12T19:27:01Z

Thanks, I'll try that fma "intrinsic". Note that if I change my kernel to use float instead of float2 (literally a one-line typedef), then the FMAs do get generated. I hope the fma() function produces them, but for code resembling: { x = y * x + y; y = x * y + x; x = y * x + y; y = x * y + x; } I don't get why clang would not generate FMAs for float2. Cheers
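For reference, the chain from the comment above can be written with an explicit fused multiply-add on each component. This is a minimal host-side C++ sketch, not the reproducer from this issue: `Float2`, `fma2` and `fma_chain` are hypothetical names, and `Float2` is only a stand-in for a two-component float type. In a HIP kernel the functions would presumably need `__device__` and could call `fmaf` instead of `std::fma`. Nothing here guarantees that the compiler emits packed FMA instructions.

```cpp
#include <cmath>

// Hypothetical stand-in for a two-component float type (illustration only).
struct Float2 {
    float x;
    float y;
};

// Component-wise fused multiply-add: a * b + c with one rounding per component.
inline Float2 fma2(Float2 a, Float2 b, Float2 c) {
    return {std::fma(a.x, b.x, c.x), std::fma(a.y, b.y, c.y)};
}

// The dependent chain from the comment, with every step an explicit FMA.
inline void fma_chain(Float2& x, Float2& y) {
    x = fma2(y, x, y);  // x = y * x + y
    y = fma2(x, y, x);  // y = x * y + x
    x = fma2(y, x, y);  // x = y * x + y
    y = fma2(x, y, x);  // y = x * y + x
}
```

Each step depends on the previous one, so the chain measures instruction throughput only when enough independent chains (or wavefronts) are in flight to hide the latency.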

etiennemlb · 2024-04-12T21:34:46Z

I couldn't use fma() on float2. Is there some specific header or flag?

b-sumner · 2024-04-12T21:37:58Z

This is OpenCL, correct? fma(float2, float2, float2) is a standard OpenCL builtin.

etiennemlb · 2024-04-12T21:40:57Z

I get the correct performance using OpenCL (i.e. 40+ fp32 TFlop/s per GCD). The issue is that I don't get that kind of performance from my HIP code. The HIP version ends up as a pile of v_pk_mul/v_pk_add instructions, not FMAs. Cheers

b-sumner · 2024-04-12T22:26:31Z

You could try -ffp-contract=fast. But unfortunately float2 means something other in HIP and CUDA than what it means in OpenCL. So using scalars may be the best approach.
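As background on what contraction changes: fusing a * b + c into one FMA rounds once instead of twice, so the result can differ from a separate multiply and add. That is why the behavior is controlled by a flag. The sketch below is plain host C++ with hypothetical helper names (`mul_then_add`, `fused`) and is not taken from the reproducer.

```cpp
#include <cfloat>
#include <cmath>

// a * b + c with the product rounded to float before the add (no contraction).
// The volatile store keeps the compiler from fusing the two operations.
inline float mul_then_add(float a, float b, float c) {
    volatile float product = a * b;
    return product + c;
}

// a * b + c with a single rounding at the end, as a contracted FMA computes it.
inline float fused(float a, float b, float c) {
    return std::fma(a, b, c);
}
```

With a = b = 1 + FLT_EPSILON and c = -(1 + 2 * FLT_EPSILON), the rounded product equals -c, so `mul_then_add` returns zero, while `fused` keeps the low-order term FLT_EPSILON squared.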

etiennemlb · 2024-04-13T09:04:53Z

AFAIK, -ffp-contract=fast-honor-pragma is the default for HIP. I tried it anyway, without success.

It's a shame, I'm completely limited by the 1 instruction per clock I can push into the pipeline... Not even memory or compute, just front-end stuff.

b-sumner · 2024-04-22T14:10:09Z

@etiennemlb would it be possible for you to provide a minimal HIP application that demonstrates the issue?

etiennemlb · 2024-04-23T08:44:21Z

I'm attaching a reproducer covering several cases of interest.
The assembly produced is also included (generated by hipcc --offload-arch=gfx90a --save-temps -xhip -c fma.hip && cat fma-hip-amdgcn-amd-amdhsa-gfx90a.s | c++filt > gfx90a.s).
I used ROCm 5.7.1 and ROCm 6, and the assembly is the same.

fma.zip

b-sumner · 2024-04-23T15:12:42Z

Thank you. We now have an internal ticket open for this.

kjayapra-amd assigned b-sumner Apr 22, 2024

ppanchad-amd added the Under Investigation label Jun 3, 2024
