GPU optimizations #481
-
Adding here, with Marco in the room: this is all float32, on an A6000.
-
I see your PR #488 includes 3D. When you get a chance, it would be great to see the 3D version of the above perf comparison (before/after optimization).
-
In 3D, SM becomes faster than GM. According to the profiler, in 3D the limiting factor is shared memory, leading to a 33% max theoretical occupancy. However, if I reduce the bin size, the theoretical occupancy goes up to 50%, but the achieved occupancy drops from 33% to 23%, leading to an overall slowdown.
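A quick way to see how the shared-memory footprint caps occupancy, without a full profile, is the CUDA occupancy API. A minimal sketch, with a placeholder kernel and illustrative resource numbers rather than the real spreader:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder standing in for the real 3D SM spreader kernel: only its
// resource footprint (threads/block, dynamic shared memory) matters here.
__global__ void spread3d_sm() {}

int main() {
    // Illustrative assumptions, not cuFINUFFT's actual configuration:
    const int threads = 256;        // threads per block
    const size_t smem = 48 * 1024;  // dynamic shared memory per block (bytes)

    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, spread3d_sm,
                                                  threads, smem);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Theoretical occupancy = resident threads / max resident threads per SM.
    double occ = 100.0 * blocks_per_sm * threads / prop.maxThreadsPerMultiProcessor;
    printf("blocks/SM = %d, theoretical occupancy = %.1f%%\n", blocks_per_sm, occ);
    return 0;
}
```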
-
I updated the Horner coefficients in CUDA; there is not much difference: [benchmark table attached as image]
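For context, Horner evaluation of the piecewise-polynomial kernel approximation looks roughly like this (a sketch; the coefficient layout and degree are illustrative, not cuFINUFFT's actual tables):

```cuda
// Horner evaluation of a degree-(NC-1) polynomial kernel approximation.
// Coefficients are stored highest-degree-first; NC and the coefficient
// source are illustrative, not cuFINUFFT's actual tables.
template <typename T, int NC>
__device__ __forceinline__ T eval_kernel_horner(const T *c, T x) {
    T v = c[0];
#pragma unroll
    for (int i = 1; i < NC; ++i)
        v = v * x + c[i];  // each step compiles to a single fused multiply-add
    return v;
}
```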
-
Testing on the H100 (benchmark tables attached as images):
- Float 1D
- Float 2D
- Float 3D
- Double 3D (master failed due to shared memory)
- Double 2D
- Double 1D

EDIT: updated the results with more reliable statistics.
-
Hi Diamon, sorry for the late reply. These are great performance improvements, congrats!
Could it be an occupancy issue? E.g., before, we could fit 2 thread blocks per SM, and due to the increased shared-memory usage it drops to 1 thread block. PS: Nsight Compute is usually useful for such analysis; e.g., it reports theoretical/achieved occupancy and also cache hit rates.
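To make that arithmetic concrete (the byte counts below are illustrative assumptions, not measurements):

```cuda
// Back-of-the-envelope: blocks resident per SM is capped by the ratio of the
// SM's shared-memory budget to each block's shared-memory usage.
constexpr int smem_per_sm    = 100 * 1024; // assumed usable shared memory per SM
constexpr int smem_48k_block = 48 * 1024;  // smaller per-block usage -> 2 blocks/SM
constexpr int smem_64k_block = 64 * 1024;  // larger per-block usage -> 1 block/SM
static_assert(smem_per_sm / smem_48k_block == 2, "two blocks fit");
static_assert(smem_per_sm / smem_64k_block == 1, "growth halves residency");
```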
-
Hi everyone, I am trying to find the best parameters (bin sizes and maxsubprobsizes) for my problem parameters on a GH200 (please see my use case of cuFINUFFT here: #398 (comment)). Here is how the current timings look from the perftest in the repo (I am using FINUFFT v2.3.0): [host compiler details and timing tables attached as images]. With experiments I found […] for my parameters.
For the 3D type 1 transform with my relevant problem parameters, as suggested by @MelodyShih in the post above, here is the Nsight Compute profile of the spread operation: [profile attached as image]. For type 2: [profile attached as image].
Now, I am not an expert in profiling, but the roofline analysis shows the type 1 transform (with SM) is using 2% of fp64 peak performance and the type 2 transform (with SM) is using 16% of fp64 peak performance. I am wondering if these could be improved without major changes to the algorithm itself. For these runs I used the default bin sizes and maxsubprobsize (which, if I am not wrong, are set to 16x16x2 and 1024 based on heuristics). Is there a way I can tune these to my problem parameters, or do I just have to do trial and error? I would be happy to take any other general suggestions you may have. If you need any other information, please let me know. Thanks in advance.
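For reference, these knobs are exposed through cufinufft_opts; a minimal sketch of setting them for a manual sweep, assuming the FINUFFT v2.3 C API (field names per the cuFINUFFT docs; the values are just starting points, not recommendations):

```cuda
#include <cufinufft.h>

// Hedged sketch of a manual tuning sweep; check your installed version's
// header, as signatures changed across FINUFFT releases.
static void set_spread_tuning(cufinufft_opts *opts) {
    cufinufft_default_opts(opts);
    opts->gpu_method         = 2;    // 1 = GM (NU-pt driven), 2 = SM (subproblem)
    opts->gpu_binsizex       = 16;   // bin sizes to sweep, e.g. {8, 16, 32}
    opts->gpu_binsizey       = 16;
    opts->gpu_binsizez       = 2;
    opts->gpu_maxsubprobsize = 1024; // NU pts per subproblem; sweep e.g. 256..4096
    // Pass opts as the last argument of cufinufft_makeplan(...), then time
    // cufinufft_setpts(...) + cufinufft_execute(...) for each combination.
}
```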
-
On this topic, if we can figure out what […]
-
PS: Polanco, in the above Julia code, appears to use SM for a subproblem but with output-grid-point-driven threading, to avoid the atomics needed when NU-point-driven. This requires each thread to check each NU point, but that may be OK since it is local to the subproblem. Good idea.
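The gather-style pattern being described looks roughly like this in CUDA (a heavily simplified 1D sketch; the names, bin-local coordinate frame, and Gaussian placeholder kernel are illustrative assumptions, not Polanco's Julia code or cuFINUFFT's spreader):

```cuda
#include <cmath>

// Output-grid-point-driven ("gather") accumulation inside one subproblem,
// as opposed to NU-point-driven scatter + atomicAdd.
template <typename T>
__global__ void gather_subproblem_1d(const T *__restrict__ nu_x,   // NU coords, bin-local
                                     const T *__restrict__ nu_str, // NU strengths
                                     int m,                        // NU pts in subproblem
                                     T *__restrict__ bin_out,      // padded bin grid
                                     int bin_width, T half_width)
{
    // One thread per output grid point of the padded bin: no two threads
    // ever write the same output location, so no atomics are needed.
    for (int i = threadIdx.x; i < bin_width; i += blockDim.x) {
        T acc = T(0);
        // Each thread checks every NU point of the subproblem; cheap enough
        // because m is bounded by maxsubprobsize.
        for (int j = 0; j < m; ++j) {
            T dx = T(i) - nu_x[j];
            if (fabs(dx) <= half_width)
                acc += nu_str[j] * exp(-dx * dx);  // placeholder, not the ES kernel
        }
        bin_out[i] += acc;
    }
}
```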
-
I spent some time optimizing cuFINUFFT.
Using integer arithmetic where possible, plus a small number of CUDA intrinsics, results in a speedup.
In 1D, I also maximized the use of shared memory (see the sketch below).
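Maximizing dynamic shared memory beyond the default 48 KB per block requires an explicit opt-in on Volta and newer. A sketch of that step, with a placeholder kernel name (not the actual cuFINUFFT symbol) and an illustrative size:

```cuda
#include <cuda_runtime.h>

__global__ void spread1d_sm() {}  // placeholder, not the actual cuFINUFFT symbol

// Opt in to more than the default 48 KB of dynamic shared memory per block;
// this must be set before launching with a larger dynamic request.
void enable_large_smem(size_t smem_bytes) {
    cudaFuncSetAttribute(spread1d_sm,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)smem_bytes);
}
```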
Summary: these tables measure the throughput of the spreader (pts/s): [tables attached as images]
In 2D I also managed to improve performance the same way: [tables attached as images]
3D improves in a similar way: [tables attached as images]
However, in 2D, changing the bin size from 32x32 makes SM slower from tolerance 1e-03 onwards. I am not sure why, as smaller padding relative to the bin size should increase performance. @MelodyShih, do you have any suggestions?
My educated guess: since shared memory and L1 cache share one unified physical memory, using all of it as shared memory leaves no L1 cache, causing the performance degradation. The remaining question is what exactly is in the cache that, when flushed out, reduces performance.
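One way to test that guess is to vary the L1/shared-memory carveout explicitly and re-benchmark. A minimal sketch, with a placeholder kernel name (the carveout is a hint the driver may round to a supported split):

```cuda
#include <cuda_runtime.h>

__global__ void spread2d_sm() {}  // placeholder, not the actual cuFINUFFT symbol

// Request a given percentage of the unified L1/shared array as shared memory,
// leaving the remainder as L1 cache. Sweeping this while re-running the 2D
// benchmark would show whether the missing L1 is what costs the performance.
void set_carveout(int pct_shared /* 0..100, or cudaSharedmemCarveoutMaxL1 */) {
    cudaFuncSetAttribute(spread2d_sm,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         pct_shared);
}
```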