GPU optimizations #481
-
Adding here, with Marco in the room: this is all float32, on an A6000.
-
I see your PR #488 includes 3D. When you get a chance, it would be great to see the 3D version of the above perf comparison (before/after optimization).
-
In 3D, SM becomes faster than GM. According to the profiler, in 3D the limiting factor is shared memory, leading to a 33% max theoretical occupancy. However, if I reduce the bin size, the theoretical occupancy goes up to 50%, but the achieved occupancy drops from 33% to 23%, leading to an overall slowdown.
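A quick way to see how the shared-memory footprint caps occupancy, without a full profile, is the CUDA occupancy API. A minimal sketch, with a placeholder kernel and illustrative resource numbers rather than the real spreader:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder standing in for the real 3D SM spreader kernel: only its
// resource footprint (threads/block, dynamic shared memory) matters here.
__global__ void spread3d_sm() {}

int main() {
    // Illustrative assumptions, not cuFINUFFT's actual configuration:
    const int threads = 256;        // threads per block
    const size_t smem = 48 * 1024;  // dynamic shared memory per block (bytes)

    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, spread3d_sm,
                                                  threads, smem);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Theoretical occupancy = resident threads / max resident threads per SM.
    double occ = 100.0 * blocks_per_sm * threads / prop.maxThreadsPerMultiProcessor;
    printf("blocks/SM = %d, theoretical occupancy = %.1f%%\n", blocks_per_sm, occ);
    return 0;
}
```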
-
I updated the Horner coefficients in CUDA; there is not much difference: [benchmark table attached as image]
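For context, Horner evaluation of the piecewise-polynomial kernel approximation looks roughly like this (a sketch; the coefficient layout and degree are illustrative, not cuFINUFFT's actual tables):

```cuda
// Horner evaluation of a degree-(NC-1) polynomial kernel approximation.
// Coefficients are stored highest-degree-first; NC and the coefficient
// source are illustrative, not cuFINUFFT's actual tables.
template <typename T, int NC>
__device__ __forceinline__ T eval_kernel_horner(const T *c, T x) {
    T v = c[0];
#pragma unroll
    for (int i = 1; i < NC; ++i)
        v = v * x + c[i];  // each step compiles to a single fused multiply-add
    return v;
}
```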
-
Testing on the H100 (benchmark tables attached as images):
- Float 1D
- Float 2D
- Float 3D
- Double 3D (master failed due to shared memory)
- Double 2D
- Double 1D

EDIT: updated the results with more reliable statistics.
-
Hi Diamon, sorry for the late reply. These are great performance improvements, congrats!
Could it be an occupancy issue? E.g., before, we could fit 2 thread blocks per SM, and due to the increased shared-memory usage it drops to 1 thread block. PS: Nsight Compute is usually useful for such analysis; e.g., it reports theoretical/achieved occupancy and also cache hit rates.
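To make that arithmetic concrete (the byte counts below are illustrative assumptions, not measurements):

```cuda
// Back-of-the-envelope: blocks resident per SM is capped by the ratio of the
// SM's shared-memory budget to each block's shared-memory usage.
constexpr int smem_per_sm    = 100 * 1024; // assumed usable shared memory per SM
constexpr int smem_48k_block = 48 * 1024;  // smaller per-block usage -> 2 blocks/SM
constexpr int smem_64k_block = 64 * 1024;  // larger per-block usage -> 1 block/SM
static_assert(smem_per_sm / smem_48k_block == 2, "two blocks fit");
static_assert(smem_per_sm / smem_64k_block == 1, "growth halves residency");
```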
-
Hi everyone, I am trying to find the best parameters (bin sizes and maxsubprobsizes) for my problem parameters on a GH200 (please see my use case of cuFINUFFT here: #398 (comment)). Here is how the current timings look from the perftest in the repo (I am using FINUFFT v2.3.0): [host compiler details and timing tables attached as images]. With experiments I found […] for my parameters.
For the 3D type 1 transform with my relevant problem parameters, as suggested by @MelodyShih in the post above, here is the Nsight Compute profile of the spread operation: [profile attached as image]. For type 2: [profile attached as image].
Now, I am not an expert in profiling, but the roofline analysis shows the type 1 transform (with SM) is using 2% of fp64 peak performance and the type 2 transform (with SM) is using 16% of fp64 peak performance. I am wondering if these could be improved without major changes to the algorithm itself. For these runs I used the default bin sizes and maxsubprobsize (which, if I am not wrong, are set to 16x16x2 and 1024 based on heuristics). Is there a way I can tune these to my problem parameters, or do I just have to do trial and error? I would be happy to take any other general suggestions you may have. If you need any other information, please let me know. Thanks in advance.
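For reference, these knobs are exposed through cufinufft_opts; a minimal sketch of setting them for a manual sweep, assuming the FINUFFT v2.3 C API (field names per the cuFINUFFT docs; the values are just starting points, not recommendations):

```cuda
#include <cufinufft.h>

// Hedged sketch of a manual tuning sweep; check your installed version's
// header, as signatures changed across FINUFFT releases.
static void set_spread_tuning(cufinufft_opts *opts) {
    cufinufft_default_opts(opts);
    opts->gpu_method         = 2;    // 1 = GM (NU-pt driven), 2 = SM (subproblem)
    opts->gpu_binsizex       = 16;   // bin sizes to sweep, e.g. {8, 16, 32}
    opts->gpu_binsizey       = 16;
    opts->gpu_binsizez       = 2;
    opts->gpu_maxsubprobsize = 1024; // NU pts per subproblem; sweep e.g. 256..4096
    // Pass opts as the last argument of cufinufft_makeplan(...), then time
    // cufinufft_setpts(...) + cufinufft_execute(...) for each combination.
}
```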
-
On this topic, if we can figure out what […]
-
PS: Polanco, in the above Julia code, appears to use SM for a subproblem but with output-grid-point-driven threading, to avoid the atomics needed when NU-point-driven. This requires each thread to check each NU point, but that may be OK since it is local to the subproblem. Good idea.
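The gather-style pattern being described looks roughly like this in CUDA (a heavily simplified 1D sketch; the names, bin-local coordinate frame, and Gaussian placeholder kernel are illustrative assumptions, not Polanco's Julia code or cuFINUFFT's spreader):

```cuda
#include <cmath>

// Output-grid-point-driven ("gather") accumulation inside one subproblem,
// as opposed to NU-point-driven scatter + atomicAdd.
template <typename T>
__global__ void gather_subproblem_1d(const T *__restrict__ nu_x,   // NU coords, bin-local
                                     const T *__restrict__ nu_str, // NU strengths
                                     int m,                        // NU pts in subproblem
                                     T *__restrict__ bin_out,      // padded bin grid
                                     int bin_width, T half_width)
{
    // One thread per output grid point of the padded bin: no two threads
    // ever write the same output location, so no atomics are needed.
    for (int i = threadIdx.x; i < bin_width; i += blockDim.x) {
        T acc = T(0);
        // Each thread checks every NU point of the subproblem; cheap enough
        // because m is bounded by maxsubprobsize.
        for (int j = 0; j < m; ++j) {
            T dx = T(i) - nu_x[j];
            if (fabs(dx) <= half_width)
                acc += nu_str[j] * exp(-dx * dx);  // placeholder, not the ES kernel
        }
        bin_out[i] += acc;
    }
}
```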
-
I spent some time optimizing cuFINUFFT.
Using integer arithmetic where possible, plus a small number of CUDA intrinsics, results in a speedup.
In 1D, I also maximized the use of shared memory (see the sketch below).
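Maximizing dynamic shared memory beyond the default 48 KB per block requires an explicit opt-in on Volta and newer. A sketch of that step, with a placeholder kernel name (not the actual cuFINUFFT symbol) and an illustrative size:

```cuda
#include <cuda_runtime.h>

__global__ void spread1d_sm() {}  // placeholder, not the actual cuFINUFFT symbol

// Opt in to more than the default 48 KB of dynamic shared memory per block;
// this must be set before launching with a larger dynamic request.
void enable_large_smem(size_t smem_bytes) {
    cudaFuncSetAttribute(spread1d_sm,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)smem_bytes);
}
```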
Summary: these tables measure the throughput of the spreader (pts/s): [tables attached as images]
In 2D I also managed to improve performance the same way: [tables attached as images]
3D improves in a similar way: [tables attached as images]
However, in 2D, changing the bin size from 32x32 makes SM slower from tolerance 1e-03 onwards. I am not sure why, as smaller padding relative to the bin size should increase performance. @MelodyShih, do you have any suggestions?
My educated guess: since shared memory and L1 cache share one unified physical memory, using all of it as shared memory leaves no L1 cache, causing the performance degradation. The remaining question is what exactly is in the cache that, when flushed out, reduces performance.
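One way to test that guess is to vary the L1/shared-memory carveout explicitly and re-benchmark. A minimal sketch, with a placeholder kernel name (the carveout is a hint the driver may round to a supported split):

```cuda
#include <cuda_runtime.h>

__global__ void spread2d_sm() {}  // placeholder, not the actual cuFINUFFT symbol

// Request a given percentage of the unified L1/shared array as shared memory,
// leaving the remainder as L1 cache. Sweeping this while re-running the 2D
// benchmark would show whether the missing L1 is what costs the performance.
void set_carveout(int pct_shared /* 0..100, or cudaSharedmemCarveoutMaxL1 */) {
    cudaFuncSetAttribute(spread2d_sm,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         pct_shared);
}
```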