
References

Performance

1.1. Understanding GPU context rolls (2018)
1.2. Optimizing GPU occupancy and resource usage with large thread groups (2017)
1.3. Getting the Most Out of Delta Color Compression (2016)
1.4. Life of triangle (2020), [video], [backup]

GCN

2.1. Advanced Shader Programming on GCN

RDNA

3.1. RDNA Performance Guide
3.2. How mesh shaders are implemented in an AMD driver
3.3. Task shader driver implementation on AMD HW
3.4. What is NGG and shader culling on AMD RDNA GPUs?

Notes

Performance

On AMD GPUs, keeping many wavefronts resident on the same Compute Unit (CU) lets the GPU hide global memory latency. A global memory access takes much longer than a compute operation, so while one wavefront waits on memory, the CU issues instructions from other wavefronts.
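The latency-hiding argument above can be made concrete with a toy model. This is only a sketch under simplifying assumptions, not a description of the real scheduler: every wavefront is assumed to alternate between a fixed number of compute cycles and one memory access with a fixed stall, and the function names and the cycle counts in the usage note are illustrative rather than measured values.

```python
import math


def waves_needed_to_hide_latency(memory_latency_cycles: int,
                                 alu_cycles_per_access: int) -> int:
    """Estimate how many resident wavefronts keep the ALUs busy.

    Toy model (an assumption, not hardware behaviour): each wavefront
    does `alu_cycles_per_access` cycles of compute, then stalls for
    `memory_latency_cycles` on one memory access, and repeats.
    """
    if memory_latency_cycles < 0 or alu_cycles_per_access <= 0:
        raise ValueError("latency must be >= 0 and ALU cycles must be > 0")
    # One wavefront covers its own compute phase; the others must
    # supply enough compute cycles to cover its stall.
    return 1 + math.ceil(memory_latency_cycles / alu_cycles_per_access)


def alu_utilization(resident_waves: int,
                    memory_latency_cycles: int,
                    alu_cycles_per_access: int) -> float:
    """Fraction of cycles with compute work available, capped at 1.0."""
    period = alu_cycles_per_access + memory_latency_cycles
    busy = resident_waves * alu_cycles_per_access
    return min(1.0, busy / period)
```

With an assumed 360-cycle memory stall and 40 compute cycles per access, the model asks for 10 resident wavefronts; with only 5 resident, the ALUs would have work for half of the cycles.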

GCN

GCN devices have both vector (SIMD) units, which maintain separate state for each thread in a wave, and a scalar unit, which holds a single state shared by all threads in the wave. Each SIMD wave is accompanied by one additional scalar thread with its own SGPR file. Because a scalar register holds a single value for the whole wave rather than one value per thread, an SGPR costs 64x less on-chip storage than a VGPR. [1.2]
The hardware can keep 7 contexts (sets of render state) in flight, so draw commands with different render states can execute in parallel; switching to a new context is called a context roll. [1.1]
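The 64x storage figure in the scalar-register note above follows directly from the wave size. The arithmetic can be sketched as below, assuming 64 threads per GCN wave and 32-bit registers; the constant and function names are illustrative.

```python
WAVE_SIZE = 64       # threads per GCN wavefront
REGISTER_BYTES = 4   # assumption: each register holds one 32-bit value


def vgpr_bytes_per_wave(num_vgprs: int) -> int:
    """A vector register stores one value per thread in the wave."""
    return num_vgprs * WAVE_SIZE * REGISTER_BYTES


def sgpr_bytes_per_wave(num_sgprs: int) -> int:
    """A scalar register stores one value for the whole wave."""
    return num_sgprs * REGISTER_BYTES


# Storage ratio for a single register of each kind.
ratio = vgpr_bytes_per_wave(1) // sgpr_bytes_per_wave(1)
```

Under these assumptions one VGPR occupies 256 bytes per wave and one SGPR occupies 4 bytes, so keeping a wave-uniform value (a constant or a loop counter, for example) in an SGPR instead of a VGPR saves 252 bytes of register storage per wave.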

RDNA

NGG (Next Generation Geometry) is flexible enough that the driver can implement every vertex/geometry processing stage with it: vertex, tessellation evaluation, and geometry shaders can all be compiled to NGG "primitive shaders". [3.2]
Task shaders are executed on an async compute queue. Because that is a different HW queue from the one running the rest of the pipeline, there is some overhead. [3.3]