My CUDA Programming learning journey.
The basic flow in CUDA programming is as follows (a minimal sketch follows the list):
- Initialize data on the CPU.
- Transfer data from CPU to GPU.
- Launch the necessary kernel executions on the data.
- Transfer data from GPU to CPU.
- Release memory on the CPU and GPU.
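A minimal sketch of this flow, assuming a made-up element-wise kernel named `scale` that simply doubles each value:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: double every element in place.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // 1. Initialize data on the CPU.
    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; i++)
        h_data[i] = (float)i;

    // 2. Transfer data from CPU to GPU.
    float *d_data;
    cudaMalloc((void **)&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // 3. Launch the kernel on the data.
    int block = 256;
    int grid = (n + block - 1) / block;
    scale<<<grid, block>>>(d_data, n);

    // 4. Transfer data from GPU back to CPU.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    // 5. Release memory on the GPU and CPU.
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```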
Term | Meaning |
---|---|
SISD | Single Instruction Single Data |
SIMD | Single Instruction Multiple Data |
MISD | Multiple Instruction Single Data |
MIMD | Multiple Instruction Multiple Data |
SIMT | Single Instruction Multiple Threads |
Term | Meaning |
---|---|
Host | CPU |
Device | GPU |
SM | Streaming Multiprocessor |
- Essentially, the Host runs sequential operations and the Device runs parallel operations.
- The CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors. Refer to thebeardsage/cuda-streaming-multiprocessors for a detailed explanation.
- Multiple thread blocks can execute on the same Streaming Multiprocessor, but a single thread block cannot be split across multiple Streaming Multiprocessors.
- The maximum x, y and z dimensions of a block are 1024, 1024 and 64, subject to x × y × z ≤ 1024, which is the maximum number of threads per block. Blocks can be organized into one-, two- or three-dimensional grids of up to 2^31 − 1, 65,535 and 65,535 blocks in the x, y and z dimensions respectively. Unlike the maximum threads per block, there is no blocks-per-grid limit distinct from the maximum grid dimensions. (A launch configuration example appears in the sketch below.)
- A warp is the basic unit of execution in a CUDA program. A warp is a set of 32 threads within a thread block, and all threads in a warp execute the same instruction. Warps are scheduled for execution by the Streaming Multiprocessor.
- If some threads of a warp execute a different instruction than the other threads of the warp, warp divergence occurs. This can reduce the performance of a CUDA program.
- A lane is a thread's position within its warp. Each lane in a warp is indexed by a number in the range 0 to 31. Lane indices are unique within a warp, but multiple threads in a thread block can have the same lane index. (In a 1D block, laneid = threadIdx.x % 32, as used in the sketch below.)
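A small sketch tying these ideas together: a made-up kernel `show_indices` prints, for each warp, its block, warp index and starting global thread index. The block and grid sizes are illustrative and respect the 1024-threads-per-block limit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void show_indices(void)
{
    // Global thread index for a 1D grid of 1D blocks.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Warp index and lane index within the block (warp size is 32).
    int warp_id = threadIdx.x / 32;
    int lane_id = threadIdx.x % 32;   // laneid = threadIdx.x % 32

    // Only one lane per warp prints, to keep the output short.
    if (lane_id == 0)
        printf("block %d, warp %d starts at global thread %d\n",
               blockIdx.x, warp_id, tid);
}

int main(void)
{
    dim3 block(256);   // x * y * z must not exceed 1024 threads per block
    dim3 grid(4);      // 4 blocks in the x dimension
    show_indices<<<grid, block>>>();
    cudaDeviceSynchronize();   // wait for the kernel so printf output is flushed
    return 0;
}
```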
- Early CUDA programs were designed so that the GPU workload was completely under the control of the host thread. Programs had to perform a sequence of kernel launches, and for best performance each kernel had to expose enough parallelism to use the GPU efficiently.
- CUDA 5.0 introduced Dynamic Parallelism, which makes it possible to launch kernels from threads running on the device; threads can launch more threads. An application can launch a coarse-grained kernel which in turn launches finer-grained kernels to do work where needed. This avoids unwanted computations while capturing all interesting details.
- This reduces the need to transfer control and data between host and GPU device.
- Kernel executions are classified into parent and child grids. A parent grid starts execution and dispatches some workload to a child grid; the parent grid ends execution when its kernel execution is complete. A child grid inherits certain attributes and limits from the parent grid, such as the L1 cache / shared memory configuration and stack size.
- A grid launched by a device thread is visible to all threads in its thread block. Execution of a thread block is not complete until all child grids created by threads in the block are complete.
- Grids launched with dynamic parallelism are fully nested. This means that child grids always complete before the parent grids that launch them, even if there is no explicit synchronization.
Refer to the CUDA Runtime API documentation for details.
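A minimal sketch of a parent grid launching child grids. The kernel names are made up, and building this requires relocatable device code and the device runtime (e.g. `nvcc -rdc=true ... -lcudadevrt`) plus a GPU of compute capability 3.5 or later:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Fine-grained child kernel.
__global__ void child_kernel(int parent_block)
{
    printf("child thread %d launched by parent block %d\n",
           threadIdx.x, parent_block);
}

// Coarse-grained parent kernel: one thread per block launches a child grid.
__global__ void parent_kernel(void)
{
    if (threadIdx.x == 0)
        child_kernel<<<1, 4>>>(blockIdx.x);
    // The parent grid is not considered complete until all of its
    // child grids have completed.
}

int main(void)
{
    parent_kernel<<<2, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```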
- Registers are fast on-chip memories that are used to store operands for the operations executed by the computing cores.
- In general all scalar variables defined in CUDA code are stored in registers.
- Registers are local to a thread, and each thread has exclusive access to its own registers. Values in registers cannot be accessed by other threads, even from the same block, and are not available for the host. Registers are also not permanent, therefore data stored in registers is only available during the execution of a thread.
- Register Spills: If a kernel uses more registers than the hardware limit, the excess registers will spill over to local memory causing performance deterioration.
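One hedged way to work with this limit: `__launch_bounds__` (a standard CUDA qualifier) tells the compiler the intended launch configuration so it can budget registers, and `nvcc --maxrregcount=N` caps registers per thread for a whole compilation unit. The kernel and numbers below are purely illustrative.

```cuda
// Budget registers so that blocks of up to 256 threads can run with at
// least 2 blocks resident per SM. If the kernel needs more registers
// than this budget allows, the excess spills to local memory.
__global__ void __launch_bounds__(256, 2)
saxpy_like(float *out, const float *in, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        // Scalar temporaries such as these normally live in registers.
        float x = in[i];
        float y = a * x + 1.0f;
        out[i] = x * y;
    }
}

// Alternative: compile the whole file with `nvcc --maxrregcount=32 ...`
// to cap register usage per thread.
```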
- Shared memory is on-chip memory partitioned among thread blocks. Its lifetime is the lifetime of the thread block's execution.
- Shared memory is allocated per thread block, so all threads in the block have access to the same shared memory. Threads can access data in shared memory loaded from global memory by other threads within the same thread block.
- It's only useful when data needs to be accessed more than once, either within the same thread or from different threads within the same block.
- Shared memory requests are issued per warp. Shared memory is divided into 32 equally sized memory banks because the size of a warp is 32 threads.
- Shared memory bank access modes: 32-bit and 64-bit.
- Issue: shared memory bank conflicts can occur if accesses are not managed properly, i.e. when multiple threads of a warp access different addresses that fall in the same memory bank (see the padding trick in the sketch below).
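A hedged sketch of a shared-memory tile with one extra column of padding, a common way to side-step bank conflicts. It assumes a square matrix whose side is a multiple of the tile size; the kernel and variable names are made up.

```cuda
#define TILE 32

// Transpose one TILE x TILE tile through shared memory.
// The +1 padding shifts each row into a different bank, so the
// column-wise reads below do not all hit the same bank.
__global__ void transpose_tile(float *out, const float *in, int width)
{
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Coalesced load from global memory into shared memory.
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    // Make sure every thread in the block has finished loading.
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;

    // Coalesced store to global memory; the shared-memory read is column-wise.
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
}
```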
- Variables that cannot be stored in register space are stored in local memory. Per-thread data whose addressing cannot be resolved at compile time (for example, arrays indexed with values known only at runtime) is also stored in local memory.
- Memory can also be statically allocated from within a kernel, and according to the CUDA programming model such memory will not be global but local memory.
- Local memory is visible, and therefore accessible, only to the thread allocating it, so all threads executing a kernel have their own privately allocated local memory (see the sketch below).
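A small illustrative sketch of the distinction. Placement is ultimately the compiler's decision, so treat the comments as the typical outcome, not a guarantee; the kernel name and parameters are made up.

```cuda
__global__ void local_vs_register(float *out, const unsigned int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    // Scalar temporaries: normally kept in registers.
    float acc = 0.0f;

    // Per-thread array indexed with a value not known at compile time:
    // typically placed in thread-private local memory.
    float buffer[8];
    for (int k = 0; k < 8; k++)
        buffer[k] = (float)k * 0.5f;

    acc += buffer[idx[i] % 8];
    out[i] = acc;
}
```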
- Constant memory is used for storing data that will not change over the course of kernel execution. It supports short-latency, high-bandwidth, read-only access by the device when all threads simultaneously access the same location.
- It is best used when all threads in a warp access the same memory location. Its lifetime is the lifetime of the program.
- Access latency to constant memory is considerably lower than to global memory because constant memory is cached, but unlike global memory, constant memory cannot be written to from within the kernel.
- As the device can only read from constant memory, the data must be initialized from the host, i.e. declared at global scope and written before the kernel launch (see the sketch below).
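A hedged sketch of constant memory usage. `__constant__` and `cudaMemcpyToSymbol` are the standard CUDA mechanisms; the polynomial kernel and its coefficient values are made up.

```cuda
#include <cuda_runtime.h>

// Read-only for the device; lifetime is the lifetime of the program.
__constant__ float coeffs[4];

__global__ void apply_poly(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        float x = in[i];
        // Every thread of the warp reads the same constant locations,
        // which is the access pattern constant memory serves best.
        out[i] = coeffs[0] + x * (coeffs[1] + x * (coeffs[2] + x * coeffs[3]));
    }
}

// Host-side initialization (the device cannot write constant memory):
//   float h_coeffs[4] = {1.0f, 0.5f, 0.25f, 0.125f};
//   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
```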
There are other types of memory: Global, Constant, Texture. Refer to the CUDA Memory Model for details.
- CUDA uses DMA to transfer pinned memory to the GPU device. Pageable memory cannot be transferred directly to the device, so it is first copied to pinned (page-locked) memory and then copied to the GPU device.
- Pinned-to-Device transfer is faster than Pageable-to-Device transfer.
- The cudaMallocHost and cudaFreeHost functions can be used to allocate and free pinned memory directly (see the sketch below).
- Refer to the NVIDIA blog for details.
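A hedged sketch of allocating pinned host memory with `cudaMallocHost`, so the host-to-device copy can use DMA directly. The array size is arbitrary.

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const int n = 1 << 22;
    const size_t bytes = n * sizeof(float);

    // Allocate pinned (page-locked) host memory directly, so the
    // transfer skips the extra pageable-to-pinned staging copy.
    float *h_pinned;
    cudaMallocHost((void **)&h_pinned, bytes);
    for (int i = 0; i < n; i++)
        h_pinned[i] = (float)i;

    float *d_data;
    cudaMalloc((void **)&d_data, bytes);

    // Pinned-to-device transfer; pinned memory is also a prerequisite
    // for overlapping copies and kernels with cudaMemcpyAsync.
    cudaMemcpy(d_data, h_pinned, bytes, cudaMemcpyHostToDevice);

    cudaFree(d_data);
    cudaFreeHost(h_pinned);
    return 0;
}
```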
- Refer to the NVIDIA blog.
- Refer to the Medium article.
- The first memory access goes through the L1 cache (termed a normal, cached memory access). When a memory request reaches the L1 cache and misses, the request is sent to the L2 cache. If the L2 cache also misses, the request is sent to DRAM. Memory loads that do not use the L1 cache are referred to as un-cached memory accesses.
- If the L1 cache line is used, memory is served in 128-byte segments.
- If only the L2 cache is used, memory is served in 32-byte segments.
- For memory writes, only the L2 cache is used, and the write is divided into 32-byte segments.
- Refer to the Wikipedia article for an explanation of AoS and SoA.
- In CUDA programming, SoA is preferred over AoS for global memory efficiency. This is because, with SoA, the data accessed by consecutive threads is stored contiguously, so the accesses coalesce and the number of memory transactions is reduced (see the sketch below).
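A hedged illustration of the two layouts; the struct and field names are made up.

```cuda
// Array of Structures (AoS): neighbouring threads read strided memory.
struct PointAoS {
    float x, y, z;
};

// Structure of Arrays (SoA): neighbouring threads read adjacent memory,
// so the accesses of a warp coalesce into fewer transactions.
struct PointsSoA {
    float *x;
    float *y;
    float *z;
};

__global__ void scale_aos(PointAoS *pts, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        pts[i].x *= 2.0f;   // stride of sizeof(PointAoS) between threads
}

__global__ void scale_soa(PointsSoA pts, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        pts.x[i] *= 2.0f;   // consecutive threads touch consecutive floats
}
```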
- Partition camping occurs when global memory accesses are directed through a subset of partitions, causing requests to queue up at some partitions while other partitions go unused.
- Since partition camping concerns how active thread blocks behave, the issue of how thread blocks are scheduled on multiprocessors is important.
- It is a mechanism that lets one thread read another thread's register when both threads are within the same warp. This avoids an explicit copy through global or shared memory.
- This method does not consume extra memory to share/exchange data, and it is much faster than using shared memory (see the sketch below).
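A hedged sketch of a warp-level reduction built on `__shfl_down_sync`. The full mask 0xffffffff assumes all 32 lanes are active at this point; `warp_reduce_sum` and `block_sums` are made-up names.

```cuda
__device__ float warp_reduce_sum(float val)
{
    // Each step pulls a value from the lane `offset` positions higher,
    // halving the number of active partial sums: 16, 8, 4, 2, 1.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // lane 0 ends up holding the sum of the whole warp
}

__global__ void block_sums(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    v = warp_reduce_sum(v);

    // One atomic per warp instead of one per thread.
    if ((threadIdx.x % 32) == 0)
        atomicAdd(out, v);
}
```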
- NVIDIA Nsight Systems
NVIDIA Nsight™ Systems is a system-wide performance analysis tool designed to visualize an application’s algorithms, help you identify the largest opportunities to optimize, and tune to scale efficiently across any quantity or size of CPUs and GPUs, from large servers to our smallest system on a chip (SoC).
- NVIDIA Nsight Compute
NVIDIA® Nsight™ Compute is an interactive kernel profiler for CUDA applications. It provides detailed performance metrics and API debugging via a user interface and command line tool. In addition, its baseline feature allows users to compare results within the tool. Nsight Compute provides a customizable and data-driven user interface and metric collection and can be extended with analysis scripts for post-processing results.
Number | Repository | Description |
---|---|---|
1 | Hello World | Programmer's induction. Hello World from GPU. |
2 | Print ThreadIdx, BlockIdx, GridDim. | |
3 | Addition | Perform addition operation on GPU. |
4 | Add Arrays | Perform addition of three arrays on GPU. |
5 | Global Index | Calculate Global Index for any dimensional grid and any dimensional block. |
6 | Device properties | Print some GPU device properties. |
7a | Reduce Sum with Loop Unroll | Perform reduction sum operation with loop unroll in GPU kernel. |
7b | Reduce Sum with Warp Unroll | Perform reduction sum operation with warp unroll in GPU kernel. Solution for warp divergence. |
7c | Reduce Sum with Complete Unroll | Perform reduction sum operation with completely unrolled loop and using Shared Memory in GPU kernel. |
8 | Coalesced vs. Un-Coalesced memory pattern | TODO |
9 | Matrix Transpose | Perform Matrix transpose in different fashions. 1. Row Major 2. Column Major 3. Unrolled Loop 4. Diagonal Coordinates 5. Shared Memory |
10 | Static and Dynamic Shared Memory | TODO |
11 | Warp Shuffle | Perform Warp Shuffle on 1D data. 1. __shfl_sync 2. __shfl_up_sync 3. __shfl_down_sync 4. __shfl_xor_sync 5. Reduce Sum with Warp Shuffle loop unrolling |
12 | TODO | TODO |
Streaming Multiprocessor | Grid | Thread Block | Thread | Warp | Kernel | __syncthreads | Occupancy | Shared memory | Registers |
---|---|---|---|---|---|---|---|---|---|
Dynamic parallelism | Parallel reduction | Parent | Child | Temporal locality | Spatial locality | Coalesced memory pattern | Un-Coalesced memory pattern | L1 Cache | L2 Cache |
Constant Memory | Warp Shuffle | | | | | | | | |
Happy Learning! 😄