The onslaught of terms defined at the beginning of the GPU programming journey can get a bit dull.
We will ground ourselves in the main pieces of hardware and how they change during our breakneck history recap of the heart of the GPU, the Streaming Multiprocessor (SM). Much of the traditional machinery in these diagrams is thrown out. For a more comprehensive understanding, see the references.
Up to 2006, graphics card designs separated tasks into hardware sections representing disjoint parts of the video rendering process. Nvidia engineers had to guess how much of a GPU die to allocate to each section.
This meant predicting what developers would create with software and in turn the demand placed on each section of the hardware.
This inflexibility, alongside the guesswork, led to an overhaul of the design.
Streaming Multiprocessors replaced all of these units, and their advent brought:
In contrast to CPUs, GPUs elect to improve computation speed and accept memory overhead as a tradeoff. In other words, the bottleneck is usually moving information around rather than working with that information. Cleverly, threads in need of memory make their requests in parallel, greatly reducing the overhead; this is known as "saturating the memory bus".
In 2010, Moore's Law allowed more cores to be packed onto Fermi. The SMs gained dual warp schedulers, allowing us to issue requests and "saturate" the bus faster.
In 2012, Kepler prioritized energy efficiency. Core clock speeds were lowered, and transistors became smaller once more to help compensate. Four warp schedulers kept our threads even busier.
In 2014, Maxwell reduced the number of cores and made a simplified design its focus, primarily for energy savings.
In 2016, Pascal introduced still smaller transistors, and Nvidia shrank the size of the SM as a whole.
As we've seen (and continue to see), GPUs improved as throughput alongside core density increased. Intuitively, the faster you can cycle/iterate (clock frequency) and the more things that are cycling (cores), the more computations you can perform.
GFLOPS (giga floating-point operations per second) encapsulates this idea well.
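To make this concrete, here is a rough sketch of the arithmetic. The core count, clock speed, and FLOPs-per-cycle figures are hypothetical, chosen only for illustration rather than taken from any particular GPU.

#include <stdio.h>

int main(void)
{
    // Hypothetical figures, for illustration only
    double cores = 3584.0;          // CUDA cores across all SMs
    double clock_ghz = 1.5;         // core clock frequency in GHz
    double flops_per_cycle = 2.0;   // e.g. one fused multiply-add counts as 2 FLOPs

    // Peak GFLOPS ~= cores x clock (GHz) x FLOPs per core per cycle
    double gflops = cores * clock_ghz * flops_per_cycle;
    printf("Theoretical peak: %.0f GFLOPS\n", gflops);   // prints 10752
    return 0;
}

More cores or a faster clock raises this ceiling; real kernels only approach it when the memory bus is kept saturated, as described above.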
Specialization emerged once more in the form of tensor cores, motivated by the increasing demands of ML workloads.
Ampere introduced third-generation tensor cores with fine-grained structured sparsity support. The emphasis placed on improving the performance of sparse matrices offers a nice bridge between deep learning and GPU programming concepts.
Loosely, reductions in transistor size allowed more cores per SM while designs bounced between optimizing for compute and for energy usage. As this hit a stagnation point and the computation demanded by ML/deep learning workloads continued to grow, specialization emerged once more. Recent advances continue to push the frontier, improving compute tuned to particular uses such as the transformer architecture.
We'll now begin bridging ourselves to software through the CUDA programming model2.
CUDA is compiled to PTX and eventually the hardware assembly language, SASS.
Blocks are specified in such a way that communication between them is limited, allowing them to be hardware-agnostic and to be distributed differently depending on the number of SMs at our disposal.
If you were to burn any image into your head from this writing it'd be the following:
In hardware terms, we iteratively send our information from the CPU to the GPU to be handled in parallel by our cores. In software terms, we allocate memory on our host and device, then launch kernel instances to be run in parallel on our threads.
Each thread block is computed on a single SM, and if more blocks are launched than can be handled at once, the rest are placed in a queue.
kernel<<<gridDim, blockDim>>>(args)

represents the simplest syntax for a kernel launch, where gridDim is the number of blocks referenced earlier and blockDim is the number of threads per block3.
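As a concrete sketch of how these two numbers are usually chosen, suppose we want one thread per element of an array of n floats. The names n, threadsPerBlock, and numBlocks below are illustrative: threadsPerBlock plays the role of blockDim, numBlocks plays the role of gridDim, and d_x is assumed to be a device array of at least n floats (allocated as in the host code further below).

int n = 1 << 20;                                              // number of elements (illustrative)
int threadsPerBlock = 256;                                    // our blockDim
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // our gridDim: ceil(n / threadsPerBlock)
kernel<<<numBlocks, threadsPerBlock>>>(d_x);                  // covers every element, possibly with spare threads

Since numBlocks * threadsPerBlock can exceed n, the spare threads in the last block need a bounds check; a sketch of that guard follows the kernel code below.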
Kernel functions are written from the point of view of a single thread, as this is the work that every thread will perform. Within our kernel function we get access to a number of built-in variables encapsulating where we are within our kernel grid.
int main(int argc, char **argv)
{
    float *h_x, *d_x;
    int nblocks = 2, nthreads = 8, nsize = 2*8;

    // Allocating memory on the host and the device
    h_x = (float *)malloc(nsize*sizeof(float));
    cudaMalloc((void **)&d_x, nsize*sizeof(float));

    // Kernel launch: nblocks blocks of nthreads threads each
    kernel<<<nblocks, nthreads>>>(d_x);

    // Copy results back from device to host
    cudaMemcpy(h_x, d_x, nsize*sizeof(float), cudaMemcpyDeviceToHost);

    // Clean up
    cudaFree(d_x);
    free(h_x);
}
#include <helper_cuda.h>

__global__ void kernel(float *x)
{
    // Find "where we are"
    int tid = threadIdx.x + blockDim.x*blockIdx.x;

    // Perform thread's operation
    x[tid] = (float) threadIdx.x;
}
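When the launch rounds the block count up, as in the sketch above, a guard keeps the spare threads from writing past the end of the array. Here is a minimal variant, assuming the element count n is passed in as an extra argument (the name kernel_guarded is hypothetical):

__global__ void kernel_guarded(float *x, int n)
{
    int tid = threadIdx.x + blockDim.x*blockIdx.x;

    // Threads whose index falls past the end of the array do nothing
    if (tid < n) {
        x[tid] = (float) threadIdx.x;
    }
}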