Contextualizing and Concreting

Introduction

The onslaught of terms defined at the beginning of the GPU programming journey can get a bit dull.

We will ground ourselves in the main pieces of hardware and how they change during a breakneck history recap of the heart of the GPU, the Streaming Multiprocessor (SM). Much of the traditional machinery in these diagrams is thrown out. For a more comprehensive understanding, see the references.

Context

Up to 2006, graphics card designs separated tasks into hardware sections, each handling a disjoint part of the video rendering process. Nvidia engineers had to guess how much of a GPU die to allocate to each section.

This meant predicting what developers would create with software and in turn the demand placed on each section of the hardware.

This inflexibility, alongside the guesswork, led to an overhaul of the design.

Tesla - 2006

[Diagram: Tesla SM layout: Warp Scheduler, Register File, Core x8, SFU x2, Shared Memory]

Streaming Multiprocessors replaced all of these units, and their advent brought a new tradeoff:

In contrast to CPUs, GPUs elect to improve computation speed and accept memory overhead as a tradeoff. In other words, the bottleneck is usually moving information around rather than working with that information. Cleverly, threads in need of memory make their requests in parallel, greatly reducing the overhead; this is known as "saturating the memory bus".
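A minimal sketch of what this looks like from a thread's point of view, using the kernel syntax introduced later in this piece (the kernel name, array, and bounds check are illustrative assumptions, not from the original): consecutive threads touch consecutive addresses, so their parallel requests can be coalesced into wide memory transactions.

__global__ void scale(float *data, int n) {
    // Each thread owns one element; neighbouring threads hit neighbouring floats,
    // so the memory system can service many requests per transaction.
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        data[i] = 2.0f * data[i];
    }
}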

Fermi, Kepler, Maxwell, Pascal - 2010 - 2016

Fermi   (2010): 2 warp schedulers, Core x32
Kepler  (2012): 4 warp schedulers, Core x192
Maxwell (2014): 4 warp schedulers, Core x128
Pascal  (2016): 4 warp schedulers, Core x128

In 2010, Moore's Law allowed more cores to be packed onto Fermi. The SMs gained dual warp schedulers, letting us issue requests and "saturate" faster.

In 2012, Kepler prioritized energy efficiency. Core clock speeds were lowered, and transistors became smaller once more, helping compensate. Four warp schedulers kept our threads even busier.

In 2014, Maxwell reduced the number of cores and made simplifying the design the focus, primarily for energy savings.

In 2016, Pascal introduced still smaller transistors, and Nvidia shrank the size of the SM as a whole.

As we've seen (and continue to see), GPUs improved as clock speeds and core density increased. Intuitively, the faster you can cycle (clock frequency) and the more things that are cycling (cores), the more computations you can perform.

GFLOPS (giga, i.e. billions of, floating-point operations per second) encapsulates this idea well.
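As a back-of-the-envelope sketch (the core count, clock speed, and FLOPs-per-cycle figures below are illustrative assumptions, not specs of any particular GPU):

    peak GFLOPS ≈ cores × clock (GHz) × FLOPs per core per cycle
    e.g. 2048 cores × 1.5 GHz × 2 (one fused multiply-add per cycle) ≈ 6144 GFLOPS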

Volta, Turing - 2017, 2018

[Diagram: Volta/Turing SM layout: Warp Schedulers x4, Register File, CUDA Cores x64, Tensor Cores x8, Shared Memory]

Specialization emerged once more in the form of tensor cores, motivated by the increasing demands of ML workloads.

Ampere - 2020

Ampere introduced third-generation tensor cores with fine-grained structured sparsity support. The emphasis on improved performance for sparse matrices offers a nice bridge between deep learning and GPU programming concepts.

Thoughts

Loosely, reductions in transistor size allowed more cores per SM while designs bounced between optimizing for compute and for energy usage. As this hit a stagnation point and the computation demanded by ML/deep learning workloads continued to grow, specialization emerged once more. Recent advances continue to push the frontier, improving compute tuned to particular uses such as the transformer architecture.

Bringing Things Together

We'll now begin bridging ourselves to software through the CUDA programming model [2].

CUDA is compiled to PTX and eventually the hardware assembly language, SASS.

Blocks are specified in such a way that communication between them is limited, allowing them to be hardware agnostic and to be distributed differently depending on the number of SMs at our disposal (see the small query sketch after the diagram below).

[Diagram: a program's 8 blocks (0-7) mapped onto different hardware. On a GPU with 2 SMs, SM 0 runs blocks 0, 1, 4, 5 and SM 1 runs blocks 2, 3, 6, 7. On a GPU with 4 SMs, SM 0 runs blocks 0 and 4, SM 1 runs 1 and 5, SM 2 runs 2 and 6, SM 3 runs 3 and 7.]
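As a small host-side sketch of why this matters (the printout wording is my own; the query itself is the standard CUDA runtime call), you can ask the current device how many SMs it has. The same 8-block launch from the diagram simply spreads across however many are reported:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);              // properties of device 0
    printf("SMs available: %d\n", prop.multiProcessorCount);
    // An 8-block kernel launch is distributed across however many SMs
    // this reports, without any change to the kernel code itself.
    return 0;
}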

If you were to burn any image into your head from this writing, it'd be the following:

[Diagram: the software-to-hardware mapping. A Thread runs on a Core; a Thread Block, with its Shared Memory, maps to a Streaming Multiprocessor, with its L1 Cache; a Kernel Grid of blocks, backed by Global Memory, maps to the GPU of SMs, backed by GPU RAM.]

In hardware terms, we iteratively send our information from CPU to GPU to be handled in parallel by our cores. In software terms, we allocate memory on our host and device, then launch kernel instances that run in parallel across our threads.

Each thread block is assigned to a single SM, and if more blocks are launched than can run at once, the remainder wait in a queue.
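To make the Thread Block / Shared Memory pairing in the diagram above more tangible, here is a minimal sketch (the block-level reversal, the kernel name, and the fixed block size of 8 are my own illustrative assumptions): threads in the same block stage values in shared memory, synchronize, and then read each other's entries, something threads in different blocks cannot do.

__global__ void reverse_within_block(float *x) {
    __shared__ float tile[8];               // per-block scratch space; assumes 8 threads per block
    int t    = threadIdx.x;
    int base = blockDim.x * blockIdx.x;     // start of this block's chunk

    tile[t] = x[base + t];                  // each thread stages its own element
    __syncthreads();                        // wait until the whole block has written

    x[base + t] = tile[blockDim.x - 1 - t]; // read a neighbour's staged value
}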

Concreting Things

kernel<<<gridDim, blockDim>>>(args) is the simplest syntax for a kernel launch, where gridDim is the number of blocks referenced earlier and blockDim is the number of threads per block [3].

Kernel functions are written from the point of view of a single thread, as this is the action that will be performed by every thread. Within our kernel function we get access to built-in variables (threadIdx, blockIdx, blockDim, gridDim) encapsulating where we are within the kernel grid.

Host Code
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void kernel(float *x);   // defined in the device code below

int main(int argc, char **argv) {
    float *h_x, *d_x;
    int   nblocks=2, nthreads=8, nsize=2*8;

    // Allocate memory on the host and the device
    h_x = (float *)malloc(nsize*sizeof(float));
    cudaMalloc((void **)&d_x, nsize*sizeof(float));

    // Kernel launch: nblocks blocks of nthreads threads each
    kernel<<<nblocks, nthreads>>>(d_x);

    // Copy results back from device to host
    cudaMemcpy(h_x, d_x, nsize*sizeof(float),
               cudaMemcpyDeviceToHost);

    // Clean up
    cudaFree(d_x); free(h_x);
    return 0;
}
Kernel/Device Code
__global__ void kernel(float *x) {
    // Find "where we are": the global thread index across all blocks
    int tid = threadIdx.x + blockDim.x*blockIdx.x;

    // Perform this thread's operation: record its index within the block
    x[tid] = (float) threadIdx.x;
}
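With the launch parameters above (nblocks=2, nthreads=8), each block's threads write their own threadIdx.x, so after the copy h_x holds 0 through 7 followed by 0 through 7 again.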
Information from 2020 onwards related to deep learning and tensors will be saved for another time.

References
[1] - History of SM
[2] - Modal GPU Glossary
[3] - Oxford Lecture Notes 1