Contextualizing and Concreting

Introduction

The onslaught of terms defined at the beginning of the GPU programming journey can get a bit dull.

We will ground ourselves in the main pieces of hardware and how they change during a breakneck history recap of the heart of the GPU, the Streaming Multiprocessor (SM). Much of the traditional machinery in these diagrams is thrown out. For a more comprehensive understanding, see the references.

Context

Up to 2006, graphics card designs separated tasks into hardware sections, each handling a disjoint part of the video rendering process. Nvidia engineers had to guess how much of a GPU die to allocate to each section.

This meant predicting what developers would create with software and in turn the demand placed on each section of the hardware.

This inflexibility, alongside the guesswork, led to an overhaul of the design.

Tesla - 2006

[Diagram: Tesla SM layout: Warp Scheduler, Register File, Core x8, SFU x2, Shared Memory]

Streaming Multiprocessors replaced all of these units, and their advent brought a new tradeoff:

In contrast to CPUs, GPUs elect to improve computation speed and accept memory overhead as a tradeoff. In other words, the bottleneck is usually moving information around rather than working with that information. Cleverly, threads in need of memory make their requests in parallel, greatly reducing the overhead; this is known as "saturating the memory bus".
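A minimal sketch of what this looks like from a thread's point of view, using the kernel syntax introduced later in this piece (the kernel name, array, and bounds check are illustrative assumptions, not from the original): consecutive threads touch consecutive addresses, so their parallel requests can be coalesced into wide memory transactions.

__global__ void scale(float *data, int n) {
    // Each thread owns one element; neighbouring threads hit neighbouring floats,
    // so the memory system can service many requests per transaction.
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        data[i] = 2.0f * data[i];
    }
}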

Fermi, Kepler, Maxwell, Pascal - 2010 - 2016

Fermi   (2010): 2 warp schedulers, Core x32
Kepler  (2012): 4 warp schedulers, Core x192
Maxwell (2014): 4 warp schedulers, Core x128
Pascal  (2016): 4 warp schedulers, Core x128

In 2010, Moore's Law allowed more cores to be packed onto Fermi. The SMs gained dual warp schedulers, letting us issue requests and "saturate" faster.

In 2012, Kepler prioritized energy efficiency. Core clock speeds were lowered, and transistors became smaller once more, helping compensate. Four warp schedulers kept our threads even busier.

In 2014, Maxwell reduced the number of cores and made simplifying the design the focus, primarily for energy savings.

In 2016, Pascal introduced still smaller transistors, and Nvidia shrank the size of the SM as a whole.

As we've seen (and continue to see), GPUs improved as clock speeds and core density increased. Intuitively, the faster you can cycle (clock frequency) and the more things that are cycling (cores), the more computations you can perform.

GFLOPS (giga, i.e. billions of, floating-point operations per second) encapsulates this idea well.
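As a back-of-the-envelope sketch (the core count, clock speed, and FLOPs-per-cycle figures below are illustrative assumptions, not specs of any particular GPU):

    peak GFLOPS ≈ cores × clock (GHz) × FLOPs per core per cycle
    e.g. 2048 cores × 1.5 GHz × 2 (one fused multiply-add per cycle) ≈ 6144 GFLOPS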

Volta, Turing - 2017, 2018

[Diagram: Volta/Turing SM layout: Warp Schedulers x4, Register File, CUDA Cores x64, Tensor Cores x8, Shared Memory]

Specialization emerged once more in the form of tensor cores, motivated by the increasing demands of ML workloads.

Ampere - 2020

Ampere introduced third-generation tensor cores with fine-grained structured sparsity support. The emphasis on improved performance for sparse matrices offers a nice bridge between deep learning and GPU programming concepts.

Thoughts

Loosely, reductions in transistor size allowed more cores per SM while designs bounced between optimizing for compute and for energy usage. As this hit a stagnation point and the computation demanded by ML/deep learning workloads continued to grow, specialization emerged once more. Recent advances continue to push the frontier, improving compute tuned to particular uses such as the transformer architecture.

Bringing Things Together

We'll now begin bridging ourselves to software through the CUDA programming model [2].

CUDA is compiled to PTX and eventually the hardware assembly language, SASS.

Blocks are specified in such a way that communication between them is limited, allowing them to be hardware agnostic and to be distributed differently depending on the number of SMs at our disposal (see the small query sketch after the diagram below).

[Diagram: a program's 8 blocks (0-7) mapped onto different hardware. On a GPU with 2 SMs, SM 0 runs blocks 0, 1, 4, 5 and SM 1 runs blocks 2, 3, 6, 7. On a GPU with 4 SMs, SM 0 runs blocks 0 and 4, SM 1 runs 1 and 5, SM 2 runs 2 and 6, SM 3 runs 3 and 7.]
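As a small host-side sketch of why this matters (the printout wording is my own; the query itself is the standard CUDA runtime call), you can ask the current device how many SMs it has. The same 8-block launch from the diagram simply spreads across however many are reported:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);              // properties of device 0
    printf("SMs available: %d\n", prop.multiProcessorCount);
    // An 8-block kernel launch is distributed across however many SMs
    // this reports, without any change to the kernel code itself.
    return 0;
}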

If you were to burn any image into your head from this writing, it'd be the following:

[Diagram: the software-to-hardware mapping. A Thread runs on a Core; a Thread Block, with its Shared Memory, maps to a Streaming Multiprocessor, with its L1 Cache; a Kernel Grid of blocks, backed by Global Memory, maps to the GPU of SMs, backed by GPU RAM.]

In hardware terms, we iteratively send our information from CPU to GPU to be handled in parallel by our cores. In software terms, we allocate memory on our host and device, then launch kernel instances that run in parallel across our threads.

Each thread block is assigned to a single SM, and if more blocks are launched than can run at once, the remainder wait in a queue.
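To make the Thread Block / Shared Memory pairing in the diagram above more tangible, here is a minimal sketch (the block-level reversal, the kernel name, and the fixed block size of 8 are my own illustrative assumptions): threads in the same block stage values in shared memory, synchronize, and then read each other's entries, something threads in different blocks cannot do.

__global__ void reverse_within_block(float *x) {
    __shared__ float tile[8];               // per-block scratch space; assumes 8 threads per block
    int t    = threadIdx.x;
    int base = blockDim.x * blockIdx.x;     // start of this block's chunk

    tile[t] = x[base + t];                  // each thread stages its own element
    __syncthreads();                        // wait until the whole block has written

    x[base + t] = tile[blockDim.x - 1 - t]; // read a neighbour's staged value
}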

Concreting Things

kernel<<<gridDim, blockDim>>>(args) is the simplest syntax for a kernel launch, where gridDim is the number of blocks referenced earlier and blockDim is the number of threads per block [3].

Kernel functions are written from the point of view of a single thread, as this is the action that will be performed by every thread. Within our kernel function we get access to built-in variables (threadIdx, blockIdx, blockDim, gridDim) encapsulating where we are within the kernel grid.

Host Code
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void kernel(float *x);   // defined in the device code below

int main(int argc, char **argv) {
    float *h_x, *d_x;
    int   nblocks=2, nthreads=8, nsize=2*8;

    // Allocate memory on the host and the device
    h_x = (float *)malloc(nsize*sizeof(float));
    cudaMalloc((void **)&d_x, nsize*sizeof(float));

    // Kernel launch: nblocks blocks of nthreads threads each
    kernel<<<nblocks, nthreads>>>(d_x);

    // Copy results back from device to host
    cudaMemcpy(h_x, d_x, nsize*sizeof(float),
               cudaMemcpyDeviceToHost);

    // Clean up
    cudaFree(d_x); free(h_x);
    return 0;
}
Kernel/Device Code
__global__ void kernel(float *x) {
    // Find "where we are": the global thread index across all blocks
    int tid = threadIdx.x + blockDim.x*blockIdx.x;

    // Perform this thread's operation: record its index within the block
    x[tid] = (float) threadIdx.x;
}
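With the launch parameters above (nblocks=2, nthreads=8), each block's threads write their own threadIdx.x, so after the copy h_x holds 0 through 7 followed by 0 through 7 again.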
Information from 2020 onwards related to deep learning and tensors will be saved for another time.

References
[1] - History of SM
[2] - Modal GPU Glossary
[3] - Oxford Lecture Notes 1