Optimizing Parallel Reduction

Introduction

This is an adaptation of my notes on "Optimizing Parallel Reduction in CUDA" by Mark Harris (see references).

With reduction we move outside the realm of pointwise operations: the final answer depends on how the array values combine with one another. This poses the question: how do we allow the thread blocks to interact?

Reduction Tree Structure
Each level does a mini-reduction, locally aggregating information:

Values:    10    1    8   -4    0   -2    3    5
Level 1:      11       4       -2       8
Level 2:          15                6
Result:                   21

As we progress down the tree, each level needs half as many operations as the one before, because every partial sum already aggregates the values below it.
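
For reference, here is the same reduction written serially on the host. This is a trivial baseline of my own for comparison, not code from the original slides; the array data and its length n are assumed.

// Serial reference reduction (sum) over an array of length n.
int reduceSerial(const int* data, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += data[i]; // the reduction operator here is addition
    }
    return sum;
}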

Reductions

Below we incrementally optimize a kernel to perform the reduction operation.

Reduction 1 - Interleaved Addressing, Divergent Branching

(Figure: interleaved addressing, step 1 with stride 1; values and the threads assigned to them.)
for (unsigned int s=1; s < blockDim.x; s *= 2) {
    if (tid % (2*s) == 0) {
        sdata[tid] += sdata[tid + s]; // one tree aggregation
    }
    __syncthreads(); // Allow all threads to write before continuing
}

In the first iteration, the even-numbered threads add in the value from the adjacent location, so one thread in every two resolves the condition to true; in the next iteration it is one in every four, and so on, building the tree out incrementally as shown. Here s is our stride.

The above algorithm is inefficient primarily due to warp divergence: the conditional causes threads within the same warp to take different paths, and those divergent paths have to be executed one after the other (serialized), creating overhead.

What's interesting is that each aggregation works as a "mini-reduction" on a subarray. The strides become larger and larger as we iteratively reduce larger chunks of the array.
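
For context, here is a sketch of how this loop sits inside a complete kernel, following the structure in the original slides: each block loads one element per thread into shared memory, reduces its tile, and writes a single partial sum. The kernel name reduce1 is my own label.

__global__ void reduce1(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];

    // Each thread loads one element from global to shared memory.
    // Assumes the grid exactly covers the input (no bounds check).
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    // Tree reduction in shared memory (interleaved addressing).
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum to global memory.
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}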

Reduction 2 - Interleaved Addressing, Reduced Branching

(Figure: interleaved addressing with strided indices, step 1 with stride 1; values and the threads assigned to them.)
for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    int index = 2 * s * tid;
    
    if (index + s < blockDim.x) { // check in bounds
        sdata[index] += sdata[index + s]; // thread takes care of diff. index
    }
    __syncthreads();
}

Note that a thread's id no longer maps directly to the element it updates: each thread writes to sdata[index] rather than sdata[tid].

Consecutive threads now do the active work. Since warps are formed from consecutive threads, the threads of a warp mostly take the same branch of the if condition, so less divergence occurs.

It is now worth talking about memory bank conflicts. Shared memory is divided into banks on the hardware; with 32 banks, the 32-bit words at indices 0, 32, 64, etc. all map to the same bank. Depending on our stride, separate threads of a warp can therefore end up accessing, say, indices 0 and 32 in the same instruction, and accesses to the same bank are serialized.

Consider what happens with stride 16 assuming 32 banks.
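
A concrete walk-through, assuming 4-byte elements and the usual mapping of the word at index k to bank k % 32: with s = 16, each active thread uses index = 32 * tid, so thread 0 touches sdata[0] and sdata[16], thread 1 touches sdata[32] and sdata[48], thread 2 touches sdata[64] and sdata[80], and so on. Every active thread in the warp hits bank 0 and then bank 16, so the accesses are serialized into a multi-way bank conflict.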

Reduction 3 - Sequential Addressing

(Figure: sequential addressing, step 1 with stride 8; values and the threads assigned to them.)
for (unsigned int s = blockDim.x/2; s>0; s>>=1) { // right bit shift (divide by 2)
    if (tid < s) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}

This removes the memory bank conflicts: in each iteration the active threads access consecutive shared-memory addresses, which fall into different banks, instead of the strided pattern above.
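
To see why, consider the first iteration for a 256-thread block (s = 128), again assuming 32 banks of 4-byte words. For the load of sdata[tid + s], threads 0 to 31 of the first warp read indices 128 to 159, which are 32 consecutive words spread across 32 different banks, so there is no conflict. The read-modify-write of sdata[tid] touches indices 0 to 31, again 32 different banks. The same holds at every smaller stride.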

Reduction 4 - Global Memory Optimization

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x * (blockDim.x*2) + threadIdx.x; // Note the 2
sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
__syncthreads();

In the previous reduction, half of the threads are idle from the very first loop iteration: threads with tid >= blockDim.x/2 never satisfy tid < s. Instead, each thread now reads two elements and adds them during the initial load (hence the blockDim.x * 2), and we launch half as many blocks. Critically, this first add happens during the load from global memory itself.
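
A hypothetical launch for this version (the kernel name reduce4, the block size, and the input length n are my assumptions; threads, d_idata, and d_odata follow the names used in the host code later on). Because each block now consumes 2 * threads input elements, we need half as many blocks as before.

int threads  = 256;                                    // block size (assumed)
int blocks   = (n + threads * 2 - 1) / (threads * 2);  // half the blocks of reductions 1-3
int smemSize = threads * sizeof(int);                  // one int of shared memory per thread
reduce4<<<blocks, threads, smemSize>>>(d_idata, d_odata);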

Reduction 5 - Warp Reduce / Unroll

// volatile stops the compiler from caching sdata in registers,
// so every write is immediately visible to the rest of the warp.
__device__ void warpReduce(volatile int* sdata, int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

// Block Manipulation
for (unsigned int s = blockDim.x/2; s > 32; s >>= 1) { // Note the `s > 32`
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}

if (tid < 32) warpReduce(sdata, tid);

Once 32 or fewer threads remain active, they all belong to a single warp. Within a warp, instructions execute in SIMD lockstep, so the threads are implicitly synchronized. We can therefore drop both the if check and the __syncthreads() call and, in some sense, "hard-code" the remaining steps, taking advantage of the fact that we are now working within a single warp.

Reduction 6 - Templating

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata) {
    // ... initialization code ...
    if (blockSize >= 512) {
        if (tid < 256) { sdata[tid] += sdata[tid + 256]; }
        __syncthreads();
    }
    if (blockSize >= 256) {
        if (tid < 128) { sdata[tid] += sdata[tid + 128]; }
        __syncthreads();
    }
    if (blockSize >= 128) {
        if (tid < 64) { sdata[tid] += sdata[tid + 64]; }
        __syncthreads();
    }
    if (tid < 32) warpReduce<blockSize>(sdata, tid);
}

// Host-side dispatch: pick the template instantiation matching the block size.
switch (threads) {
    case 512:
        reduce6<512><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata);
        break;
    case 256:
        reduce6<256><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata);
        break;
    case 128:
        reduce6<128><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata);
        break;
    // ... additional cases for other block sizes ...
}

The block size is explicitly enumerated, allowing us to completely remove the loop and the operations that came with it, namely the loop condition check (s > 0) and the stride update (s >>= 1).

Templating turns blockSize into a compile-time constant. Every blockSize >= ... condition is then known at compile time, so the compiler removes the branches that can never execute (referred to as dead code elimination).
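
For reduce6 to compile, warpReduce itself is templated so that its steps can be pruned in the same way; a sketch following the pattern in the original slides:

template <unsigned int blockSize>
__device__ void warpReduce(volatile int* sdata, unsigned int tid) {
    // Each step is kept only if the block is large enough to need it;
    // with blockSize a compile-time constant, the rest are compiled out.
    if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
    if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
    if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
    if (blockSize >=  8) sdata[tid] += sdata[tid + 4];
    if (blockSize >=  4) sdata[tid] += sdata[tid + 2];
    if (blockSize >=  2) sdata[tid] += sdata[tid + 1];
}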

Reduction 7 - Algorithm Cascading

template <unsigned int blockSize>
__global__ void reduce7(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockSize*2) + tid;
    unsigned int gridSize = blockSize * 2 * gridDim.x; // Note the 2 here

    sdata[tid] = 0;

    while (i < n) {
        sdata[tid] += g_idata[i] + g_idata[i + blockSize]; // This block and next
        i += gridSize;  
    }
    __syncthreads();

    // Standard parallel reduction in shared memory
    if (blockSize >= 512) {
        if (tid < 256) { sdata[tid] += sdata[tid + 256]; }
        __syncthreads();
    }
    // ... remaining reduction steps ...

    if (tid < 32) warpReduce<blockSize>(sdata, tid);
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
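
A hypothetical way to drive this kernel: launch a fixed-size grid for the first pass, then finish the handful of per-block partial sums on the host. The sizes below are assumptions, and, as written above, the kernel reads g_idata[i + blockSize] without a bounds check, so n should be a multiple of 2 * blockSize.

int threads  = 256;                         // assumed block size
int blocks   = 64;                          // fixed grid, decoupled from n
int smemSize = threads * sizeof(int);
reduce7<256><<<blocks, threads, smemSize>>>(d_idata, d_odata, n);

// Finish the 64 partial sums on the host (or with one more tiny kernel).
int partial[64];
cudaMemcpy(partial, d_odata, blocks * sizeof(int), cudaMemcpyDeviceToHost);
int total = 0;
for (int b = 0; b < blocks; ++b) total += partial[b];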

So far, each block has reduced at most twice as many elements as fit in its shared-memory tile (2 * blockDim.x). Before, if we couldn't cover the entire array this way, we had to launch multiple reduction kernels and then further reduce the outputs of those kernels. With this reduction, a single kernel launch with a fixed-size grid can consume an array of practically any size, leaving only a handful of per-block partial sums: each thread simply accumulates as many elements from global memory as needed.

As shown in reduction 7, each thread first strides through global memory, sequentially accumulating elements into its shared-memory slot, and only then do we perform the traditional tree reduction we have been discussing.

The motivation comes from Brent's theorem: with O(N / log N) threads, each doing O(log N) sequential additions before the O(log N) tree steps, the total work stays O(N) while the step count remains logarithmic. Put simply, moving right (sequential accumulation) before moving down (tree reduction) is algorithmically favorable.

Memory coalescing refers to reading global memory efficiently. Data is fetched from global memory in large contiguous transactions; if neighboring threads access consecutive addresses, their loads are combined (coalesced) into a few transactions instead of many separate, expensive reads.

We see the recurring theme of accessing information sequentially.

Analysis

Performance Comparison (overall ~30x speedup from the first kernel to the last)

Main Issue                    Optimization              Time (ms)
Warp Divergence               Interleaved Addressing    8.054
Memory Bank Conflicts         Reduced Branching         3.456
Thread Underutilization       Sequential Addressing     1.722
Global Memory Bottleneck      Global Loading            0.965
Unnecessary Synchronization   Warp-Level Primitives     0.606
Runtime Overhead              Compile-Time Constants    0.381
Large Array Handling          Sequential + Parallel     0.268

References
Mark Harris, "Optimizing Parallel Reduction in CUDA", NVIDIA. https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf