GPU Computing
Preface
This website contains my lecture notes from a lecture by Kazem Shekofteh from the academic year 2022/2023 (University of Heidelberg). If you find something incorrect/unclear, or would like to contribute, feel free to submit a pull request (or let me know via email).
Lecture overview
- Introduction [slides]
- CUDA programming [slides]
- Basic architecture [slides]
- Matrix multiplication optimizations [slides]
- Parallel computing [slides]
- Profiling [slides]
- Scheduling optimizations [slides]
- N-body optimization [slides]
- Host-device optimizations [slides]
- OpenACC [slides]
- Stencil computations [slides]
- OpenCL [slides]
- Consistency & Coherence [slides]
Bulk-Synchronous Parallel (BSP) model
- established in 1990
- attempt to describe GPU computing
- parallel programs are split into supersteps
- compute something
- communicate what you did
- synchronize with other processors
- parallel slackness: run v virtual processors on p physical processors
  - v = 1: not viable
  - v = p: unpromising wrt. optimality
  - v >> p (e.g. v = p log p): leverages slack to schedule and pipeline computation – latency tolerance
Scaling rules
Moore’s law
A law about the exponential increase of processing power; it has many variants:
- 1965: number of transistors will double each year,
- 1975: every two years,
- CPU performance will double every 18 months,
- memory size four times every three years, etc.
Amdahl’s law
We want to find the maximum possible speedup, given
- the parallelizable time T_p of the task and the serial time T_s,
- N parallel execution units.
The resulting speedup is S(N) = (T_s + T_p) / (T_s + T_p / N), which is bounded by (T_s + T_p) / T_s no matter how large N gets.
- best case: linear speedup (if superlinear, something is wrong)
- usually has diminishing returns
CUDA programming
- compute kernel as C program, executed on GPU
- explicit data and thread-level parallelism
- computing, not graphics processing
A program always consists of two parts:
- host = CPU (no/little parallelism)
- device = GPU (high parallelism)
- made up of kernels, which are C functions that are executed on the GPU
- are launched asynchronously (wrt host, not wrt each other)
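As a minimal sketch of this split (my own example, not the one from the slides), a vector-add kernel and its asynchronous launch:

```cpp
#include <cuda_runtime.h>

// Kernel: each thread adds one element.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    int n = 1 << 20;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));

    int tpb = 256;                                    // threads per block
    add<<<(n + tpb - 1) / tpb, tpb>>>(a, b, c, n);    // returns immediately (asynchronous wrt. host)
    cudaDeviceSynchronize();                          // host blocks here until the kernel is done

    cudaFree(a); cudaFree(b); cudaFree(c);
}
```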
Overview
- general kernel syntax is `kernel<<<blockCount, blockSize>>>(args)`
  - `blockCount` and `blockSize` are `dim3` and can be 1D/2D/3D
- each thread has a unique `threadIdx.{x,y,z}`
- each block has `blockIdx.{x,y,z}` (and `blockDim.{x,y,z}` for its size)
  - a block will be assigned to one streaming multiprocessor (SM)
  - blocks are sometimes called Cooperative Thread Arrays (CTAs)
- we have to make sure that the GPU has enough SMs and threads/block to start!

Memory
Global/device memory
- accessible from all threads
- high latency
- lifetime exceeds thread lifetime
- can be quite large, depending on the GPU
- includes thread’s local memory (is only thread-local)
- has address interleaving: successive addresses are mapped to different memory banks so sequential access by multiple threads is faster (each accesses different bank)
- allocation: `cudaMalloc(&dmem, size);`
- deallocation: `cudaFree(dmem);`
- transfer (blocking): `cudaMemcpy(dest, src, size, transfer_type);`
  - the `transfer_type` is `cudaMemcpy<FROM>To<TO>` (Host/Device)
Shared memory
- only accessible from the thread's block
- access cost is (in the best case) equal to register access
- lifetime is the same as the thread block's lifetime
- size is on the order of tens of KB per block (with the SM having more, to accommodate multiple blocks)
- organized into banks:
  - typically 16-32 banks with 4B width
  - parallel access if there is no conflict (otherwise the accesses are serialized)
- can be static/dynamic, based on whether the size is known at compile time or not
  - the access time shouldn't differ (at least not significantly)
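A small sketch of static vs. dynamic shared memory (names are illustrative):

```cpp
// Static: the size is known at compile time.
__global__ void static_smem(const float *data) {
    __shared__ float tile[256];
    tile[threadIdx.x] = data[threadIdx.x];
    __syncthreads();
    // ... work on tile ...
}

// Dynamic: the size is given as the third launch parameter.
__global__ void dynamic_smem(const float *data) {
    extern __shared__ float tile[];
    tile[threadIdx.x] = data[threadIdx.x];
    __syncthreads();
}

// launch: dynamic_smem<<<blocks, tpb, tpb * sizeof(float)>>>(data);
```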
Registers
- are at thread level
- depend on run-time configuration
- max. 255 registers/thread
- we can’t really specify what will become a register
- register spilling: if the source code needs more registers than are available, values spill into local memory
Host memory
- pinned (i.e. can’t be paged out by the system)
  - use `cudaMallocHost`
  - is a scarce resource (locks memory away from other processes)
    - the OS might limit how much of this can be done
  - transfers to/from it are faster, which pays off if we copy a lot of data
- pageable (unpinned)
  - just use `malloc`
Code examples
Allocating memory:
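A minimal sketch (sizes and names are my own):

```cpp
float *dmem;                                           // device pointer
cudaMalloc(&dmem, n * sizeof(float));                  // global (device) memory

float *hmem;                                           // pinned host memory
cudaMallocHost(&hmem, n * sizeof(float));

float *pageable = (float *) malloc(n * sizeof(float)); // pageable host memory
```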
Copying memory:
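Continuing the sketch above:

```cpp
// host -> device, blocks until the copy has finished
cudaMemcpy(dmem, hmem, n * sizeof(float), cudaMemcpyHostToDevice);

// device -> host
cudaMemcpy(hmem, dmem, n * sizeof(float), cudaMemcpyDeviceToHost);
```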
Variable declaration
| Prefix | Location | Access from | Lifetime |
|---|---|---|---|
| __device__ | global memory (device memory) | device/host | program |
| __constant__ | constant memory (device memory) | device/host | program |
| __shared__ | shared memory | device | thread block |
Function declaration
| Prefix | Executed on | Callable from |
|---|---|---|
| __device__ | device | device |
| __global__ | device | host |
| __host__ | host | host |
- `__global__` defines a kernel; its return type is `void`
- `__host__` and `__device__` can be combined
- for functions executed on the GPU:
  - no recursion
  - no static variable declarations inside the function
  - no variable parameter count
Coalescing
Combining fine-grain access by multiple threads into a single operation.
- for global memory, addresses should be multiples of cache line size
- if done improperly, results in significant bandwidth decline:

- for shared memory, concurrent access by different threads is handled by the memory banks
- for our cluster, there were 32 banks, with successive 32-bit words mapped to successive banks (mod 32), which made the stride pattern look like this:
- thread scheduling does not result in coalesced access, needs to be handled manually!
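As an illustration (my own sketch, not from the slides), coalesced vs. strided global memory access:

```cpp
// Coalesced: consecutive threads read consecutive addresses -> few transactions per warp.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: each thread jumps by `stride`, touching many different cache lines.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```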

Thread scheduling
- up to 1k threads per block
- one block executes on one SM (an SM can host several blocks concurrently, up to a hardware limit)
- no global synchronization! (context switching on GPU is waaaay too expensive)
- threads in a block are grouped into warps of 32 threads (the scheduling units of the GPU)
- implementation decision, not CUDA
- all threads in a warp execute the same instruction (on their own data and registers)
- for conditionals, a mask is used (all threads still execute the same instruction, but only the results of the threads with the mask set are written to memory)
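A tiny sketch of such divergence (illustrative only):

```cpp
// Both branches are executed by the whole warp, each under a mask.
__global__ void divergent(int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)        // half of each warp is masked off here ...
        out[i] = 1;
    else                   // ... and the other half here
        out[i] = 2;
}
```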
During execution, the hardware schedules blocks to SMs
- happens repeatedly – when some blocks terminate, others will be distributed
- how many blocks fit on an SM depends on each block's resources (for ex. allocated shared memory)
An SM has multiple warp schedulers that can execute multiple warps concurrently:
- context switching is fast (as opposed to CPU), since the data stays on-chip
- if a warp doesn’t have resources, it is stalled while they are fetched
- this happens really fast because the data stays in registers
- there are multiple policies for scheduling warp executions:
- Round Robin – fetched in a round robin manner
- Least Recently Fetched – fetch based on which has not been fetched the longest
- Fair – fetch for the one with the least amount of fetches
Optimizing Matrix Multiplication
Naive (CPU)
- nothing too interesting, just three loops
- can be further improved by changing the order of addition for better cache hits
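A sketch of the naive CPU version (not the exact code from the slides):

```c
// Naive CPU version: C = A * B for n x n row-major matrices.
void matmul(const float *A, const float *B, float *C, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```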
Naive (GPU)
- looping handled by a single thread block
- per loop iteration, we do 2 FLOPs and 4 memory accesses
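A sketch of this version, assuming a single block of n × n threads (so n is limited by the maximum block size):

```cpp
// One thread block; thread (tx, ty) computes one element of C.
__global__ void matmul_naive(const float *A, const float *B, float *C, int n) {
    int row = threadIdx.y;
    int col = threadIdx.x;
    float sum = 0.0f;
    for (int k = 0; k < n; k++)                 // 2 FLOPs and several global accesses per iteration
        sum += A[row * n + k] * B[k * n + col];
    C[row * n + col] = sum;
}

// launch: matmul_naive<<<1, dim3(n, n)>>>(A, B, C, n);  // only works while n*n <= 1024
```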
Multiple thread blocks (GPU)
- we can split the matrix across multiple thread blocks so that each thread computes a single output element
- this no longer limits the matrix size to the maximum number of threads per block
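A sketch with a 2D grid of blocks (the 16 × 16 block size is an arbitrary choice):

```cpp
// Each thread still computes one element, but the matrix is now covered by many blocks.
__global__ void matmul_blocks(const float *A, const float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;
    float sum = 0.0f;
    for (int k = 0; k < n; k++)
        sum += A[row * n + k] * B[k * n + col];
    C[row * n + col] = sum;
}

// launch: dim3 tpb(16, 16);
//         dim3 blocks((n + 15) / 16, (n + 15) / 16);
//         matmul_blocks<<<blocks, tpb>>>(A, B, C, n);
```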
Using shared memory (GPU)
- we can utilize shared memory to greatly improve the FLOP/global memory access ratio
- then all threads operate on those values
- main trick: copy parts of the matrix from global memory to shared memory!
- creates an additional loop: we tile the blocks such that they fit into shared memory
Note: I think that the code in the presentation is wrong – the tiling doesn’t make sense for sizes other than the size of the block (otherwise threads are writing out of bounds). I’ve modified it, hopefully it’s somewhat correct.
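A sketch of such a tiled kernel (my own reconstruction; it assumes the matrix size is a multiple of the tile size):

```cpp
#define TILE 16

// Each block loads TILE x TILE tiles of A and B into shared memory, so each
// global element is read once per tile instead of once per thread.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / TILE; t++) {              // assumes n is a multiple of TILE
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                              // tile fully loaded

        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                              // done reading before the next load
    }
    C[row * n + col] = sum;
}
```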
- `__syncthreads();` synchronizes all threads within a single block ("wait here")
  - behaves as a barrier to make sure all threads are on the same page
- useful particularly when repeatedly writing/reading shared memory
- RAW (true/data dependency): don’t read before you finish writing
- WAR (anti-dependency): don’t write before you finish reading
- WAW (output dependency): if the last write is important, make sure it’s the last
Parallelism
- sequential program
- single thread of control
- instructions executed sequentially
- concurrent program
- several autonomous sequential threads
- parallel execution possible
- execution determined by implementation
- concurrency is not parallelism – we can implement concurrency by interleaving on a single CPU; it just indicates that the threads are independent
Various levels of parallelism:
- Instruction Level Parallelism (ILP) – parallelism of one instruction stream
- huge amount of dependencies and branches
- limited parallelism: pipelining, out-of-order execution
- Thread Level Parallelism (TLP) – parallelism of multiple independent instruction streams
- fewer dependencies, no limitations due to branches
- limited by the maximum number of concurrently executable streams
- Data Level Parallelism (DLP) – applying one operation on multiple independent elements
- parallelism depends on data structure
- vectorization techniques (single instruction processing multiple values)
- Request Level Parallelism (RLP) – datacenter and customers
The lecture goes into theoretical parallel algorithm design.
Synchronization
Definition (synchronization): enforcement of a defined logical order between events. This establishes a defined time-relation between distinct places, thus defining their behavior in time.
- SIMD (warps on GPU): one instruction, no synchronization necessary
- MIMD: synchronization necessary (shared variables, process synchronization, etc.)
Profiling
Definition (arithmetic density): sometimes called computational intensity, it is the ratio between floating point operations and data movement, i.e. arithmetic density = (# floating point operations) / (# bytes moved to/from memory).
To evaluate performance, we use the roofline model:
- the performance is limited by its weakest link – either memory-bound or compute-bound
- obviously depends on the hardware (a memory-bound program can become compute-bound when run on a different machine)
- by optimizing performance, the line can be stretched in both directions

The lecture goes into CUDA profiling. Here are some important concepts:
- Nsight – records and analyzes kernel performance metrics (in detail)
- compute/memory graphs, roofline analysis, etc.
Scheduling optimizations
- reduction is a common and important data parallel primitive (sum, histogram, etc.)
- easy to implement but hard to implement fast
- to process very large arrays, we will require more than one SM – synchronization problem
- solution: one reduction layer will be one kernel launch
- the examples don’t actually do this, but in practice this would be done
- we’re also assuming associativity (so we can do shenanigans with the order of operations)
Naive implementation
- we’re implementing a reduction add
- one kernel launch will solve a reduction subtree of its block size
- the thread structure is 1D, shared memory stores the subtree
- the output is an array of the results of each block
- this isn’t entirely naive, we’re already using shared memory
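A sketch of this kernel, following the classic CUDA reduction examples (names are mine):

```cpp
// Each block reduces blockDim.x elements of g_in and writes one partial sum.
__global__ void reduce_naive(const float *g_in, float *g_out) {
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + tid;
    sdata[tid] = g_in[i];
    __syncthreads();

    // interleaved addressing, divergent: only threads with tid % (2*s) == 0 work
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_out[blockIdx.x] = sdata[0];       // one result per block
}

// launch: reduce_naive<<<n / tpb, tpb, tpb * sizeof(float)>>>(d_in, d_out);
```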

Interleaved address divergent
- remember that all threads in a warp execute the same instructions
- the previous code uses threads from all over the place – let’s use the ones from the start
- improves performance considerably! (see table below)
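Sketch (same structure as before, only the indexing changes):

```cpp
// Same interleaved addressing, but the active threads are now the first
// consecutive ones of the block, so whole warps stay active or idle together.
__global__ void reduce_interleaved(const float *g_in, float *g_out) {
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    sdata[tid] = g_in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        unsigned int index = 2 * s * tid;             // contiguous threads do the work
        if (index < blockDim.x)
            sdata[index] += sdata[index + s];
        __syncthreads();
    }
    if (tid == 0) g_out[blockIdx.x] = sdata[0];
}
```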

Resolving bank conflicts
- consecutive threads should access consecutive memory addresses
- see memory bank stride access graph from a few sections ago
- another large performance increase (see table below)
- this version seems like the most intuitive one to implement
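Sketch with sequential addressing:

```cpp
// Consecutive threads now read consecutive shared memory addresses,
// so there are no bank conflicts.
__global__ void reduce_sequential(const float *g_in, float *g_out) {
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    sdata[tid] = g_in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_out[blockIdx.x] = sdata[0];
}
```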

Making use of idle threads
- already in the first iteration, half of the threads don’t do anything
- they just load something to shared memory at the beginning and are done
- idea: start with half of the blocks and do the first operation while loading
- almost double performance increase again!
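Sketch – note that only half as many blocks are launched:

```cpp
// Each thread already adds two elements while loading from global memory,
// so no thread is idle in the first iteration.
__global__ void reduce_first_add(const float *g_in, float *g_out) {
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * (blockDim.x * 2) + tid;
    sdata[tid] = g_in[i] + g_in[i + blockDim.x];      // first add during load
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_out[blockIdx.x] = sdata[0];
}

// launch with half the blocks:
// reduce_first_add<<<n / (2 * tpb), tpb, tpb * sizeof(float)>>>(d_in, d_out);
```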
Manual unrolling
- when we have only one warp left, we don’t need any `__syncthreads()` calls
- a bit ugly but functional, and it further increases the speed
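Sketch (the `volatile` trick assumes the classic lockstep warp behaviour; on newer GPUs one would use `__syncwarp()` or warp shuffles instead; it also assumes at least 64 threads per block):

```cpp
// Once only one warp (32 threads) is left, the threads run in lockstep and
// no __syncthreads() is needed; volatile prevents stale register caching.
__device__ void warp_reduce(volatile float *sdata, unsigned int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

__global__ void reduce_unrolled(const float *g_in, float *g_out) {
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * (blockDim.x * 2) + tid;
    sdata[tid] = g_in[i] + g_in[i + blockDim.x];
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid < 32) warp_reduce(sdata, tid);            // last warp, no barriers
    if (tid == 0) g_out[blockIdx.x] = sdata[0];
}
```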
Overview
Here is an overview of the versions we have implemented so far – naive, interleaved addresses, resolved bank conflicts, first add during load, and manual unrolling – measured in GB/s for various TPB configurations (the concrete values are omitted here; each version improves substantially on the previous one).
N-body optimization
- we have N bodies and want to evaluate their movement
- uses Newton’s second law of motion
- movement is approximated by temporal discretization (i.e. move by some small time)
I’m not writing the formulas from the slides, this isn’t a physics course.
AOS vs SOA
- AOS: Arrays of Structures
- data grouped per element index, different element types next to each other
- typical in most applications
- consecutive threads won’t access consecutive places in memory
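Sketch of an AoS layout (field names are illustrative):

```cpp
#define N 1024

// Array of Structures: all attributes of one body are stored next to each other.
struct Body {
    float x, y, z;      // position
    float vx, vy, vz;   // velocity
};

Body bodies[N];         // consecutive threads read bodies[i].x -> strided accesses
```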
- SOA: Structures of Arrays
- data grouped per element type, elements distributed among different arrays
- typical in GPU applications (where multiple threads are accessing memory concurrently)
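The same data as an SoA layout:

```cpp
#define N 1024

// Structure of Arrays: one array per attribute.
struct Bodies {
    float x[N],  y[N],  z[N];     // positions
    float vx[N], vy[N], vz[N];    // velocities
};

Bodies bodies;          // consecutive threads read bodies.x[i] -> coalesced accesses
```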
Naive implementation
- a single thread takes care of a single body (with enough blocks to cover all N bodies)
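A sketch of a naive kernel along these lines (the softening constant, the `float4` layout with the mass in `w`, and the unroll factor are my own choices):

```cpp
// One thread integrates one body; the inner loop walks over all other bodies
// in global memory.
__global__ void nbody_naive(const float4 *pos, float4 *vel, int n, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float3 acc = make_float3(0.0f, 0.0f, 0.0f);
    #pragma unroll 4                               // factor determined empirically
    for (int j = 0; j < n; j++) {
        float3 r = make_float3(pos[j].x - pos[i].x,
                               pos[j].y - pos[i].y,
                               pos[j].z - pos[i].z);
        float dist2 = r.x * r.x + r.y * r.y + r.z * r.z + 1e-9f;  // softening
        float inv   = rsqrtf(dist2 * dist2 * dist2);              // 1 / dist^3
        acc.x += r.x * pos[j].w * inv;             // pos.w stores the mass
        acc.y += r.y * pos[j].w * inv;
        acc.z += r.z * pos[j].w * inv;
    }
    vel[i].x += acc.x * dt;
    vel[i].y += acc.y * dt;
    vel[i].z += acc.z * dt;
}
```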
- `#pragma unroll` will reduce branch overhead
  - the unroll factor has to be determined empirically
  - be careful – if N isn't a multiple of the factor, the loop might skip some calculations or segfault
Shared memory
- split inner loop into sections (by block width) that are saved to shared memory
- each thread still computes all interactions for one body, but in block-sized chunks
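A sketch of the tiled kernel (assumes N is a multiple of the block size):

```cpp
// The block cooperatively loads blockDim.x bodies into shared memory, every
// thread accumulates the interactions with that tile, then the next tile is loaded.
__global__ void nbody_shared(const float4 *pos, float4 *vel, int n, float dt) {
    extern __shared__ float4 tile[];               // blockDim.x bodies
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 my = pos[i];
    float3 acc = make_float3(0.0f, 0.0f, 0.0f);

    for (int start = 0; start < n; start += blockDim.x) {
        tile[threadIdx.x] = pos[start + threadIdx.x];   // one coalesced load per thread
        __syncthreads();

        for (int j = 0; j < blockDim.x; j++) {
            float3 r = make_float3(tile[j].x - my.x, tile[j].y - my.y, tile[j].z - my.z);
            float dist2 = r.x * r.x + r.y * r.y + r.z * r.z + 1e-9f;
            float inv   = rsqrtf(dist2 * dist2 * dist2);
            acc.x += r.x * tile[j].w * inv;
            acc.y += r.y * tile[j].w * inv;
            acc.z += r.z * tile[j].w * inv;
        }
        __syncthreads();                            // done with this tile
    }
    vel[i].x += acc.x * dt;
    vel[i].y += acc.y * dt;
    vel[i].z += acc.z * dt;
}

// launch: nbody_shared<<<n / tpb, tpb, tpb * sizeof(float4)>>>(pos, vel, n, dt);
```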
Host-device optimizations
- streams – CPU/GPU concurrency!
- concurrent copy & execute (memcpy & kernel execute)
- here is a nice presentation that sums them up well
Host-device synchronization
- context-based – block until all outstanding CUDA operations have completed
  - `cudaMemcpy()`, `cudaDeviceSynchronize()`
  - all of these operate on the default stream
  - the default stream is special: it will block all other streams until it is finished

- stream-based – block until all outstanding CUDA operations in a stream have completed
  - `cudaStreamSynchronize(stream)` – wait until operations on this stream finish
  - `cudaDeviceSynchronize()` – wait until operations on ALL streams finish
  - the user specifies which stream a kernel launch/memory operation goes to:
    - `kernel<<<..., cudaStream_t stream>>>(...)`
    - `cudaMemcpyAsync(..., cudaStream_t stream)`
- the number of streams depends on the architecture
- we can also insert events and check when they have been completed

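A sketch of two overlapping streams (the kernel and buffer names are my own):

```cpp
// Split the work into two chunks and overlap H2D copy, kernel and D2H copy
// across two streams; h_data should be pinned for the async copies to overlap.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void run(float *h_data, float *d_data, int n) {
    int chunk = n / 2;
    cudaStream_t s[2];
    for (int k = 0; k < 2; k++) cudaStreamCreate(&s[k]);

    for (int k = 0; k < 2; k++) {
        int off = k * chunk;
        cudaMemcpyAsync(d_data + off, h_data + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        scale<<<(chunk + 255) / 256, 256, 0, s[k]>>>(d_data + off, chunk);
        cudaMemcpyAsync(h_data + off, d_data + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();                        // wait for both streams
    for (int k = 0; k < 2; k++) cudaStreamDestroy(s[k]);
}
```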
Issues
- hardware used to only have two types of queues:
- copy engine – issues copy operations
- kernel engine – launches kernels
- when each stream's operations are issued back to back (copy, kernel, copy), the single copy queue serializes the streams, resulting in sequential execution
- we have to move the kernel launches after the copies!

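Continuing the sketch above, the same work issued breadth-first:

```cpp
// All H2D copies first, then all kernels, then all D2H copies, so the single
// copy queue no longer blocks the second stream behind the first kernel.
for (int k = 0; k < 2; k++)
    cudaMemcpyAsync(d_data + k * chunk, h_data + k * chunk, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, s[k]);
for (int k = 0; k < 2; k++)
    scale<<<(chunk + 255) / 256, 256, 0, s[k]>>>(d_data + k * chunk, chunk);
for (int k = 0; k < 2; k++)
    cudaMemcpyAsync(h_data + k * chunk, d_data + k * chunk, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, s[k]);
```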
- the newer architectures are better and have multiple hardware queues
- be careful, some operations implicitly synchronize all other CUDA operations
- page-locked memory allocation (`cudaMallocHost()` or `cudaHostAlloc()`)
- device memory allocation (`cudaMalloc()`)
- non-async versions of memory operations (`cudaMemcpy()`, `cudaMemset()`)
Virtual Shared Memory
- lets CPU and GPU have the same address space
- nicer to deal with (only one type of pointer)
- access costs can be quite significant
Unified Virtual Addressing (UVA)
- support since CUDA 4
- single virtual address space for all memory in the system
- GPU code can access all memory
- does not automagically migrate data from one physical location to another
Unified Memory (UM)
- newer way (CUDA 6) of handling memory
- pool of managed memory that is shared between CPU and GPU
- automatic (page) migration between CPU and GPU domains
Productivity
- make compiler responsible for low-level implementations
- generate kernels, launch them
- handle data movements, optimizations
- simplifies the writing process
OpenACC
- directive-based programming model to off-load compute-intensive loops to accelerators
- cross-platform (C, C++, Fortran)
- execution model: host-directed execution with an attached accelerator device
- offloading compute-intensive regions
- coarse-grain parallelism: fully parallel execution across execution units → gang parallelism
- limited support for synchronization
- CUDA: grid level
- fine-grain parallelism: multiple threads on a single execution unit → worker parallelism
- latency hiding techniques
- CUDA: warps at block level
- SIMD/vector operations: multiple operations per thread → vector parallelism
- CUDA: threads at block level
- compiler takes care of memory
- is implemented using directives: `#pragma acc directive-name [clauses]`
  - `parallel` – user responsible for finding parallelism
  - `kernels` – compiler responsible for finding parallelism
  - `loop [clause]` – share the loop among threads/execute it sequentially
    - `gang` – among thread blocks
    - `worker` – among thread warps of a block
    - `vector` – among threads
    - `seq` – sequential execution
  - `data [clause]`
    - `copy` – H2D at region start, D2H at region end
    - `copyin`/`copyout` – into/out of the device
    - `create` – device allocation
    - `present` – note that the data is already there
- is an iterative process (test, see how it does, repeat)
Naive version
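As a running example (my own choice, not necessarily the lecture's), a plain saxpy loop without any directives:

```c
// Plain C baseline, runs entirely on the host.
void saxpy(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```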
kernels pragma
- the compiler will let us know what it did
- which kernels it launched
- what it copied
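Sketch with the `kernels` pragma (`restrict` helps the compiler prove the loop iterations are independent):

```c
void saxpy(int n, float a, const float *restrict x, float *restrict y) {
    #pragma acc kernels          // compiler decides how to parallelize the region
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```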
parallel loop pragma
- “please, distribute this loop over different threads”
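Sketch with `parallel loop`:

```c
void saxpy(int n, float a, const float *restrict x, float *restrict y) {
    #pragma acc parallel loop    // we assert that the loop is parallel
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```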
data pragma
- “please, copy this data to the accelerator”
- “now distribute this for loop, oh and also you have the data already”
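Sketch with a `data` region spanning two kernel launches (x and y stay on the device for the whole region; only y is copied back):

```c
void run(int n, float a, float *restrict x, float *restrict y) {
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        #pragma acc parallel loop present(x, y)   // data is already on the device
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];

        #pragma acc parallel loop present(y)
        for (int i = 0; i < n; i++)
            y[i] = 2.0f * y[i];
    }
}
```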
Nested loops
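A sketch of nested loops mapped to gang/vector parallelism (matrix–vector product as an illustrative example):

```c
void matvec(int n, const float *restrict A, const float *restrict x, float *restrict y) {
    #pragma acc parallel loop gang            // outer loop over rows -> thread blocks
    for (int i = 0; i < n; i++) {
        float sum = 0.0f;
        #pragma acc loop vector reduction(+:sum)   // inner loop -> threads
        for (int j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];
        y[i] = sum;
    }
}
```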
Stencil computations
- iterative kernel that updates regular arrays based on certain patterns
- useful for image processing, partial differential equations, fluid dynamics, etc.
Image processing
- connected component labeling – identify connected areas in this image
- each segment will be labeled with a different value
- 4-way or 8-way connectivity
- for the input image, we first apply a threshold to create a black/white image
Note: I am beyond confused to what the algorithm actually is, this is my best guess:
- in parallel, assign an initial label to every pixel (taking the threshold into account)
- in parallel, (repeatedly) merge labels (taking the minimum)
- will be done diagonally
- wavefronts – storing previous elements in shared memory to reduce memory contention
Partial differential equations
Read the slides, I’m fairly certain this isn’t too important.
Performance optimizations
- stencil codes are memory-bound
- when partitioning the data, there is overlap (called the halo)
- vertical halos are poorly aligned in memory
- marching planes – only keep 3 planes in shared memory, cycling the buffers
GPU programming models
- up to now, we’ve seen CUDA
- similar approach: OpenCL (imperative language)
- directive-based: OpenACC (declarative language, we’ve seen it)
OpenCL
Platform model
- host stays host
- compute devices are things like GPUs
- contain compute units (SMs), which have processing elements (thread blocks/warps)
Execution model
- host code (sequential parts, control)
- kernels still run on device (computationally intensive part)
- context – devices, kernel objects, program objects, memory objects
- we have to explicitly create queues for different types of commands
- work item – kernel function in execution for a single point in the defined index space
- global ID / work group ID + local ID
- work group – organization structure of work items with a given kernel instance
- synchronization between work groups not possible (same as CUDA)
- can synchronize between work items
NDRange: N-dimensional index space
| CUDA | OpenCL |
|---|---|
| Grid | NDRange |
| Thread Block | Work group |
| Thread | Work item |
| Thread ID | Global ID |
| Block index | Work group ID |
| Thread index | Local ID |
Different types of kernels exist:
- OpenCL kernels: kernel objects associated with kernel functions (user kernels)
- Native kernels: functions compiled outside of OpenCL that execute on a device alongside OpenCL kernels and share memory objects with them
- Built-in kernels: specific for a particular device
Memory model
- memory regions: distinct memories visible to both host and device
- memory objects: objects defined by the OpenCL API
- Shared Virtual Memory: virtual address space exposed to both host and devices (UM in CUDA)
- has special memory objects: buffer, image, pipe
| CUDA | OpenCL |
|---|---|
| Host memory | Host memory |
| Global/device memory | Global memory |
| Shared memory | Local memory |
| Registers/local memory | Private memory |
Consistency & Coherence
Ordering problem: threads operate independently: which order to apply?
Cache coherence: two threads write to the same variable, and each write lands in a cache:
- caches have to be coherent (all threads see the same value)
- different cache policies:
- write-back: write to cache, at some later point write to memory
- write-through: write to cache and immediately to memory too
- both cases have coherence problems, if we don’t update caches of other threads
- a microarchitectural feature
Memory consistency: the order in which memory operations appear to be performed
- as opposed to coherence, focuses on the order of execution
- strict: any write seen immediately
- sequential: writes by different processors need to be seen in the same order by all processors
- highly relaxed for GPU, few guarantees
- `__threadfence()` stalls the current thread until all its writes to shared/global memory are visible to other threads (with `__threadfence_block()`, only within the block / shared memory)
- `__syncthreads()` is a stronger version, since it also synchronizes thread execution
- an architectural feature