# CUDA Guide
Applies to: CUDA 11+, GPU Computing, Deep Learning, Scientific Computing, HPC
## Core Principles
- Parallelism First: Design algorithms for thousands of concurrent threads; serial thinking is the primary enemy of GPU performance
- Memory Hierarchy Awareness: Global memory is 100x slower than shared memory and 1000x slower than registers; every kernel design starts with memory access planning
- Coalesced Access: Adjacent threads must access adjacent memory addresses; a single misaligned access pattern can reduce bandwidth by 32x
- Occupancy Over Cleverness: Maximize active warps per SM by managing register count, shared memory usage, and block dimensions together
- Minimize Host-Device Transfers: PCIe bandwidth is the bottleneck; overlap transfers with computation using streams and pinned memory
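The transfer-overlap principle can be sketched with pinned memory and two streams. This is a minimal sketch, not a production pipeline: `process_chunk` is a hypothetical kernel name, and error checking is omitted for brevity.

```cuda
#include <cstring>

// Hypothetical kernel operating on one chunk of data
__global__ void process_chunk(float* data, int n);

// Overlap host-to-device copies with kernel execution by ping-ponging
// between two streams. Pinned host memory is required for async copies.
void pipeline(const float* h_data, int total, int chunk) {
    float* h_pinned;
    cudaMallocHost(&h_pinned, total * sizeof(float));  // Pinned: async-capable
    memcpy(h_pinned, h_data, total * sizeof(float));

    float* d_buf[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; i++) {
        cudaMalloc(&d_buf[i], chunk * sizeof(float));
        cudaStreamCreate(&stream[i]);
    }

    for (int off = 0, i = 0; off < total; off += chunk, i ^= 1) {
        int n = (off + chunk <= total) ? chunk : total - off;
        // The copy on stream[i] overlaps with the kernel still
        // running on the other stream; within a stream, the kernel
        // waits for its own copy to finish.
        cudaMemcpyAsync(d_buf[i], h_pinned + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[i]);
        process_chunk<<<(n + 255) / 256, 256, 0, stream[i]>>>(d_buf[i], n);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; i++) {
        cudaFree(d_buf[i]);
        cudaStreamDestroy(stream[i]);
    }
    cudaFreeHost(h_pinned);
}
```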
## Guardrails
### Error Checking
- ALWAYS check CUDA API return values with a macro wrapper
- ALWAYS call `cudaGetLastError()` after every kernel launch
- ALWAYS call `cudaDeviceSynchronize()` before reading kernel results on the host
- Use `compute-sanitizer` (successor to `cuda-memcheck`) in development builds
- Handle `cudaErrorMemoryAllocation` gracefully; never assume GPU memory is infinite
```cuda
#include <cstdio>   // fprintf
#include <cstdlib>  // exit, EXIT_FAILURE

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = call;                                       \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

#define CUDA_CHECK_KERNEL()                                           \
    do {                                                              \
        cudaError_t err = cudaGetLastError();                         \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "Kernel launch error at %s:%d: %s\n",     \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)
```
### Memory Management
- Pair every `cudaMalloc` with a `cudaFree`; prefer RAII wrappers in C++ host code
- Use `cudaMallocManaged` (Unified Memory) for prototyping; switch to explicit transfers for production
- Use `cudaMallocHost` (pinned memory) when streaming data to the GPU; pageable-memory copies cannot overlap with compute
- Prefer `cudaMemcpyAsync` with streams over synchronous `cudaMemcpy`
- Never access device pointers from host code or host pointers from device code (except Unified Memory)
- Call `cudaMemset` or `cudaMemsetAsync` to zero-initialize device buffers
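The RAII suggestion above can be sketched as a minimal owning wrapper (a simplified stand-in for something like `thrust::device_vector`; assumes the `CUDA_CHECK` macro from the Error Checking section):

```cuda
#include <cstddef>

// Minimal RAII wrapper for a device allocation: cudaFree runs
// automatically when the object goes out of scope, even on early
// return or exception.
template <typename T>
class DeviceBuffer {
public:
    explicit DeviceBuffer(size_t count) : count_(count) {
        CUDA_CHECK(cudaMalloc(&ptr_, count * sizeof(T)));
    }
    ~DeviceBuffer() { cudaFree(ptr_); }  // Destructors must not throw

    // Non-copyable: exactly one owner per allocation
    DeviceBuffer(const DeviceBuffer&) = delete;
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;

    T* get() const { return ptr_; }
    size_t size() const { return count_; }

private:
    T* ptr_ = nullptr;
    size_t count_ = 0;
};

// Usage sketch:
// DeviceBuffer<float> d_x(1 << 20);
// my_kernel<<<grid, block>>>(d_x.get(), (int)d_x.size());
```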
### Kernel Design
- Block size must be a multiple of warp size (32); prefer 128, 256, or 512
- Calculate grid size as `(n + block_size - 1) / block_size`
- Always include a bounds check, `if (idx < n)`, at the top of every kernel
- Use grid-stride loops for kernels that must handle arbitrary data sizes
- Document thread mapping: which dimension maps to which data axis
- Mark device-only helpers as `__device__`, host+device helpers as `__host__ __device__`
```cuda
// Grid-stride loop: works with any grid size, any data size
__global__ void saxpy(float a, const float* x, float* y, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        y[i] = a * x[i] + y[i];
    }
}
```
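The ceiling-division grid-size formula from the bullet list above is plain integer arithmetic and can be sanity-checked on the host:

```cpp
#include <cassert>

// Grid size: smallest number of blocks covering n elements.
// (n + block_size - 1) / block_size rounds up in integer division.
int grid_size(int n, int block_size) {
    return (n + block_size - 1) / block_size;
}
```

Note the off-by-one behavior at exact multiples: 1024 elements with 256-thread blocks needs exactly 4 blocks, while 1025 elements needs 5.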
### Synchronization
- Use `__syncthreads()` after every shared memory write, before any thread reads another thread's value
- Never place `__syncthreads()` inside a conditional branch that not all threads in a block will reach (deadlock)
- Use `__syncwarp()` (CUDA 9+) for warp-level synchronization instead of relying on implicit warp-synchronous execution
- Use `cudaDeviceSynchronize()` sparingly in production; prefer stream synchronization with `cudaStreamSynchronize()`
- Use CUDA events (`cudaEventRecord` / `cudaEventSynchronize`) for fine-grained inter-stream ordering
### Performance

- Profile before optimizing: use Nsight Compute for kernel analysis, Nsight Systems for a system-level view
- Target >50% theoretical occupancy; use the CUDA Occupancy Calculator to tune block dimensions
- Aim for >60% of peak memory bandwidth in memory-bound kernels
- Avoid warp divergence: ensure threads within a warp take the same branch when possible
- Prefer `float` over `double` on consumer GPUs, where FP64 throughput is often 1/32 of FP32 or worse
- Minimize atomic operations on global memory; use shared memory atomics with a final per-block reduction
## Memory Hierarchy
Understanding the memory hierarchy is the single most important factor in CUDA performance.
| Memory Type | Scope | Latency (cycles) | Size | Cached | Read/Write |
|---|---|---|---|---|---|
| Registers | Thread | 1 | ~255 per thread | N/A | R/W |
| Shared | Block | ~5 | 48-164 KB per SM | N/A | R/W |
| L1 Cache | SM | ~28 | 48-192 KB per SM | Auto | R |
| L2 Cache | Device | ~200 | 4-40 MB | Auto | R/W |
| Global | Device | ~400-600 | 4-80 GB (HBM/GDDR) | Yes | R/W |
| Constant | Device | ~5 (cached) | 64 KB | Yes (broadcast) | R |
| Texture | Device | ~400 (cached) | Global pool | Yes (spatial) | R |
Decision guide:

- Data reused within a thread -> registers (automatic via local variables)
- Data shared across threads in a block -> `__shared__` memory
- Read-only data broadcast to all threads -> `__constant__` memory
- Large read-only data with spatial locality -> texture memory
- Everything else -> global memory with coalesced access patterns
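As an illustration of the `__constant__` branch of this guide, coefficients broadcast to every thread can live in constant memory. This is a sketch: the filter size and kernel name are arbitrary examples.

```cuda
// Read-only coefficients broadcast to all threads: constant memory is
// cached and serves a warp-wide broadcast in a single transaction.
#define FILTER_SIZE 7
__constant__ float c_filter[FILTER_SIZE];

// 1D convolution where every thread reads the same coefficients
__global__ void apply_filter(const float* in, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float sum = 0.0f;
    for (int k = 0; k < FILTER_SIZE; k++) {
        int j = idx + k - FILTER_SIZE / 2;
        if (j >= 0 && j < n) sum += c_filter[k] * in[j];
    }
    out[idx] = sum;
}

// Host side: upload coefficients with cudaMemcpyToSymbol
// float h_filter[FILTER_SIZE] = { /* ... */ };
// cudaMemcpyToSymbol(c_filter, h_filter, sizeof(h_filter));
```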
## Key Patterns

### Kernel Launch Configuration
```cuda
// Forward declaration of the kernel being tuned
__global__ void my_kernel(const float* input, float* output, int n);

// Query the runtime for an optimal launch configuration
void launch_optimized(const float* input, float* output, int n) {
    int block_size;
    int min_grid_size;

    // Let the runtime suggest a block size for maximum occupancy
    // (args: min grid size, block size, kernel,
    //  dynamic shared memory bytes, block size limit; 0 = no limit)
    cudaOccupancyMaxPotentialBlockSize(
        &min_grid_size, &block_size, my_kernel, 0, 0);

    int grid_size = (n + block_size - 1) / block_size;
    my_kernel<<<grid_size, block_size>>>(input, output, n);
    CUDA_CHECK_KERNEL();
}
```
### Coalesced Memory Access
```cuda
// BAD: Strided access -- adjacent threads access non-adjacent memory.
// Each warp issues 32 separate memory transactions.
__global__ void transpose_naive(const float* in, float* out, int W, int H) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < W && y < H) {
        out[x * H + y] = in[y * W + x];  // Write is strided
    }
}

// GOOD: Use shared memory to coalesce both reads and writes
__global__ void transpose_coalesced(
    const float* in, float* out, int W, int H
) {
    __shared__ float tile[32][33];  // +1 padding avoids bank conflicts

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;

    if (x < W && y < H) {
        tile[threadIdx.y][threadIdx.x] = in[y * W + x];  // Coalesced read
    }
    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;

    if (x < H && y < W) {
        out[y * H + x] = tile[threadIdx.x][threadIdx.y];  // Coalesced write
    }
}
```
### Shared Memory Reduction
```cuda
// Dot product of two vectors using a shared memory reduction.
// Assumes a launch with blockDim.x == 256 (a power of two, to match
// the cache size and the tree reduction) and *result zero-initialized
// before the launch.
__global__ void dot_product(
    const float* a, const float* b, float* result, int n
) {
    __shared__ float cache[256];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread computes its partial sum via grid-stride
    float partial = 0.0f;
    for (int i = idx; i < n; i += blockDim.x * gridDim.x) {
        partial += a[i] * b[i];
    }
    cache[tid] = partial;
    __syncthreads();

    // Tree reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            cache[tid] += cache[tid + s];
        }
        __syncthreads();
    }

    // One global atomic per block combines the per-block results
    if (tid == 0) {
        atomicAdd(result, cache[0]);
    }
}
```
### Warp-Level Primitives (CUDA 9+)
```cuda
// Warp-level reduction using shuffle instructions -- no shared memory needed
__device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xFFFFFFFF, val, offset);
    }
    return val;
}

// Block-level reduction combining warp shuffles and shared memory
__device__ float block_reduce_sum(float val) {
    __shared__ float warp_sums[32];  // One slot per warp (max 32 warps/block)

    int lane = threadIdx.x % warpSize;
    int warp_id = threadIdx.x / warpSize;

    val = warp_reduce_sum(val);

    if (lane == 0) {
        warp_sums[warp_id] = val;
    }
    __syncthreads();

    // First warp reduces the warp sums
    int num_warps = (blockDim.x + warpSize - 1) / warpSize;
    val = (threadIdx.x < num_warps) ? warp_sums[threadIdx.x] : 0.0f;
    if (warp_id == 0) {
        val = warp_reduce_sum(val);
    }

    return val;
}
```
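The lane and warp indexing used above is plain integer arithmetic, so the mapping can be checked on the host. These helper names are illustrative, not part of the CUDA API:

```cpp
#include <cassert>

// Thread index -> (warp, lane) decomposition as used by block_reduce_sum
const int kWarpSize = 32;

int lane_of(int tid)    { return tid % kWarpSize; }          // Position within warp
int warp_of(int tid)    { return tid / kWarpSize; }          // Warp index in block
int warps_in(int block) { return (block + kWarpSize - 1) / kWarpSize; }
```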
### Occupancy Calculator
```cuda
// Forward declaration of the kernel being measured
__global__ void my_kernel(const float* input, float* output, int n);

// Query occupancy at runtime for tuning
void report_occupancy() {
    int block_size = 256;
    int num_blocks;

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &num_blocks, my_kernel, block_size, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int active_warps = num_blocks * (block_size / prop.warpSize);
    int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    float occupancy = (float)active_warps / max_warps;

    printf("Occupancy: %.1f%% (%d/%d warps)\n",
           occupancy * 100, active_warps, max_warps);
}
```
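The occupancy arithmetic in `report_occupancy` is independent of any GPU and can be checked directly. The device numbers below (warp size 32, 2048 threads per SM) are typical values, not queried from hardware:

```cpp
#include <cassert>

// Occupancy = active warps per SM / maximum warps per SM
float occupancy(int blocks_per_sm, int block_size,
                int warp_size, int max_threads_per_sm) {
    int active_warps = blocks_per_sm * (block_size / warp_size);
    int max_warps = max_threads_per_sm / warp_size;
    return (float)active_warps / max_warps;
}
```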
### Nsight Profiling Workflow
```bash
# System-level trace: find CPU/GPU idle gaps, stream concurrency
nsys profile -o trace ./program
nsys stats trace.nsys-rep

# Kernel-level analysis: roofline, memory throughput, occupancy
ncu --set full -o kernel_report ./program
ncu -i kernel_report.ncu-rep   # Print the saved report (or open the .ncu-rep file in the Nsight Compute GUI)

# Quick single-metric check
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed ./program
```
### Memory Bandwidth Measurement
```cuda
// Measure the effective bandwidth of a simple copy kernel.
// Assumes d_in/d_out are device buffers of n floats and that
// copy_kernel is defined elsewhere.
void measure_bandwidth(const float* d_in, float* d_out, int n) {
    size_t bytes = 2ull * n * sizeof(float);  // Read in + write out

    int block = 256;
    int grid = (n + block - 1) / block;

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    CUDA_CHECK(cudaEventRecord(start));
    copy_kernel<<<grid, block>>>(d_in, d_out, n);
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));

    float ms = 0;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));

    float gb_per_sec = bytes / (ms * 1e6);
    printf("Effective bandwidth: %.2f GB/s\n", gb_per_sec);

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
}
```
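The formula at the end of that function is a unit conversion: bytes over nanoseconds-scaled time. A host-side check with hypothetical values confirms the scaling:

```cpp
#include <cassert>

// GB/s from bytes moved and elapsed milliseconds:
// bytes / (ms * 1e6) == (bytes / 1e9 GB) / (ms / 1e3 s)
double effective_bandwidth_gbps(double bytes, double ms) {
    return bytes / (ms * 1e6);
}
```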
## Essential Commands
```bash
# Compile CUDA code
nvcc -arch=sm_80 -O3 -o program main.cu          # Single file
nvcc -arch=native -lineinfo -o program main.cu   # With debug line info

# CMake build
cmake -B build -DCMAKE_CUDA_ARCHITECTURES="70;80;86"
cmake --build build -j$(nproc)

# Runtime debugging
compute-sanitizer ./program                      # Memory errors (replaces cuda-memcheck)
compute-sanitizer --tool racecheck ./program     # Shared memory race conditions
compute-sanitizer --tool initcheck ./program     # Uninitialized device memory reads
compute-sanitizer --tool synccheck ./program     # Synchronization errors

# Profiling
nsys profile ./program                           # System-level timeline
ncu ./program                                    # Kernel-level metrics
ncu --kernel-name my_kernel --launch-skip 2 --launch-count 1 ./program

# Device info
nvidia-smi                                       # GPU status and memory usage
nvcc --version                                   # CUDA compiler version
```
## CMakeLists.txt Template
```cmake
cmake_minimum_required(VERSION 3.18)
project(myproject LANGUAGES CXX CUDA)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_ARCHITECTURES 70 80 86)
set(CMAKE_CUDA_SEPARABLE_COMPILATION ON)

find_package(CUDAToolkit REQUIRED)

add_library(kernels src/kernels.cu)
target_include_directories(kernels PUBLIC include)

add_executable(main src/main.cpp)
target_link_libraries(main kernels CUDA::cudart)

enable_testing()
add_executable(tests tests/test_kernels.cu)
target_link_libraries(tests kernels CUDA::cudart)
add_test(NAME gpu_tests COMMAND tests)
```