# CUDA Guide
Applies to: CUDA 11+, GPU Computing, Deep Learning, Scientific Computing, HPC
## Core Principles
- Parallelism First: Design algorithms for thousands of concurrent threads; serial thinking is the primary enemy of GPU performance
- Memory Hierarchy Awareness: Global memory is 100x slower than shared memory and 1000x slower than registers; every kernel design starts with memory access planning
- Coalesced Access: Adjacent threads must access adjacent memory addresses; a single misaligned access pattern can reduce bandwidth by 32x
- Occupancy Over Cleverness: Maximize active warps per SM by managing register count, shared memory usage, and block dimensions together
- Minimize Host-Device Transfers: PCIe bandwidth is the bottleneck; overlap transfers with computation using streams and pinned memory
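The transfer-overlap principle can be sketched with pinned memory and two streams. This is a minimal sketch, not a production pipeline: `process_chunk` is a hypothetical kernel name, and error checking is omitted for brevity.

```cuda
#include <cstring>

// Hypothetical kernel operating on one chunk of data
__global__ void process_chunk(float* data, int n);

// Overlap host-to-device copies with kernel execution by ping-ponging
// between two streams. Pinned host memory is required for async copies.
void pipeline(const float* h_data, int total, int chunk) {
    float* h_pinned;
    cudaMallocHost(&h_pinned, total * sizeof(float));  // Pinned: async-capable
    memcpy(h_pinned, h_data, total * sizeof(float));

    float* d_buf[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; i++) {
        cudaMalloc(&d_buf[i], chunk * sizeof(float));
        cudaStreamCreate(&stream[i]);
    }

    for (int off = 0, i = 0; off < total; off += chunk, i ^= 1) {
        int n = (off + chunk <= total) ? chunk : total - off;
        // The copy on stream[i] overlaps with the kernel still
        // running on the other stream; within a stream, the kernel
        // waits for its own copy to finish.
        cudaMemcpyAsync(d_buf[i], h_pinned + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[i]);
        process_chunk<<<(n + 255) / 256, 256, 0, stream[i]>>>(d_buf[i], n);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; i++) {
        cudaFree(d_buf[i]);
        cudaStreamDestroy(stream[i]);
    }
    cudaFreeHost(h_pinned);
}
```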
## Guardrails
### Error Checking
- ALWAYS check CUDA API return values with a macro wrapper
- ALWAYS call `cudaGetLastError()` after every kernel launch
- ALWAYS call `cudaDeviceSynchronize()` before reading kernel results on the host
- Use `compute-sanitizer` (successor to `cuda-memcheck`) in development builds
- Handle `cudaErrorMemoryAllocation` gracefully; never assume GPU memory is infinite
```cuda
#include <cstdio>   // fprintf
#include <cstdlib>  // exit, EXIT_FAILURE

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = call;                                       \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

#define CUDA_CHECK_KERNEL()                                           \
    do {                                                              \
        cudaError_t err = cudaGetLastError();                         \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "Kernel launch error at %s:%d: %s\n",     \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)
```
### Memory Management
- Pair every `cudaMalloc` with a `cudaFree`; prefer RAII wrappers in C++ host code
- Use `cudaMallocManaged` (Unified Memory) for prototyping; switch to explicit transfers for production
- Use `cudaMallocHost` (pinned memory) when streaming data to the GPU; pageable-memory copies cannot overlap with compute
- Prefer `cudaMemcpyAsync` with streams over synchronous `cudaMemcpy`
- Never access device pointers from host code or host pointers from device code (except Unified Memory)
- Call `cudaMemset` or `cudaMemsetAsync` to zero-initialize device buffers
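The RAII suggestion above can be sketched as a minimal owning wrapper (a simplified stand-in for something like `thrust::device_vector`; assumes the `CUDA_CHECK` macro from the Error Checking section):

```cuda
#include <cstddef>

// Minimal RAII wrapper for a device allocation: cudaFree runs
// automatically when the object goes out of scope, even on early
// return or exception.
template <typename T>
class DeviceBuffer {
public:
    explicit DeviceBuffer(size_t count) : count_(count) {
        CUDA_CHECK(cudaMalloc(&ptr_, count * sizeof(T)));
    }
    ~DeviceBuffer() { cudaFree(ptr_); }  // Destructors must not throw

    // Non-copyable: exactly one owner per allocation
    DeviceBuffer(const DeviceBuffer&) = delete;
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;

    T* get() const { return ptr_; }
    size_t size() const { return count_; }

private:
    T* ptr_ = nullptr;
    size_t count_ = 0;
};

// Usage sketch:
// DeviceBuffer<float> d_x(1 << 20);
// my_kernel<<<grid, block>>>(d_x.get(), (int)d_x.size());
```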
### Kernel Design
- Block size must be a multiple of warp size (32); prefer 128, 256, or 512
- Calculate grid size as `(n + block_size - 1) / block_size`
- Always include a bounds check, `if (idx < n)`, at the top of every kernel
- Use grid-stride loops for kernels that must handle arbitrary data sizes
- Document thread mapping: which dimension maps to which data axis
- Mark device-only helpers as `__device__`, host+device helpers as `__host__ __device__`
```cuda
// Grid-stride loop: works with any grid size, any data size
__global__ void saxpy(float a, const float* x, float* y, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        y[i] = a * x[i] + y[i];
    }
}
```
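The ceiling-division grid-size formula from the bullet list above is plain integer arithmetic and can be sanity-checked on the host:

```cpp
#include <cassert>

// Grid size: smallest number of blocks covering n elements.
// (n + block_size - 1) / block_size rounds up in integer division.
int grid_size(int n, int block_size) {
    return (n + block_size - 1) / block_size;
}
```

Note the off-by-one behavior at exact multiples: 1024 elements with 256-thread blocks needs exactly 4 blocks, while 1025 elements needs 5.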
### Synchronization
- Use `__syncthreads()` after every shared memory write, before any thread reads another thread's value
- Never place `__syncthreads()` inside a conditional branch that not all threads in a block will reach (deadlock)
- Use `__syncwarp()` (CUDA 9+) for warp-level synchronization instead of relying on implicit warp-synchronous execution
- Use `cudaDeviceSynchronize()` sparingly in production; prefer stream synchronization with `cudaStreamSynchronize()`
- Use CUDA events (`cudaEventRecord` / `cudaEventSynchronize`) for fine-grained inter-stream ordering
### Performance

- Profile before optimizing: use Nsight Compute for kernel analysis, Nsight Systems for a system-level view
- Target >50% theoretical occupancy; use the CUDA Occupancy Calculator to tune block dimensions
- Aim for >60% of peak memory bandwidth in memory-bound kernels
- Avoid warp divergence: ensure threads within a warp take the same branch when possible
- Prefer `float` over `double` on consumer GPUs, where FP64 throughput is often 1/32 of FP32 or worse
- Minimize atomic operations on global memory; use shared memory atomics with a final per-block reduction
## Memory Hierarchy
Understanding the memory hierarchy is the single most important factor in CUDA performance.
| Memory Type | Scope | Latency (cycles) | Size | Cached | Read/Write |
|---|---|---|---|---|---|
| Registers | Thread | 1 | ~255 per thread | N/A | R/W |
| Shared | Block | ~5 | 48-164 KB per SM | N/A | R/W |
| L1 Cache | SM | ~28 | 48-192 KB per SM | Auto | R |
| L2 Cache | Device | ~200 | 4-40 MB | Auto | R/W |
| Global | Device | ~400-600 | 4-80 GB (HBM/GDDR) | Yes | R/W |
| Constant | Device | ~5 (cached) | 64 KB | Yes (broadcast) | R |
| Texture | Device | ~400 (cached) | Global pool | Yes (spatial) | R |
Decision guide:

- Data reused within a thread -> registers (automatic via local variables)
- Data shared across threads in a block -> `__shared__` memory
- Read-only data broadcast to all threads -> `__constant__` memory
- Large read-only data with spatial locality -> texture memory
- Everything else -> global memory with coalesced access patterns
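As an illustration of the `__constant__` branch of this guide, coefficients broadcast to every thread can live in constant memory. This is a sketch: the filter size and kernel name are arbitrary examples.

```cuda
// Read-only coefficients broadcast to all threads: constant memory is
// cached and serves a warp-wide broadcast in a single transaction.
#define FILTER_SIZE 7
__constant__ float c_filter[FILTER_SIZE];

// 1D convolution where every thread reads the same coefficients
__global__ void apply_filter(const float* in, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float sum = 0.0f;
    for (int k = 0; k < FILTER_SIZE; k++) {
        int j = idx + k - FILTER_SIZE / 2;
        if (j >= 0 && j < n) sum += c_filter[k] * in[j];
    }
    out[idx] = sum;
}

// Host side: upload coefficients with cudaMemcpyToSymbol
// float h_filter[FILTER_SIZE] = { /* ... */ };
// cudaMemcpyToSymbol(c_filter, h_filter, sizeof(h_filter));
```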
## Key Patterns

### Kernel Launch Configuration
```cuda
// Forward declaration of the kernel being tuned
__global__ void my_kernel(const float* input, float* output, int n);

// Query the runtime for an optimal launch configuration
void launch_optimized(const float* input, float* output, int n) {
    int block_size;
    int min_grid_size;

    // Let the runtime suggest a block size for maximum occupancy
    // (args: min grid size, block size, kernel,
    //  dynamic shared memory bytes, block size limit; 0 = no limit)
    cudaOccupancyMaxPotentialBlockSize(
        &min_grid_size, &block_size, my_kernel, 0, 0);

    int grid_size = (n + block_size - 1) / block_size;
    my_kernel<<<grid_size, block_size>>>(input, output, n);
    CUDA_CHECK_KERNEL();
}
```
### Coalesced Memory Access
```cuda
// BAD: Strided access -- adjacent threads access non-adjacent memory.
// Each warp issues 32 separate memory transactions.
__global__ void transpose_naive(const float* in, float* out, int W, int H) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < W && y < H) {
        out[x * H + y] = in[y * W + x];  // Write is strided
    }
}

// GOOD: Use shared memory to coalesce both reads and writes
__global__ void transpose_coalesced(
    const float* in, float* out, int W, int H
) {
    __shared__ float tile[32][33];  // +1 padding avoids bank conflicts

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;

    if (x < W && y < H) {
        tile[threadIdx.y][threadIdx.x] = in[y * W + x];  // Coalesced read
    }
    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;

    if (x < H && y < W) {
        out[y * H + x] = tile[threadIdx.x][threadIdx.y];  // Coalesced write
    }
}
```
### Shared Memory Reduction
```cuda
// Dot product of two vectors using a shared memory reduction.
// Assumes a launch with blockDim.x == 256 (a power of two, to match
// the cache size and the tree reduction) and *result zero-initialized
// before the launch.
__global__ void dot_product(
    const float* a, const float* b, float* result, int n
) {
    __shared__ float cache[256];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread computes its partial sum via grid-stride
    float partial = 0.0f;
    for (int i = idx; i < n; i += blockDim.x * gridDim.x) {
        partial += a[i] * b[i];
    }
    cache[tid] = partial;
    __syncthreads();

    // Tree reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            cache[tid] += cache[tid + s];
        }
        __syncthreads();
    }

    // One global atomic per block combines the per-block results
    if (tid == 0) {
        atomicAdd(result, cache[0]);
    }
}
```
### Warp-Level Primitives (CUDA 9+)
```cuda
// Warp-level reduction using shuffle instructions -- no shared memory needed
__device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xFFFFFFFF, val, offset);
    }
    return val;
}

// Block-level reduction combining warp shuffles and shared memory
__device__ float block_reduce_sum(float val) {
    __shared__ float warp_sums[32];  // One slot per warp (max 32 warps/block)

    int lane = threadIdx.x % warpSize;
    int warp_id = threadIdx.x / warpSize;

    val = warp_reduce_sum(val);

    if (lane == 0) {
        warp_sums[warp_id] = val;
    }
    __syncthreads();

    // First warp reduces the warp sums
    int num_warps = (blockDim.x + warpSize - 1) / warpSize;
    val = (threadIdx.x < num_warps) ? warp_sums[threadIdx.x] : 0.0f;
    if (warp_id == 0) {
        val = warp_reduce_sum(val);
    }

    return val;
}
```
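The lane and warp indexing used above is plain integer arithmetic, so the mapping can be checked on the host. These helper names are illustrative, not part of the CUDA API:

```cpp
#include <cassert>

// Thread index -> (warp, lane) decomposition as used by block_reduce_sum
const int kWarpSize = 32;

int lane_of(int tid)    { return tid % kWarpSize; }          // Position within warp
int warp_of(int tid)    { return tid / kWarpSize; }          // Warp index in block
int warps_in(int block) { return (block + kWarpSize - 1) / kWarpSize; }
```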
### Occupancy Calculator
```cuda
// Forward declaration of the kernel being measured
__global__ void my_kernel(const float* input, float* output, int n);

// Query occupancy at runtime for tuning
void report_occupancy() {
    int block_size = 256;
    int num_blocks;

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &num_blocks, my_kernel, block_size, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int active_warps = num_blocks * (block_size / prop.warpSize);
    int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    float occupancy = (float)active_warps / max_warps;

    printf("Occupancy: %.1f%% (%d/%d warps)\n",
           occupancy * 100, active_warps, max_warps);
}
```
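The occupancy arithmetic in `report_occupancy` is independent of any GPU and can be checked directly. The device numbers below (warp size 32, 2048 threads per SM) are typical values, not queried from hardware:

```cpp
#include <cassert>

// Occupancy = active warps per SM / maximum warps per SM
float occupancy(int blocks_per_sm, int block_size,
                int warp_size, int max_threads_per_sm) {
    int active_warps = blocks_per_sm * (block_size / warp_size);
    int max_warps = max_threads_per_sm / warp_size;
    return (float)active_warps / max_warps;
}
```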
### Nsight Profiling Workflow
```bash
# System-level trace: find CPU/GPU idle gaps, stream concurrency
nsys profile -o trace ./program
nsys stats trace.nsys-rep

# Kernel-level analysis: roofline, memory throughput, occupancy
ncu --set full -o kernel_report ./program
ncu -i kernel_report.ncu-rep   # Print the saved report (or open the .ncu-rep file in the Nsight Compute GUI)

# Quick single-metric check
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed ./program
```
### Memory Bandwidth Measurement
```cuda
// Measure the effective bandwidth of a simple copy kernel.
// Assumes d_in/d_out are device buffers of n floats and that
// copy_kernel is defined elsewhere.
void measure_bandwidth(const float* d_in, float* d_out, int n) {
    size_t bytes = 2ull * n * sizeof(float);  // Read in + write out

    int block = 256;
    int grid = (n + block - 1) / block;

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    CUDA_CHECK(cudaEventRecord(start));
    copy_kernel<<<grid, block>>>(d_in, d_out, n);
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));

    float ms = 0;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));

    float gb_per_sec = bytes / (ms * 1e6);
    printf("Effective bandwidth: %.2f GB/s\n", gb_per_sec);

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
}
```
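The formula at the end of that function is a unit conversion: bytes over nanoseconds-scaled time. A host-side check with hypothetical values confirms the scaling:

```cpp
#include <cassert>

// GB/s from bytes moved and elapsed milliseconds:
// bytes / (ms * 1e6) == (bytes / 1e9 GB) / (ms / 1e3 s)
double effective_bandwidth_gbps(double bytes, double ms) {
    return bytes / (ms * 1e6);
}
```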
## Essential Commands
```bash
# Compile CUDA code
nvcc -arch=sm_80 -O3 -o program main.cu          # Single file
nvcc -arch=native -lineinfo -o program main.cu   # With debug line info

# CMake build
cmake -B build -DCMAKE_CUDA_ARCHITECTURES="70;80;86"
cmake --build build -j$(nproc)

# Runtime debugging
compute-sanitizer ./program                      # Memory errors (replaces cuda-memcheck)
compute-sanitizer --tool racecheck ./program     # Shared memory race conditions
compute-sanitizer --tool initcheck ./program     # Uninitialized device memory reads
compute-sanitizer --tool synccheck ./program     # Synchronization errors

# Profiling
nsys profile ./program                           # System-level timeline
ncu ./program                                    # Kernel-level metrics
ncu --kernel-name my_kernel --launch-skip 2 --launch-count 1 ./program

# Device info
nvidia-smi                                       # GPU status and memory usage
nvcc --version                                   # CUDA compiler version
```
## CMakeLists.txt Template
```cmake
cmake_minimum_required(VERSION 3.18)
project(myproject LANGUAGES CXX CUDA)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_ARCHITECTURES 70 80 86)
set(CMAKE_CUDA_SEPARABLE_COMPILATION ON)

find_package(CUDAToolkit REQUIRED)

add_library(kernels src/kernels.cu)
target_include_directories(kernels PUBLIC include)

add_executable(main src/main.cpp)
target_link_libraries(main kernels CUDA::cudart)

enable_testing()
add_executable(tests tests/test_kernels.cu)
target_link_libraries(tests kernels CUDA::cudart)
add_test(NAME gpu_tests COMMAND tests)
```