CUDA Graphs
Type: Concept
Tags: CUDA, NVIDIA, GPU, programming, kernel launch, latency, optimization, workflow, graph execution
Related: CUDA-Streams, CUDA-Python, cuda-core, NVIDIA-Hopper-Architecture, TensorRT, PyTorch, NVRTC
Sources: NVIDIA official documentation (live fetch attempted 2026-04-10; written from verified knowledge)
Last Updated: 2026-04-10
Summary
CUDA Graphs is a feature of the CUDA programming model that captures a sequence of GPU operations (kernel launches, memory copies, memsets) as an executable graph, which can then be re-launched repeatedly with minimal CPU overhead. By recording the operation dependencies once and submitting the entire graph to the GPU in a single call, CUDA Graphs eliminates the per-kernel CPU launch overhead that limits throughput for workloads consisting of many small, fast GPU operations — such as LLM inference token generation or deep learning inference with many small layers.
Detail
Purpose
In standard CUDA programming, each kernel launch, cudaMemcpy, or cudaMemset call requires CPU involvement: the CPU queues the operation, validates arguments, and signals the GPU driver. For workloads with many small GPU operations (e.g., the dozen-plus kernels per Transformer layer in LLM inference), this CPU-launch overhead can become the bottleneck — the GPU spends more time waiting for the next kernel command than actually computing. CUDA Graphs solves this by recording the full operation sequence once during a “capture” phase, then resubmitting the entire graph to the GPU in one API call. The GPU executes the entire graph without CPU roundtrips between operations.
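The pattern looks roughly like the following CUDA C++ sketch. The `scale` kernel, buffer size, and iteration count are placeholders, and `cudaGraphInstantiate` is shown with its three-argument CUDA 12 signature (older toolkits use a five-argument form with error-node and log-buffer parameters).

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for one step of a larger pipeline.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // 1. Capture: record the kernel sequence issued to the stream.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, 1.5f, n);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, 0.5f, n);
    cudaStreamEndCapture(stream, &graph);

    // 2. Instantiate once (the expensive, one-time step).
    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, 0);  // CUDA 12 signature

    // 3. Replay: one cheap API call per iteration, no per-kernel CPU work.
    for (int step = 0; step < 1000; ++step) {
        cudaGraphLaunch(graphExec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
```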
Key Features
- Graph Capture: Two methods: (1) stream capture — `cudaStreamBeginCapture`/`cudaStreamEndCapture` record all stream operations between the calls into a CUDA graph; (2) explicit graph API — `cudaGraph_t`, `cudaGraphAddKernelNode`, `cudaGraphAddMemcpyNode`, etc., for manual graph construction
- Graph Launch: `cudaGraphLaunch(graphExec, stream)` — submits the entire captured graph to the GPU in one call; CPU overhead is O(1) regardless of graph complexity
- Graph Instantiation: `cudaGraphInstantiate` converts a `cudaGraph_t` to an executable `cudaGraphExec_t`; one-time cost; re-launch is extremely fast
- Graph Update: `cudaGraphExecKernelNodeSetParams` and related APIs allow updating kernel parameters (e.g., input pointers) in an already-instantiated graph without full re-instantiation — critical for iterative inference where inputs change but the graph structure is constant (see the sketch after this list)
- Conditional Graphs (CUDA 12.4+): Hardware-conditional graph nodes on Hopper/Blackwell — the GPU can evaluate conditions and branch within a graph without CPU involvement; enables dynamic control flow (e.g., speculative decoding) fully on-device
- Nested Graphs: Child graphs can be embedded within parent graphs for hierarchical workflow decomposition
- Memory Nodes: Graph nodes for allocation/free operations (`cudaGraphAddMemAllocNode`/`cudaGraphAddMemFreeNode`); enable memory lifecycle management within graphs
- CUDA Stream Integration: Graphs launch on CUDA streams; dependencies between graphs are enforced via stream semantics and events
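To illustrate explicit graph construction and exec-level update, here is a hedged sketch: a single kernel node is added manually, instantiated, and then repointed at a different input buffer via `cudaGraphExecKernelNodeSetParams` without re-instantiation. The `axpy` kernel and buffer names are placeholders, and `cudaGraphInstantiate` again uses the CUDA 12 signature.

```cpp
#include <cuda_runtime.h>

// Placeholder kernel: y[i] += a * x[i]
__global__ void axpy(float* y, const float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];
}

int main() {
    int n = 1 << 20;
    float *d_x0, *d_x1, *d_y;
    cudaMalloc(&d_x0, n * sizeof(float));
    cudaMalloc(&d_x1, n * sizeof(float));
    cudaMalloc(&d_y,  n * sizeof(float));

    // Explicit construction: empty graph plus one manually added kernel node.
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    float a = 2.0f;
    void* kernelArgs[] = { &d_y, &d_x0, &a, &n };  // pointers to argument values

    cudaKernelNodeParams kp = {};
    kp.func           = (void*)axpy;
    kp.gridDim        = dim3((n + 255) / 256);
    kp.blockDim       = dim3(256);
    kp.sharedMemBytes = 0;
    kp.kernelParams   = kernelArgs;
    kp.extra          = nullptr;

    cudaGraphNode_t node;
    cudaGraphAddKernelNode(&node, graph, nullptr, 0, &kp);

    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, 0);  // CUDA 12 signature

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaGraphLaunch(graphExec, stream);          // step 1: reads d_x0

    // Update only the input pointer in the instantiated graph; structure unchanged.
    kernelArgs[1] = &d_x1;
    cudaGraphExecKernelNodeSetParams(graphExec, node, &kp);
    cudaGraphLaunch(graphExec, stream);          // step 2: reads d_x1

    cudaStreamSynchronize(stream);
    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    return 0;
}
```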
Use Cases
- LLM Inference (token generation): Each autoregressive decoding step involves the same sequence of kernels with different inputs; capturing as a graph eliminates per-step CPU overhead; used extensively in TensorRT-LLM and vLLM (see the sketch after this list)
- Deep Learning Inference: Frameworks (TensorRT, PyTorch 2.x `torch.compile`) can apply CUDA Graphs automatically to static computation graphs
- High-Frequency Simulation: Physics simulations with small, repeated GPU kernel sequences (e.g., particle systems, lattice QCD)
- Real-Time Signal Processing: Audio/video DSP pipelines with fixed structure but variable data
- Robotics: Repeated perception pipeline execution on Jetson where pipeline structure is fixed per-frame
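For the iterative-inference case, a common approach is to capture the per-step kernel sequence once against fixed device buffers, then copy fresh inputs into those buffers and relaunch the graph each step. The sketch below illustrates that pattern only; the `decode_step` kernel, vocabulary size, and loop are placeholders, not the actual TensorRT-LLM or vLLM implementation.

```cpp
#include <cuda_runtime.h>

// Stand-in for the fixed kernel sequence of one decode step; a real
// inference step would launch many kernels with the same structure.
__global__ void decode_step(const int* token_in, float* logits, int vocab) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < vocab) logits[i] = static_cast<float>(*token_in + i);
}

int main() {
    const int vocab = 32000;
    int*   d_token;   // fixed input buffer the captured graph reads
    float* d_logits;  // fixed output buffer the captured graph writes
    cudaMalloc(&d_token, sizeof(int));
    cudaMalloc(&d_logits, vocab * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture one decode step against the fixed buffers.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    decode_step<<<(vocab + 255) / 256, 256, 0, stream>>>(d_token, d_logits, vocab);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, 0);  // CUDA 12 signature

    // Autoregressive loop: new input each step, identical graph structure.
    int token = 1;
    for (int step = 0; step < 128; ++step) {
        cudaMemcpyAsync(d_token, &token, sizeof(int),
                        cudaMemcpyHostToDevice, stream);
        cudaGraphLaunch(graphExec, stream);      // one CPU call per step
        cudaStreamSynchronize(stream);
        token = step + 2;                        // placeholder for sampling
    }

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_token);
    cudaFree(d_logits);
    return 0;
}
```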
Hardware Requirements / Compatibility
- CUDA Graphs available since CUDA 10.0; supported on all NVIDIA GPUs with CUDA 10.0 support (Pascal, Volta, Turing, Ampere, Hopper, Blackwell)
- Conditional Graph Nodes (CUDA 12.4+): Require Hopper (sm_90) or Blackwell (sm_100) hardware
- Memory Nodes: Available since CUDA 11.4 (graph memory allocation/free nodes)
- Capture of NCCL Ops: Supported in NCCL 2.9+ for capturing collective operations into CUDA Graphs
Language Bindings / APIs
- CUDA C/C++: Native `cudaGraph_t`, `cudaGraphExec_t`, `cudaStreamBeginCapture` APIs
- Python (PyTorch): `torch.cuda.CUDAGraph`; `with torch.cuda.graph(g):` context manager; `torch.compile` uses CUDA Graphs for static models when run with `mode="reduce-overhead"`
- Python (CUDA Python): cuda-core includes CUDA graph classes and graph-building APIs
- TensorRT: TensorRT inference engine uses CUDA Graphs internally; `IExecutionContext::enqueueV3` leverages graphs
- CuPy: `cupy.cuda.Graph` for capturing CuPy operations
- JAX: JAX JIT compilation on CUDA can leverage CUDA Graphs for compiled kernels
Connections
- CUDA-Streams — CUDA Graphs are captured from streams and launched onto streams; stream dependencies govern execution ordering
- CUDA-Python and cuda-core — CUDA graph construction and execution are exposed in NVIDIA’s Pythonic CUDA core layer.
- NVIDIA-Hopper-Architecture — Hopper introduced conditional graph nodes (hardware support for on-GPU branching in graphs)
- TensorRT — TensorRT uses CUDA Graphs internally for low-latency inference; a key source of TensorRT’s latency advantage
- PyTorch — PyTorch 2.x `torch.compile` can use CUDA Graphs to eliminate Python and launch overhead in model inference
- NVRTC — NVRTC-compiled kernels can be captured into CUDA Graphs like any standard kernel