CUDA Streams

Type: Concept Tags: CUDA, NVIDIA, GPU, concurrency, asynchronous, streams, kernel, overlap, programming model Related: CUDA-Graphs, CUDA-Unified-Memory, Multi-Process-Service, NVCC, CuPy Sources: NVIDIA official documentation (live fetch attempted 2026-04-10; written from verified knowledge) Last Updated: 2026-04-10

Summary

CUDA Streams are ordered sequences of GPU operations (kernel launches, memory copies, events) that execute in issue order on the GPU; operations in different streams have no implicit ordering and may execute concurrently. Streams are the fundamental mechanism for GPU concurrency in CUDA: by distributing independent work across multiple streams, developers can overlap compute with data transfer, execute multiple kernels simultaneously, and pipeline workloads to maximize GPU utilization. Almost every asynchronous CUDA operation is tied to a stream.

Detail

Purpose

A single CUDA stream is fully serial: operations execute in the order they were issued. Without streams, all GPU operations would serialize, leaving the copy engines and compute engines idle while waiting for unrelated work to complete. Multiple streams let the CUDA runtime and GPU hardware overlap operations that have no data dependency: while one stream runs a kernel, another stream can be DMA-copying data to or from host memory, and a third can run a different kernel on SMs the first kernel leaves free, all simultaneously. This is essential for hiding PCIe transfer latency and maximizing GPU occupancy.
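
In practice the pattern looks like the sketch below: each stream owns an independent copy-kernel-copy chain, so the DMA engine serving one stream overlaps the kernel running in the other. This is a minimal sketch, not canonical NVIDIA sample code; the scale kernel and buffer names are illustrative.

    #include <cuda_runtime.h>

    __global__ void scale(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const int N = 1 << 20;
        const size_t bytes = N * sizeof(float);

        float *h_a, *h_b, *d_a, *d_b;
        cudaMallocHost(&h_a, bytes);   // pinned host memory: required for overlap
        cudaMallocHost(&h_b, bytes);
        cudaMalloc(&d_a, bytes);
        cudaMalloc(&d_b, bytes);

        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // Each stream serializes its own copy -> kernel -> copy chain,
        // but the two chains are independent and overlap on the hardware.
        cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s1);
        scale<<<(N + 255) / 256, 256, 0, s1>>>(d_a, N);
        cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, s1);

        cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s2);
        scale<<<(N + 255) / 256, 256, 0, s2>>>(d_b, N);
        cudaMemcpyAsync(h_b, d_b, bytes, cudaMemcpyDeviceToHost, s2);

        cudaDeviceSynchronize();  // wait for both streams to drain

        cudaStreamDestroy(s1); cudaStreamDestroy(s2);
        cudaFree(d_a); cudaFree(d_b);
        cudaFreeHost(h_a); cudaFreeHost(h_b);
        return 0;
    }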

Key Features

  • Default (Null) Stream: CUDA operations that don’t specify a stream use the default (null) stream; the legacy null stream acts as a synchronization barrier: its operations wait for all prior work in blocking streams and block subsequent work in blocking streams
  • Non-Blocking Streams: cudaStreamCreate creates a blocking stream that still synchronizes with the null stream; passing cudaStreamNonBlocking to cudaStreamCreateWithFlags creates a stream with no implicit synchronization against the null stream
  • Priority Streams: cudaStreamCreateWithPriority creates streams with relative priorities (numerically lower values mean higher priority; query the valid range with cudaDeviceGetStreamPriorityRange); the GPU scheduler prefers higher-priority work when resources are available
  • Concurrent Kernel Execution: Modern NVIDIA GPUs (Fermi+) can execute kernels from different streams simultaneously if SM resources permit; depends on kernel resource usage (shared memory, registers, warp count)
  • Async Memory Operations: asynchronous host-device copies take a stream: cudaMemcpyAsync(dst, src, size, kind, stream); true overlap with kernel execution requires pinned (page-locked) host memory from cudaMallocHost or cudaHostAlloc, since copies from pageable memory are staged and lose most of the overlap
  • CUDA Events: cudaEventRecord(event, stream) captures a point in stream execution; cudaStreamWaitEvent(stream, event) makes one stream wait for an event recorded in another, enabling fine-grained inter-stream dependency management (see the sketch after this list)
  • Synchronization: cudaStreamSynchronize(stream) blocks CPU until all operations in stream complete; cudaDeviceSynchronize() waits for all streams
  • Stream Capture: cudaStreamBeginCapture/cudaStreamEndCapture — record stream operations into a CUDA Graph for repeated low-overhead replay
  • Copy Engine Overlap: NVIDIA GPUs have dedicated DMA copy engines separate from the compute engines (the count varies by GPU; see the capability query under Hardware Requirements); async copies on one stream overlap with kernels on another, hiding PCIe latency
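
A sketch tying the event, priority, and non-blocking pieces together; the produce/consume kernels and launch shapes are placeholders, not a fixed API pattern:

    #include <cuda_runtime.h>

    __global__ void produce(float* buf) { buf[threadIdx.x] = (float)threadIdx.x; }
    __global__ void consume(float* buf) { buf[threadIdx.x] *= 2.0f; }

    int main() {
        float* d_buf;
        cudaMalloc(&d_buf, 256 * sizeof(float));

        cudaStream_t a, b;
        cudaStreamCreateWithFlags(&a, cudaStreamNonBlocking);

        int least, greatest;  // numerically lower value = higher priority
        cudaDeviceGetStreamPriorityRange(&least, &greatest);
        cudaStreamCreateWithPriority(&b, cudaStreamNonBlocking, greatest);

        cudaEvent_t ready;
        cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);  // cheaper: no timestamp

        produce<<<1, 256, 0, a>>>(d_buf);
        cudaEventRecord(ready, a);         // d_buf is valid once 'ready' fires
        cudaStreamWaitEvent(b, ready, 0);  // b waits on the GPU; the CPU is not blocked
        consume<<<1, 256, 0, b>>>(d_buf);

        cudaStreamSynchronize(b);
        return 0;
    }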

Use Cases

  • Double-buffering data pipelines: Stream 1 computes on batch N while Stream 2 copies batch N+1 to the GPU, eliminating idle time between batches (sketched after this list)
  • Multi-stage inference pipelines: Pre-processing, inference, and post-processing on separate streams to maximize overlap
  • Multi-model serving: Triton Inference Server uses streams to run independent model instances concurrently on the same GPU
  • Distributed training: NCCL collective operations (allreduce, allgather) are stream-ordered; training code overlaps compute with communication on separate streams
  • Graph execution: CUDA Graphs launch onto a specified stream; multiple graph instances run concurrently on different streams
  • HPC libraries: cuBLAS, cuFFT, cuDNN all accept stream arguments for asynchronous operation within application-managed pipelines
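
A minimal sketch of the double-buffering pattern from the first bullet, assuming all batches sit in one pinned host allocation and process is a placeholder kernel:

    #include <cuda_runtime.h>

    __global__ void process(float* buf, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] += 1.0f;
    }

    int main() {
        const int numBatches = 8, N = 1 << 20;
        const size_t bytes = N * sizeof(float);

        float* h_data;  // one pinned region holding all batches back to back
        cudaMallocHost(&h_data, numBatches * bytes);

        cudaStream_t s[2];
        float* d_buf[2];
        for (int i = 0; i < 2; ++i) { cudaStreamCreate(&s[i]); cudaMalloc(&d_buf[i], bytes); }

        // Prime the pipeline with batch 0, then ping-pong buffers and streams.
        cudaMemcpyAsync(d_buf[0], h_data, bytes, cudaMemcpyHostToDevice, s[0]);
        for (int i = 0; i < numBatches; ++i) {
            int cur = i % 2, nxt = (i + 1) % 2;
            if (i + 1 < numBatches)  // upload batch i+1 while batch i computes
                cudaMemcpyAsync(d_buf[nxt], h_data + (size_t)(i + 1) * N, bytes,
                                cudaMemcpyHostToDevice, s[nxt]);
            process<<<(N + 255) / 256, 256, 0, s[cur]>>>(d_buf[cur], N);
            cudaStreamSynchronize(s[cur]);  // batch i done; d_buf[cur] is reusable
        }
        return 0;
    }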

Hardware Requirements / Compatibility

  • Concurrent Kernel Execution: introduced with Fermi, robust on Maxwell+ GPUs; Volta+ with MPS for isolated concurrent processes (see the query sketch after this list)
  • Multiple Copy Engines: GPUs with two or more copy engines run H2D and D2H transfers simultaneously (PCIe in both directions); H100 has multiple dedicated copy engines, and Blackwell expands this further
  • Hyper-Q (Kepler+): Hardware work queues that allow many independent CPU threads to simultaneously submit work to a single GPU without serialization
  • Priority Streams: Kepler+ (CUDA 5.5+)
  • Stream Capture (CUDA Graphs): CUDA 10.0+
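
These capabilities can be checked at runtime through cudaDeviceProp; a minimal query sketch:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // device 0
        printf("concurrent kernels: %s\n", prop.concurrentKernels ? "yes" : "no");
        printf("async copy engines: %d\n", prop.asyncEngineCount);
        printf("stream priorities:  %s\n", prop.streamPrioritiesSupported ? "yes" : "no");
        return 0;
    }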

Language Bindings / APIs

  • CUDA C/C++: cudaStream_t, cudaStreamCreate, cudaMemcpyAsync, cudaStreamSynchronize, cudaEventRecord, cudaStreamWaitEvent
  • Python (CuPy): cupy.cuda.Stream(); CuPy operations launch on the current stream, selected with the with cupy.cuda.Stream() as s: context manager
  • Python (PyTorch): torch.cuda.Stream(); with torch.cuda.stream(s): context manager; torch.cuda.current_stream(), torch.cuda.synchronize()
  • CUDA Fortran: cudafor module provides stream types and async operations
  • cuBLAS/cuFFT/cuDNN: all expose stream setters (cublasSetStream, cufftSetStream, cudnnSetStream) for stream-ordered execution (example after this list)
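
A sketch of the library pattern with cuBLAS; the device matrices are assumed to be allocated and filled elsewhere:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Column-major device matrices: d_A (m x k), d_B (k x n), d_C (m x n).
    void gemm_on_stream(cublasHandle_t handle, cudaStream_t stream,
                        int m, int n, int k,
                        const float* d_A, const float* d_B, float* d_C) {
        cublasSetStream(handle, stream);  // bind subsequent calls to 'stream'
        const float alpha = 1.0f, beta = 0.0f;
        // Enqueues on 'stream' and returns immediately; d_C is valid
        // only after the stream drains (cudaStreamSynchronize or an event).
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, d_A, m, d_B, k, &beta, d_C, m);
    }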

Connections

  • CUDA-Graphs — CUDA Graphs are captured from streams and launched onto streams; streams are the underlying execution context for graph nodes (capture sketch after this list)
  • CUDA-Unified-Memory — cudaMemPrefetchAsync is a stream-ordered Unified Memory operation; migration timing is controlled via streams
  • Multi-Process-Service — MPS provides GPU context sharing; each client process has its own stream namespace but shares GPU hardware
  • NVCC — NVCC compiles CUDA C++ code including stream-based asynchronous patterns
  • CuPy — CuPy’s Stream API closely mirrors the CUDA C++ stream API for Python GPU programming
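
A sketch of stream capture feeding the CUDA-Graphs connection above; the two stage kernels and the replay count are illustrative, and the three-argument cudaGraphInstantiate assumes CUDA 12:

    #include <cuda_runtime.h>

    __global__ void stage1(float* x) { x[threadIdx.x] += 1.0f; }
    __global__ void stage2(float* x) { x[threadIdx.x] *= 0.5f; }

    int main() {
        float* d_x;
        cudaMalloc(&d_x, 256 * sizeof(float));
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Record the stream's work into a graph instead of executing it.
        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        stage1<<<1, 256, 0, stream>>>(d_x);
        stage2<<<1, 256, 0, stream>>>(d_x);
        cudaStreamEndCapture(stream, &graph);

        cudaGraphExec_t exec;
        cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12 signature
        for (int i = 0; i < 100; ++i)
            cudaGraphLaunch(exec, stream);      // low-overhead replay of both kernels
        cudaStreamSynchronize(stream);
        return 0;
    }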

Resources