NVSHMEM

Type: Technology Tags: CUDA, NVIDIA, GPU, PGAS, Distributed Computing, Multi-GPU, HPC, OpenSHMEM Related: NVSHMEM4Py, NCCL, NVIDIA-HPC-X, GPUDirect-RDMA, CUDA-Python, cuda-core, CuPy, PyTorch, cuFFT, cuFFTMp, cuBLAS, NVIDIA-Fortran-CUDA-Interfaces, AmgX Sources: NVIDIA official documentation, https://docs.nvidia.com/nvshmem/, https://docs.nvidia.com/nvshmem/api/api/language_bindings/python/index.html, https://docs.nvidia.com/nvshmem/api/api/language_bindings/python/interoperability.html Last Updated: 2026-04-29

Summary

NVSHMEM is NVIDIA’s implementation of the OpenSHMEM programming model for GPU clusters, providing a Partitioned Global Address Space (PGAS) interface that enables direct GPU-to-GPU communication without CPU orchestration. It supports GPU-initiated one-sided communication (puts/gets), fine-grained asynchronous data transfers, and interoperability with MPI. NVSHMEM4Py provides the official Python language binding for NVSHMEM.

Detail

Purpose

Traditional multi-GPU programming models (MPI, NCCL) require CPU involvement to orchestrate communication. NVSHMEM enables GPUs to directly access memory on remote GPUs using one-sided put/get operations initiated from CUDA kernels, eliminating CPU synchronization bottlenecks and enabling fine-grained, irregular communication patterns that are difficult to express with collective operations.
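
A minimal GPU-initiated sketch of this model, following the ring-shift pattern NVIDIA's documentation uses (each PE writes its rank into the symmetric buffer of the next PE, directly from a CUDA kernel). Assumes an NVSHMEM installation; it would be compiled with nvcc with relocatable device code and linked against NVSHMEM, then launched with a multi-process launcher such as nvshmrun or mpirun:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Each PE writes its rank into the symmetric buffer on the next PE
// using a one-sided put, initiated from inside the kernel.
__global__ void ring_shift(int *dest) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;
    nvshmem_int_p(dest, mype, peer);  // single-element one-sided put
}

int main(void) {
    nvshmem_init();
    // Bind each PE on this node to its own GPU.
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    // Symmetric allocation: a same-size buffer exists on every PE.
    int *dest = (int *) nvshmem_malloc(sizeof(int));

    ring_shift<<<1, 1>>>(dest);
    nvshmemx_barrier_all_on_stream(0);  // order the put before the read
    cudaDeviceSynchronize();

    int value;
    cudaMemcpy(&value, dest, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d received %d\n", nvshmem_my_pe(), value);

    nvshmem_free(dest);
    nvshmem_finalize();
    return 0;
}
```

No CPU thread orchestrates the transfer itself: the put is issued by GPU code, and the host only synchronizes at the end.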

Key Features

  • PGAS (Partitioned Global Address Space): unified address space across all GPUs in a cluster
  • GPU-initiated one-sided communication: put, get, atomic operations from CUDA kernels
  • CPU-initiated and CUDA stream-based operation modes
  • Low-overhead one-sided primitives (initiating process specifies all transfer details)
  • Asynchronous design: overlaps computation and communication
  • MPI interoperability for hybrid programming models
  • Compatible with OpenSHMEM implementations
  • Multi-architecture support: x86 and Arm processors
  • NVSHMEM4Py: official Python language binding with host APIs, Python device APIs, and CuPy/PyTorch interoperability.
  • Used by cuFFTMp for multi-node FFT communication
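
The CPU-initiated, stream-based mode listed above can be sketched as follows: host code enqueues an NVSHMEM put onto a CUDA stream, so the transfer is ordered after prior kernels on that stream without blocking the host. A hedged sketch assuming an NVSHMEM environment (buffer sizes and the peer choice are illustrative):

```cuda
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    nvshmem_init();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    const size_t n = 1 << 20;
    float *src;                                         // local device source
    cudaMalloc(&src, n * sizeof(float));
    float *dest = (float *) nvshmem_malloc(n * sizeof(float));  // symmetric

    int peer = (nvshmem_my_pe() + 1) % nvshmem_n_pes();

    // ... enqueue local compute kernels on `stream` here ...

    // Stream-ordered put: runs after prior work on `stream`,
    // overlapping with other host activity.
    nvshmemx_float_put_on_stream(dest, src, n, peer, stream);
    nvshmemx_barrier_all_on_stream(stream);
    cudaStreamSynchronize(stream);

    cudaFree(src);
    nvshmem_free(dest);
    nvshmem_finalize();
    return 0;
}
```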

Use Cases

  • Quantum chromodynamics (QCD) simulations via QUDA library
  • Spatial-parallel deep learning convolutions
  • Irregular communication patterns in sparse solvers
  • HPC applications with fine-grained neighbor communication
  • Molecular dynamics with non-collective communication patterns
  • Graph analytics with irregular data access

Hardware Requirements

  • NVIDIA GPU with CUDA support
  • NVLink for intra-node GPU communication
  • InfiniBand or RoCE for inter-node GPU communication
  • GPU cluster environment required for multi-node use

Language Bindings

  • C/C++ — native host APIs and device (CUDA kernel) APIs
  • Python — NVSHMEM4Py, the official Python binding (see Connections)
  • Fortran — covered via the NVIDIA Fortran CUDA interfaces

Connections

  • NCCL — NCCL handles collective operations; NVSHMEM handles fine-grained point-to-point PGAS communication
  • NVSHMEM4Py — official Python binding for NVSHMEM, including symmetric memory and Python GPU library interoperability.
  • CUDA-Python and cuda-core — NVSHMEM4Py participates in the current Python CUDA stack and uses cuda-core concepts for device/stream setup.
  • NVIDIA-HPC-X — HPC-X bundles OpenSHMEM/PGAS communication libraries complementary to NVSHMEM’s GPU-cluster programming model
  • GPUDirect-RDMA — NVSHMEM relies on GPU-aware RDMA paths for inter-node GPU communication
  • cuFFT and cuFFTMp — cuFFTMp uses NVSHMEM for multi-node distributed FFT communication
  • NVIDIA-Fortran-CUDA-Interfaces — current Fortran CUDA interfaces include NVSHMEM module/interface coverage
  • cuBLAS — cuBLASMp multi-node linear algebra uses NVSHMEM for data movement
  • AmgX — AmgX multi-node solvers use NVSHMEM-like communication patterns
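
The MPI interoperability noted under Key Features is typically achieved by bootstrapping NVSHMEM on an existing MPI communicator, so each MPI rank becomes an NVSHMEM PE and the two models can be mixed in one application. A hedged sketch of that initialization pattern, assuming an MPI-enabled NVSHMEM build:

```cuda
#include <mpi.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;

    // Bootstrap NVSHMEM on the existing MPI communicator:
    // each MPI rank becomes an NVSHMEM PE.
    nvshmemx_init_attr_t attr = NVSHMEMX_INIT_ATTR_INITIALIZER;
    attr.mpi_comm = &comm;
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);

    // ... MPI collectives and NVSHMEM puts/gets can now be mixed ...

    nvshmem_finalize();
    MPI_Finalize();
    return 0;
}
```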

Resources