NVSHMEM
Type: Technology
Tags: CUDA, NVIDIA, GPU, PGAS, Distributed Computing, Multi-GPU, HPC, OpenSHMEM
Related: NVSHMEM4Py, NCCL, NVIDIA-HPC-X, GPUDirect-RDMA, CUDA-Python, cuda-core, CuPy, PyTorch, cuFFT, cuFFTMp, cuBLAS, NVIDIA-Fortran-CUDA-Interfaces, AmgX
Sources: NVIDIA official documentation, https://docs.nvidia.com/nvshmem/, https://docs.nvidia.com/nvshmem/api/api/language_bindings/python/index.html, https://docs.nvidia.com/nvshmem/api/api/language_bindings/python/interoperability.html
Last Updated: 2026-04-29
Summary
NVSHMEM is NVIDIA’s implementation of the OpenSHMEM programming model for GPU clusters, providing a Partitioned Global Address Space (PGAS) interface that enables direct GPU-to-GPU communication without CPU orchestration. It supports GPU-initiated one-sided communication (puts/gets), fine-grained asynchronous data transfers, and interoperability with MPI. NVSHMEM4Py provides the official Python language binding for NVSHMEM.
Detail
Purpose
Traditional multi-GPU programming models (MPI, NCCL) require CPU involvement to orchestrate communication. NVSHMEM enables GPUs to directly access memory on remote GPUs using one-sided put/get operations initiated from CUDA kernels, eliminating CPU synchronization bottlenecks and enabling fine-grained, irregular communication patterns that are difficult to express with collective operations.
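As a minimal sketch of what GPU-initiated communication looks like in the C/CUDA API (a one-directional ring put between PEs; the buffer size, launch geometry, and peer choice are illustrative, not from any particular application):

```c
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Each thread writes one element directly into the destination PE's symmetric
// buffer; the CPU is not involved once the kernel is running.
__global__ void put_to_peer(float *dest, const float *src, int nelems, int peer) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nelems)
        nvshmem_float_p(&dest[i], src[i], peer);  // device-initiated one-sided put
}

int main(void) {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    cudaSetDevice(nvshmemx_team_my_pe(NVSHMEMX_TEAM_NODE));  // one GPU per PE

    const int n = 1024;
    // Symmetric allocation: a buffer of this size exists at every PE.
    float *dest = (float *)nvshmem_malloc(n * sizeof(float));
    float *src;
    cudaMalloc(&src, n * sizeof(float));  // local-only source (left uninitialized here)

    int peer = (mype + 1) % npes;  // simple ring pattern for illustration
    put_to_peer<<<(n + 255) / 256, 256>>>(dest, src, n, peer);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();  // ensure all puts landed before any PE reads dest

    cudaFree(src);
    nvshmem_free(dest);
    nvshmem_finalize();
    return 0;
}
```

A program like this is built with nvcc using relocatable device code, linked against the NVSHMEM libraries, and launched one process per GPU with a PE launcher such as nvshmrun or mpirun.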
Key Features
- PGAS (Partitioned Global Address Space): a global address space spanning the memory of all GPUs in a job, built from symmetric allocations on every PE
- GPU-initiated one-sided communication: put, get, atomic operations from CUDA kernels
- CPU-initiated and CUDA stream-based operation modes (see the host-side sketch after this list)
- Low-overhead one-sided primitives (initiating process specifies all transfer details)
- Asynchronous design: overlaps computation and communication
- MPI interoperability for hybrid programming models
- API based on the OpenSHMEM specification, easing ports of existing SHMEM code
- Multi-architecture support: x86 and Arm processors
- NVSHMEM4Py: official Python language binding with host APIs, Python device APIs, and CuPy/PyTorch interoperability
- Used by cuFFTMp for multi-node FFT communication
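A hedged sketch of the CPU-initiated, stream-ordered mode and the MPI bootstrap mentioned above, following the documented `nvshmemx_init_attr` pattern (initializer details vary by NVSHMEM version; error handling and data initialization are omitted, and the buffer size is arbitrary):

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(int argc, char **argv) {
    // Bootstrap NVSHMEM on top of an existing MPI communicator.
    MPI_Init(&argc, &argv);
    nvshmemx_init_attr_t attr;
    MPI_Comm comm = MPI_COMM_WORLD;
    attr.mpi_comm = &comm;
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);

    cudaSetDevice(nvshmemx_team_my_pe(NVSHMEMX_TEAM_NODE));
    int peer = (nvshmem_my_pe() + 1) % nvshmem_n_pes();

    const size_t n = 1 << 20;
    float *src = (float *)nvshmem_malloc(n * sizeof(float));
    float *dst = (float *)nvshmem_malloc(n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // CPU-initiated, stream-ordered put: enqueued like any other CUDA work,
    // so it can overlap with kernels running on other streams.
    nvshmemx_float_put_on_stream(dst, src, n, peer, stream);
    nvshmemx_barrier_all_on_stream(stream);  // stream-ordered synchronization
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    nvshmem_free(dst);
    nvshmem_free(src);
    nvshmem_finalize();
    MPI_Finalize();
    return 0;
}
```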
Use Cases
- Quantum chromodynamics (QCD) simulations via QUDA library
- Spatial-parallel deep learning convolutions
- Irregular communication patterns in sparse solvers
- HPC applications with fine-grained neighbor communication (see the halo-exchange sketch after this list)
- Molecular dynamics with non-collective communication patterns
- Graph analytics with irregular data access
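To make the fine-grained neighbor communication concrete, here is a generic 1-D halo-exchange kernel sketch (not taken from QUDA or any application listed above; it assumes `u` was allocated with `nvshmem_malloc` and holds one halo cell on each side of `n` interior cells):

```c
#include <nvshmem.h>
#include <nvshmemx.h>

// 1-D decomposition: each PE owns n interior cells in a symmetric buffer
// u[0 .. n+1], where u[0] and u[n+1] are halo slots for the neighbors' data.
__global__ void exchange_halos(float *u, int n, int mype, int npes) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        int left  = (mype - 1 + npes) % npes;
        int right = (mype + 1) % npes;
        // Push my first interior cell into the left neighbor's right halo slot,
        // and my last interior cell into the right neighbor's left halo slot.
        nvshmem_float_p(&u[n + 1], u[1], left);
        nvshmem_float_p(&u[0], u[n], right);
        // Complete this PE's outgoing puts; a barrier (host-side or via a
        // collective launch) is still needed before halo cells can be read.
        nvshmem_quiet();
    }
}
```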
Hardware Requirements
- NVIDIA GPU with CUDA support
- NVLink or PCIe peer-to-peer for intra-node GPU communication
- InfiniBand or RoCE for inter-node GPU communication
- Multi-node use requires a GPU cluster with one of the above interconnects
Language Bindings
- C and C++ (primary API)
- Python via NVSHMEM4Py
Connections
- NCCL — NCCL handles collective operations; NVSHMEM handles fine-grained point-to-point PGAS communication
- NVSHMEM4Py — official Python binding for NVSHMEM, including symmetric memory management and interoperability with Python GPU libraries
- CUDA-Python and cuda-core — NVSHMEM4Py is part of the current Python CUDA stack and uses cuda-core abstractions for device and stream setup
- NVIDIA-HPC-X — HPC-X bundles SHMEM/PGAS communication libraries that address the same GPU-cluster programming space as NVSHMEM
- GPUDirect-RDMA — NVSHMEM relies on GPU-aware RDMA paths for inter-node GPU communication
- cuFFT and cuFFTMp — cuFFTMp uses NVSHMEM for multi-node distributed FFT communication
- NVIDIA-Fortran-CUDA-Interfaces — current Fortran CUDA interfaces include NVSHMEM module/interface coverage
- cuBLAS — cuBLASMp multi-node linear algebra uses NVSHMEM for data movement
- AmgX — AmgX multi-node solvers use NVSHMEM-like communication patterns