GPUDirect RDMA
Type: Technology Tags: NVIDIA, RDMA, networking, InfiniBand, GPU, peer-to-peer, MPI, HPC, inter-node, zero-copy Related: NVLink, NVIDIA-ConnectX-InfiniBand, NVIDIA-ConnectX-9, NVIDIA-BlueField-DPU, NVIDIA-DOCA, NVIDIA-DOCA-OFED, DOCA-GPUNetIO, DOCA-RDMA, NVIDIA-MLNX-OFED, NVIDIA-Network-Operator, NVIDIA-HPC-X, NCCL, NVSHMEM, GPU-Direct-Storage Sources: NVIDIA official documentation (live fetch attempted 2026-04-10; updated from https://www.nvidia.com/en-us/networking/products/ethernet-adapters/connectx-9-supernic/) Last Updated: 2026-04-29
Summary
GPUDirect RDMA (Remote Direct Memory Access) is an NVIDIA technology that enables network adapters (InfiniBand HCAs or RDMA-capable NICs) to directly read from and write to GPU memory over a network — bypassing the CPU and system DRAM entirely. In distributed GPU computing, this eliminates the double-copy bottleneck (GPU→CPU→NIC for send; NIC→CPU→GPU for receive), enabling inter-node GPU memory transfers at near-theoretical network bandwidth with minimal CPU involvement. GPUDirect RDMA is a foundational technology for large-scale distributed AI training and HPC applications.
Detail
Purpose
In traditional GPU cluster programming (without GPUDirect RDMA), sending data from one node’s GPU to another’s follows a slow path: GPU→cudaMemcpy to CPU pinned memory→NIC DMA→network→remote NIC DMA→cudaMemcpy from CPU→GPU. Each hop adds latency and consumes CPU cycles and memory bandwidth. In large-scale LLM training, inter-node gradient synchronization (NCCL AllReduce) is the primary bottleneck after GPU compute. GPUDirect RDMA collapses this to: GPU memory→NIC DMA→network→remote NIC DMA→GPU memory — halving latency and freeing the CPU for other work.
Key Features
- Direct GPU-NIC Memory Access: The NIC’s PCIe master engine can directly DMA data to/from GPU frame buffer memory without CPU staging; uses PCIe peer-to-peer addressing enabled by GPU BAR1 memory window
- Zero-Copy Design: Data moves GPU→NIC or NIC→GPU without ever touching CPU DRAM; CPU is not in the data path
- Microsecond-Scale Latency: Combined with an InfiniBand network, GPU-to-GPU RDMA latency across nodes is roughly 1–2 µs for small messages
- Bandwidth Efficiency: Full InfiniBand NDR 400 Gb/s bandwidth utilizable for GPU data without CPU or DRAM bottleneck
- NCCL Integration: NCCL 2.x+ automatically uses GPUDirect RDMA when available; no application code changes needed for distributed training
- MPI Integration: OpenMPI and MVAPICH2 use CUDA-Aware MPI, which internally leverages GPUDirect RDMA for `MPI_Send`/`MPI_Recv` of CUDA buffers
- GPUDirect Async: Extension enabling the GPU to independently initiate RDMA transfers without CPU involvement; a GPU kernel writes to a doorbell register, triggering NIC RDMA directly — true GPU-driven networking
- NVSHMEM: NVSHMEM extends GPUDirect RDMA with a PGAS (Partitioned Global Address Space) model — GPU kernels directly call `nvshmem_put`/`nvshmem_get` to address remote GPU memory by virtual address
Use Cases
- Large-Scale LLM Training: 1000+ GPU runs with NCCL AllReduce over InfiniBand NDR/HDR; GPUDirect RDMA is the technology enabling efficient inter-node gradient synchronization
- HPC MPI Applications: Molecular dynamics (GROMACS, AMBER), climate modeling (FV3, MPAS) using CUDA-Aware MPI with GPUDirect RDMA for collective operations
- High-Frequency Inference Disaggregation: KV cache transfer in prefill/decode disaggregated LLM serving (NIXL uses GPUDirect RDMA for fast token migration)
- Distributed GPU Data Processing: Dask-CUDA and cuDF with UCXX for distributed GPU DataFrames across cluster nodes
- Real-Time HPC: FPGA-GPU communication for radar signal processing; NIC-GPU for financial analytics
Hardware Requirements / Compatibility
- GPU: NVIDIA Kepler (cc 3.0) and newer with GPUDirect RDMA support; the BAR1 memory window must be accessible to PCIe peers
- NIC/HCA: NVIDIA ConnectX-4 or newer for InfiniBand; RoCE (RDMA over Converged Ethernet) is also supported on Mellanox/NVIDIA Ethernet adapters
- PCIe Topology: NIC and GPU must be on the same PCIe root complex (or connected via PCIe peer-to-peer bridge); some server BIOS settings disable PCIe P2P — must be enabled
- Linux: the `nvidia-peermem` kernel module (part of the NVIDIA driver) enables GPU-to-NIC peer memory mapping
- CUDA: CUDA 5.0+ for GPUDirect RDMA; CUDA 11.3+ for GPUDirect Async
- InfiniBand: NVIDIA-DOCA-OFED or legacy NVIDIA-MLNX-OFED with RDMA support; Kubernetes exposure can be automated through NVIDIA-Network-Operator
Language Bindings / APIs
- NCCL: Transparent — NCCL uses GPUDirect RDMA automatically; `ncclAllReduce(sendBuf, recvBuf, count, ..., comm, stream)`
- CUDA-Aware MPI: `MPI_Send(cuda_ptr, count, ...)` — OpenMPI 1.7+ or MVAPICH2 2.0+ with CUDA support; the GPU pointer is passed directly
- NVSHMEM: `nvshmem_float_put(dest, src, nelems, pe)` — GPU-callable RDMA; symmetric heap on all PEs
- UCX: Unified Communication X framework (`ucx_perf`, Python bindings) used by many applications as an RDMA transport
- libibverbs / rdmacm: low-level RDMA programming (InfiniBand verbs) with GPU memory registered via `ibv_reg_mr` on CUDA device memory
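At the libibverbs level, the registration step looks roughly like the sketch below. It assumes a CUDA device, an RDMA-capable NIC, and `nvidia-peermem` loaded; `register_gpu_buffer` is a made-up helper name, error handling is trimmed, and it will not build without the CUDA and rdma-core development packages, so treat it as a sketch rather than a drop-in implementation.

```c
#include <infiniband/verbs.h>
#include <cuda_runtime.h>

/* Register a freshly allocated GPU buffer with an RDMA protection domain.
 * Returns 0 on success; on success *mr_out holds the memory region whose
 * lkey/rkey can be used in RDMA work requests targeting GPU memory. */
int register_gpu_buffer(struct ibv_pd *pd, size_t bytes, struct ibv_mr **mr_out) {
    void *gpu_ptr = NULL;
    if (cudaMalloc(&gpu_ptr, bytes) != cudaSuccess)
        return -1;
    /* With nvidia-peermem loaded, ibv_reg_mr accepts the device pointer
     * directly: the NIC will DMA to/from GPU memory through the BAR1 window,
     * with no host bounce buffer involved. */
    *mr_out = ibv_reg_mr(pd, gpu_ptr, bytes,
                         IBV_ACCESS_LOCAL_WRITE |
                         IBV_ACCESS_REMOTE_READ |
                         IBV_ACCESS_REMOTE_WRITE);
    if (*mr_out == NULL) {
        cudaFree(gpu_ptr);
        return -1;
    }
    return 0;
}
```

Everything above this layer (NCCL, CUDA-Aware MPI, NVSHMEM, UCX) ultimately performs an equivalent registration under the hood; the higher-level libraries just hide the verbs plumbing.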
Connections
- NVLink — NVLink handles intra-node GPU-to-GPU communication; GPUDirect RDMA handles inter-node (between servers); both used together in large GPU clusters
- NVIDIA-ConnectX-InfiniBand — ConnectX HCAs are the primary network adapters implementing GPUDirect RDMA in NVIDIA’s networking stack
- NVIDIA-ConnectX-9 — current SuperNIC generation; NVIDIA documentation lists GPUDirect and RDMA among its capabilities
- NVIDIA-BlueField-DPU — BlueField DPU offloads RDMA network processing; BlueField-3 supports GPUDirect RDMA with DPU-accelerated NCCL flows
- NVIDIA-DOCA — DOCA RDMA, DOCA-OFED, and GPUDirect support are part of the current NVIDIA networking software stack.
- NVIDIA-DOCA-OFED — current Linux host-driver stack for GPU-aware RDMA paths.
- DOCA-GPUNetIO — DOCA library that makes GPU packet processing and GPU-controlled networking explicit.
- DOCA-RDMA — DOCA API for building RDMA applications on NVIDIA networking devices.
- NVIDIA-MLNX-OFED — legacy standalone Linux OFED stack still found in older GPUDirect RDMA setups.
- NVIDIA-Network-Operator — deploys the Kubernetes networking components required for GPUDirect RDMA workloads.
- NVIDIA-HPC-X — provides MPI/SHMEM/UCX/UCC communication libraries that use GPU-aware RDMA paths.
- NCCL — NCCL is the primary user of GPUDirect RDMA for distributed deep learning; provides AllReduce, AllGather over RDMA automatically
- NVSHMEM — NVSHMEM extends GPUDirect RDMA into a PGAS programming model for GPU-to-GPU remote memory access from within GPU kernels
- GPU-Direct-Storage — GPUDirect Storage extends the P2P concept to NVMe storage; GPU reads/writes storage without CPU staging