NCCL
Type: Technology Tags: CUDA, NVIDIA, GPU, Collective Communications, Distributed Training, Multi-GPU, HPC, RDMA Related: cuDNN, TensorRT, NVSHMEM, NVIDIA-HPC-X, NVIDIA-DOCA-OFED, DOCA-GPUNetIO, DOCA-RDMA, NVIDIA-MLNX-OFED, GPUDirect-RDMA, NVIDIA-Spectrum-X, NVIDIA-Quantum-InfiniBand, cuBLAS, Thrust Sources: NVIDIA official documentation Last Updated: 2026-04-09
Summary
NCCL (NVIDIA Collective Communications Library) is NVIDIA’s library implementing high-performance multi-GPU and multi-node collective communication primitives (all-reduce, all-gather, broadcast, reduce, reduce-scatter) optimized for NVIDIA GPU interconnects. It automatically detects system topology (PCIe, NVLink, NVSwitch, InfiniBand, RoCE) and selects optimal ring/tree algorithms, making it the standard communication backend for distributed deep learning training in PyTorch, TensorFlow, JAX, and other frameworks.
Detail
Purpose
Distributed deep learning requires frequent synchronization of model gradients and activations across GPUs (all-reduce for data parallelism, all-gather for tensor parallelism). Doing this efficiently requires knowledge of the physical interconnect topology. NCCL abstracts this complexity, automatically optimizing communication patterns for any hardware configuration while providing a simple C API.
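A minimal sketch of that API in use, assuming a single process that drives every visible GPU; the fixed eight-slot arrays, the buffer size, and the omitted error handling are simplifications for illustration, not a definitive implementation:

```c
#include <cuda_runtime.h>
#include <nccl.h>

// Single-process, multi-GPU all-reduce sketch (error handling omitted).
int main(void) {
  int nDev = 0;
  cudaGetDeviceCount(&nDev);
  if (nDev > 8) nDev = 8;              // fixed-size arrays below

  ncclComm_t comms[8];
  int devs[8];
  float* sendbuf[8];
  float* recvbuf[8];
  cudaStream_t streams[8];
  const size_t count = 1 << 20;        // illustrative element count per GPU

  for (int i = 0; i < nDev; ++i) {
    devs[i] = i;
    cudaSetDevice(i);
    cudaMalloc((void**)&sendbuf[i], count * sizeof(float));
    cudaMalloc((void**)&recvbuf[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  // One communicator per GPU, created in a single call for this process.
  ncclCommInitAll(comms, nDev, devs);

  // Grouping lets NCCL treat the per-GPU calls as one collective launch.
  ncclGroupStart();
  for (int i = 0; i < nDev; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
  }
  for (int i = 0; i < nDev; ++i) ncclCommDestroy(comms[i]);
  return 0;
}
```

Topology detection and algorithm selection happen inside communicator initialization and the collective call; the caller never specifies rings, trees, or transports.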
Key Features
- Collective operations: all-reduce, all-gather, broadcast, reduce, reduce-scatter
- Point-to-point send/receive capabilities (see the send/receive sketch after this list)
- Device APIs: communication directly from CUDA kernels for lower latency
- Automatic topology detection: PCIe, NVLink, NVSwitch, InfiniBand, RoCE
- Advanced graph algorithms for optimal ring and tree construction
- Plugin framework for custom network transports
- Integration with NVIDIA-HPC-X via the NCCL-RDMA-SHARP plugin and Spectrum-X NCCL plugin guidance
- Multi-node networking depends on a working host driver stack (current NVIDIA-DOCA-OFED or legacy NVIDIA-MLNX-OFED), plus GPUDirect-RDMA when GPU memory is registered directly with the NIC
- Support for multi-threaded, multi-process, and MPI-driven applications
- Built-in diagnostics and profiling: the NCCL RAS (Reliability, Availability, and Serviceability) subsystem and NCCL Inspector
- Simple C API following MPI conventions
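As noted in the point-to-point item above, sends and receives are issued as grouped calls on an existing communicator. A small sketch of a ring neighbour exchange follows; ring_exchange and its arguments are hypothetical names, and error handling is omitted:

```c
#include <cuda_runtime.h>
#include <nccl.h>

// Hypothetical helper: exchange equal-sized buffers with ring neighbours.
// Assumes one rank per GPU on an already-initialized communicator.
void ring_exchange(ncclComm_t comm, cudaStream_t stream,
                   const float* sendbuf, float* recvbuf, size_t count) {
  int rank = 0, nranks = 0;
  ncclCommUserRank(comm, &rank);
  ncclCommCount(comm, &nranks);

  int next = (rank + 1) % nranks;
  int prev = (rank + nranks - 1) % nranks;

  // Send and receive must sit in the same group so NCCL can progress
  // them together and the exchange cannot deadlock on call ordering.
  ncclGroupStart();
  ncclSend(sendbuf, count, ncclFloat, next, comm, stream);
  ncclRecv(recvbuf, count, ncclFloat, prev, comm, stream);
  ncclGroupEnd();
}
```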
Use Cases
- Distributed data-parallel training (all-reduce for gradient synchronization)
- Tensor-parallel LLM training (all-gather for weight gathering; see the reduce-scatter/all-gather sketch after this list)
- Pipeline-parallel training communication
- Distributed inference serving with tensor parallelism
- HPC multi-GPU scientific simulations
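As referenced in the list above, here is a hedged sketch of how reduce-scatter and all-gather pair up in a weight-sharded training step; sharded_step, its buffer names, and the assumption that the element count divides evenly across ranks are illustrative, not taken from any framework's implementation:

```c
#include <cuda_runtime.h>
#include <nccl.h>

// Sketch of a sharded training step: each rank keeps 1/nranks of the
// parameters, reduces gradients into its shard, then regathers the
// full parameter buffer. Error handling and the optimizer are omitted.
void sharded_step(ncclComm_t comm, cudaStream_t stream,
                  const float* grad, float* grad_shard,
                  float* param_shard, float* param,
                  size_t total, int nranks) {
  size_t shard = total / nranks;   // assumes total divides evenly

  // Every rank receives the summed gradients for its own shard only.
  ncclReduceScatter(grad, grad_shard, shard, ncclFloat, ncclSum,
                    comm, stream);

  // (An optimizer update of param_shard from grad_shard would run here.)

  // Reassemble the full, updated parameter buffer on every rank.
  ncclAllGather(param_shard, param, shard, ncclFloat, comm, stream);
}
```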
Hardware Requirements
- NVIDIA GPU with CUDA support
- NVLink for intra-node GPU communication (V100, A100, H100, B100)
- NVSwitch for full all-to-all intra-node bandwidth (DGX systems)
- InfiniBand or RoCE for inter-node communication
- Supports all NVIDIA data center GPUs
Language Bindings
- C (primary API, MPI-convention compatible; see the MPI-based initialization sketch below)
- Python (via PyTorch distributed, Horovod, other frameworks)
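For the multi-process, MPI-driven case, the usual bootstrap pattern is: rank 0 creates a unique NCCL id, MPI broadcasts it, and every rank joins the communicator with ncclCommInitRank. The simple rank-to-GPU mapping below is an assumption, and error handling is omitted:

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <nccl.h>

// One MPI rank per GPU, joined into a single NCCL communicator.
int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);

  int rank = 0, nranks = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // Simplifying assumption: ranks map onto local GPUs round-robin.
  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  cudaSetDevice(rank % ndev);

  // Rank 0 creates the id; MPI distributes it to every other rank.
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);

  // ... collectives on `comm` go here ...

  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}
```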
Connections
- cuDNN — deep learning frameworks use both cuDNN (compute) and NCCL (communication)
- TensorRT — multi-GPU TensorRT inference uses NCCL for tensor parallelism
- NVSHMEM — NVSHMEM is NCCL’s complement: NCCL for collective ops, NVSHMEM for fine-grained PGAS communication
- NVIDIA-HPC-X — HPC-X includes NCCL-RDMA-SHARP and Spectrum-X NCCL plugin material for NVIDIA fabrics
- NVIDIA-DOCA-OFED — current Linux host networking stack for running NCCL over InfiniBand or RoCE
- DOCA-GPUNetIO — current DOCA documentation describes an NCCL GIN transport enabled through GPUNetIO GPU communications
- DOCA-RDMA — DOCA RDMA is the adjacent RDMA programming interface within the NVIDIA DOCA SDK
- NVIDIA-MLNX-OFED — legacy standalone Linux OFED stack for older NCCL/RDMA deployments
- GPUDirect-RDMA — direct GPU-to-NIC data path used by NCCL for inter-node collectives
- NVIDIA-Spectrum-X — Spectrum-X is NVIDIA’s Ethernet AI networking path for NCCL over RoCE environments
- NVIDIA-Quantum-InfiniBand — Quantum InfiniBand provides SHARP and low-latency fabric support for NCCL collectives
- cuBLAS — cuBLAS handles per-GPU compute; NCCL handles the synchronization between GPUs