cuFFTMp

Type: Library Tags: NVIDIA, CUDA, cuFFT, cuFFTMp, FFT, multi-GPU, multi-node, NVSHMEM, MPI, HPC Related: cuFFT, cuFFTDx, NVSHMEM, NVIDIA-HPC-SDK, NVIDIA-HPC-Compilers, NCCL, NVIDIA-HPC-X, GPUDirect-RDMA, nvmath-python, NVIDIA-CUDA Sources: https://docs.nvidia.com/cuda/cufftmp/index.html, https://docs.nvidia.com/hpc-sdk/index.html Last Updated: 2026-04-29

Summary

cuFFTMp is the cuFFT multi-process library for distributed FFTs across multiple GPUs and multiple nodes. It supports 2D and 3D multi-GPU, multi-node FFTs, slab and pencil decompositions, arbitrary block sizes, an MPI-compatible interface, and a low-latency implementation based on NVSHMEM. NVIDIA distributes cuFFTMp through NVIDIA-HPC-SDK and NVIDIA Developer Zone.

Detail

Purpose

Large FFT workloads in weather, seismic, molecular dynamics, computational physics, and other HPC domains can exceed a single GPU or single node. cuFFTMp extends the cuFFT family into distributed-memory systems so applications can execute 2D/3D FFTs with data decomposed across processes and GPUs.

Current scope

  • 2D and 3D multi-GPU, multi-node FFTs.
  • Slab (1D) and pencil (2D) data decomposition with arbitrary block sizes.
  • MPI-compatible interface for integration into HPC applications.
  • NVSHMEM-based low-latency communication optimized for single-node and multi-node FFTs.
  • x86_64 and aarch64 support.
  • Documentation for quick start, usage requirements, API usage, versioning, performance considerations, memory requirements, bootstrapping, NVSHMEM integration, and API references.

NVIDIA context

cuFFTMp is distinct from cuFFT and cuFFTDx: cuFFT is the main host-side FFT library, cuFFTDx is the device-side FFT library for in-kernel fusion, and cuFFTMp is the distributed multi-process FFT path for HPC systems.

Connections

  • cuFFT - parent FFT library family.
  • cuFFTDx - device-side FFT sibling for fused kernels.
  • NVSHMEM - low-latency communication layer used by cuFFTMp.
  • NVIDIA-HPC-SDK - distribution and documentation hub.
  • NVIDIA-HPC-Compilers - compiler context for HPC applications that link distributed FFT code.
  • NCCL, NVIDIA-HPC-X, and GPUDirect-RDMA - adjacent communication/fabric stack for multi-GPU and multi-node HPC.
  • nvmath-python - Python math library that documents cuFFTMp as part of FFT support.
  • NVIDIA-CUDA - core CUDA platform underneath the FFT library stack.

Resources