cuBLASMp
Type: Technology Tags: NVIDIA, CUDA, cuBLAS, distributed linear algebra, dense matrices, multi-process, HPC Related: cuBLAS, NCCL, NVSHMEM, NVPL, NVIDIA-DGX-SuperPOD, NVIDIA-CUDA Sources: https://docs.nvidia.com/cuda/cublasmp/index.html Last Updated: 2026-04-29
Summary
cuBLASMp is NVIDIA’s multi-process, GPU-accelerated library for distributed dense linear algebra. It provides PBLAS-like C APIs and supports 2D block-cyclic data layouts used in distributed-memory numerical computing.
Detail
Purpose
Single-process, single-GPU BLAS libraries cannot handle dense problems that exceed one GPU's memory, nor tensor-parallel workloads that span many processes. cuBLASMp extends NVIDIA's dense linear algebra stack to distributed, multi-process settings for HPC and large-scale AI systems.
Key capabilities
- Distributed dense linear algebra over multiple processes and GPUs.
- 2D block-cyclic data layouts, the distribution scheme used by ScaLAPACK/PBLAS, for compatibility with existing distributed-memory codes.
- PBLAS-like C API surface for distributed BLAS-style operations.
- Availability through NVIDIA Developer Zone, NVIDIA HPC SDK, PyPI, and conda-forge.
- Documented use for tensor parallelism in distributed machine learning.
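To make the 2D block-cyclic layout above concrete, here is a minimal Python sketch of the standard ScaLAPACK-style owner computation: each global row/column index is mapped independently to a process-grid coordinate and a local index. This illustrates the general distribution scheme, not a cuBLASMp API; the function names are hypothetical.

```python
def owner_and_local(i, nb, p):
    """Map a 0-based global index i to (process coordinate, local index)
    in a 1D block-cyclic distribution with block size nb over p processes,
    assuming the distribution starts at process 0."""
    block = i // nb            # which block the index falls in
    proc = block % p           # blocks are dealt out round-robin
    local = (block // p) * nb + i % nb  # position within the process's local storage
    return proc, local

def owner_2d(i, j, mb, nb, prow, pcol):
    """2D block-cyclic: apply the 1D mapping independently to rows
    (block size mb over prow process rows) and columns (nb over pcol)."""
    pr, li = owner_and_local(i, mb, prow)
    pc, lj = owner_and_local(j, nb, pcol)
    return (pr, pc), (li, lj)
```

For example, with 2x2 blocks on a 2x2 process grid, global element (5, 3) lands on process (0, 1) at local position (3, 1): row 5 sits in row-block 2, which cycles back to process row 0, while column 3 sits in column-block 1 on process column 1.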
NVIDIA context
cuBLASMp connects CUDA-X math libraries with scale-out systems such as NVIDIA-DGX-SuperPOD, InfiniBand-backed clusters, and model-parallel AI training/inference workflows.
Connections
- cuBLAS - single-process dense BLAS foundation that cuBLASMp extends to distributed settings.
- cuSOLVERMp - companion distributed dense solver/eigensolver library.
- NCCL - collective-communication library that often appears in the same distributed GPU systems.
- NVSHMEM - PGAS-style programming model for GPU clusters with distributed memory.
- NVIDIA-DGX-SuperPOD - target class of scale-out NVIDIA GPU infrastructure.
Source Excerpts
- NVIDIA describes cuBLASMp as a high-performance multi-process GPU library for distributed dense linear algebra.