cuBLASMp

Type: Technology
Tags: NVIDIA, CUDA, cuBLAS, distributed linear algebra, dense matrices, multi-process, HPC
Related: cuBLAS, NCCL, NVSHMEM, NVPL, NVIDIA-DGX-SuperPOD, NVIDIA-CUDA
Sources: https://docs.nvidia.com/cuda/cublasmp/index.html
Last Updated: 2026-04-29

Summary

cuBLASMp is NVIDIA’s multi-process, GPU-accelerated library for distributed dense linear algebra. It provides PBLAS-like C APIs and supports 2D block-cyclic data layouts used in distributed-memory numerical computing.

Detail

Purpose

Single-GPU, single-process BLAS cannot handle dense problems that exceed one device's memory, nor tensor-parallel workloads whose matrices are sharded across processes. cuBLASMp extends NVIDIA dense linear algebra into distributed, multi-process settings for HPC and large-scale AI systems.
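To make the "sharded across processes" idea concrete, here is a minimal pure-Python sketch of a column-partitioned matrix product, the basic pattern behind column-parallel tensor parallelism. It is an illustration only: the "ranks" are simulated inside one Python process, the helper names (`matmul`, `split_columns`, `column_parallel_matmul`) are hypothetical, and nothing here is cuBLASMp API.

```python
def matmul(a, b):
    """Plain dense matrix product on lists of rows."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def split_columns(m, parts):
    """Split a matrix into `parts` contiguous column shards (sizes differ by at most 1)."""
    ncols = len(m[0])
    bounds = [ncols * p // parts for p in range(parts + 1)]
    return [[row[bounds[p]:bounds[p + 1]] for row in m] for p in range(parts)]


def column_parallel_matmul(a, b, parts):
    """Compute a @ b by giving each simulated rank one column shard of b."""
    shards = split_columns(b, parts)
    # Each partial product would run on its own rank/GPU in a real system.
    partials = [matmul(a, shard) for shard in shards]
    # "Gather" step: concatenate the partial results along the column axis.
    return [sum((partial[i] for partial in partials), []) for i in range(len(a))]


a = [[1, 2], [3, 4]]
b = [[5, 6, 7], [8, 9, 10]]
result = column_parallel_matmul(a, b, parts=2)
```

Each shard's product is independent, so the only communication in this pattern is the final gather. In a real deployment that step would be a collective over the network, not a list concatenation.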

Key capabilities

Distributed dense linear algebra over multiple processes and GPUs.
2D block-cyclic data layout compatibility.
PBLAS-like C API surface for distributed BLAS-style operations.
Availability through NVIDIA Developer Zone, NVIDIA HPC SDK, PyPI, and conda-forge.
Documented use for tensor parallelism in distributed machine learning.
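The 2D block-cyclic layout mentioned above can be sketched in a few lines. The snippet below is a minimal illustration of the standard index arithmetic (the same idea as ScaLAPACK's ownership and NUMROC helpers), assuming 0-based indices and a process grid whose first block lives on process 0. The function names are hypothetical and are not part of the cuBLASMp API.

```python
def owner_and_local_index(g, block, nprocs, src=0):
    """Map a 0-based global index to (owning process, 0-based local index)
    for a 1D block-cyclic distribution with the given block size."""
    blk = g // block                  # global block that holds index g
    proc = (src + blk) % nprocs       # blocks are dealt out round-robin
    local = (blk // nprocs) * block + g % block
    return proc, local


def local_count(n, block, nprocs, proc, src=0):
    """How many of the n global indices land on `proc`."""
    dist = (proc - src) % nprocs      # distance from the process holding block 0
    full_blocks = n // block
    count = (full_blocks // nprocs) * block
    extra = full_blocks % nprocs      # leftover full blocks after whole rounds
    if dist < extra:
        count += block                # gets one more full block
    elif dist == extra:
        count += n % block            # gets the trailing partial block, if any
    return count


def owner_2d(i, j, mb, nb, nprow, npcol):
    """Apply the 1D rule independently to rows and columns.
    Returns ((process row, process col), (local row, local col))."""
    prow, li = owner_and_local_index(i, mb, nprow)
    pcol, lj = owner_and_local_index(j, nb, npcol)
    return (prow, pcol), (li, lj)


# Example: element (5, 2) of a matrix split into 2x2 blocks on a 2x3 process grid.
coords, local = owner_2d(5, 2, mb=2, nb=2, nprow=2, npcol=3)
```

Because rows and columns are distributed independently, each process ends up with a dense local tile of `local_count(m, mb, nprow, prow)` by `local_count(n, nb, npcol, pcol)` elements, which is what lets a local BLAS operate on contiguous data.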

NVIDIA context

cuBLASMp connects CUDA-X math libraries with scale-out systems such as NVIDIA-DGX-SuperPOD, InfiniBand-backed clusters, and model-parallel AI training/inference workflows.

Connections

cuBLAS - single-process dense BLAS foundation that cuBLASMp extends to distributed settings.
cuSOLVERMp - companion distributed dense solver/eigensolver library.
NCCL - collective communication library that often appears in the same distributed GPU systems.
NVSHMEM - another GPU-cluster programming model for distributed memory systems.
NVIDIA-DGX-SuperPOD - target class of scale-out NVIDIA GPU infrastructure.

Source Excerpts

NVIDIA describes cuBLASMp as a high-performance multi-process GPU library for distributed dense linear algebra.
