cuBLAS

Type: Technology Tags: CUDA, NVIDIA, GPU, Linear Algebra, BLAS, Math, HPC, AI Related: cuSOLVER, cuBLASLt, cuBLASXt, cuBLASDx, cuBLASMp, cuSPARSE, Incomplete-LU-Cholesky, Floating-Point-and-IEEE-754, cuTENSOR, CUTLASS, nvmath-python, NVIDIA-CUDA Sources: NVIDIA official documentation, https://docs.nvidia.com/cuda/cublas/index.html Last Updated: 2026-04-29

Summary

cuBLAS is NVIDIA’s GPU-accelerated library implementing all 152 standard Basic Linear Algebra Subprograms (BLAS) routines for NVIDIA GPUs. It covers Level 1 (vector-vector), Level 2 (matrix-vector), and Level 3 (matrix-matrix) operations, with optimized support for Tensor Cores and mixed/low-precision arithmetic. It is the foundational building block for GPU-accelerated AI and HPC numerical computing.

Detail

Purpose

cuBLAS solves the problem of accelerating dense linear algebra workloads by offloading standard BLAS routines to the GPU. It allows developers to replace CPU BLAS calls with GPU-accelerated equivalents with minimal code changes, enabling dramatic speedups for matrix-heavy workloads in deep learning, scientific simulation, and financial modeling.

Key Features

Complete support for all 152 standard BLAS routines
Half-precision and integer matrix multiplication (GEMM) for AI workloads
GEMM operations optimized for Tensor Cores with kernel fusion support
Batched operations for processing many small matrices efficiently
Multi-GPU execution via cuBLASXt and cuBLASMp variants
Mixed- and low-precision computation support (FP64, FP32, FP16, INT8)
CUDA stream compatibility for concurrent, asynchronous operations
cuBLASLt: programmable GEMM API with descriptors, heuristics, and advanced tuning options
cuBLASXt: single-node, multi-GPU interface for Level 3 workloads
cuBLASMp: multi-node distributed linear algebra (preview)
cuBLASDx: device-side kernel API for use inside CUDA kernels (preview)

Use Cases

Training and inference of deep neural networks
High-performance computing (HPC) simulations
Computational fluid dynamics
Financial risk modeling and Monte Carlo methods
Scientific computing requiring dense matrix operations

Hardware Requirements

NVIDIA GPU with CUDA support (Kepler or later)
Tensor Core acceleration available on Volta (V100) and later GPUs
Mixed-precision and INT8 requires Turing (T4) or Ampere (A100) and later

Language Bindings

C and C++ (primary API)
Fortran (via cuBLAS Fortran interface)
Python (via nvmath-python and cuPy wrappers)

Connections

cuSOLVER — builds on cuBLAS for linear solvers and decompositions
cuBLASLt — flexible cuBLAS GEMM API for descriptor-driven tuning, heuristics, and low-precision matrix multiply
cuBLASXt — single-node multi-GPU cuBLAS host interface for BLAS3 operations
cuBLASDx — device-side BLAS-style operations for fused CUDA kernels
cuBLASMp — distributed dense linear algebra extension for multi-process GPU systems
cuSPARSE — sparse counterpart to cuBLAS dense operations
Incomplete-LU-Cholesky — iterative solver guide that combines cuBLAS vector operations with cuSPARSE sparse operations
Floating-Point-and-IEEE-754 — numerical accuracy guidance relevant to BLAS precision and reproducibility questions
cuTENSOR — extends dense algebra to arbitrary tensor contractions
CUTLASS — open-source GEMM templates that complement cuBLAS
nvmath-python — Python-friendly wrapper exposing cuBLAS and other math libs

AIPS BOOM

Explorer

cuBLAS

cuBLAS

Summary

Detail

Purpose

Key Features

Use Cases

Hardware Requirements

Language Bindings

Connections

Resources

Graph View

Table of Contents

Backlinks