cuBLASLt
Type: Technology
Tags: NVIDIA, CUDA, cuBLAS, GEMM, Tensor Cores, linear algebra, CUDA-X
Related: cuBLAS, cuBLASMp, cuBLASXt, cuBLASDx, CUTLASS, TensorRT, TensorRT-LLM, Transformer-Engine, PyTorch, JAX, NVIDIA-CUDA
Sources: https://docs.nvidia.com/cuda/cublas/index.html#using-the-cublaslt-api
Last Updated: 2026-04-29
Summary
cuBLASLt is NVIDIA’s lightweight cuBLAS API dedicated to programmable GEMM operations. It extends the core cuBLAS dense linear algebra stack with descriptor-driven matrix multiplication, flexible data layouts, mixed- and narrow-precision support, fused epilogues, explicit algorithm selection, and heuristic search for Tensor Core-oriented workloads.
Detail
Purpose
Standard BLAS calls are intentionally compact, but modern AI and HPC GEMM workloads often need more control over layouts, data types, scaling, epilogues, and algorithm choices. cuBLASLt gives developers a more configurable GEMM interface while keeping execution inside NVIDIA’s optimized cuBLAS library path.
Key capabilities
- Descriptor-based GEMM through matmul, matrix-layout, transform, preference, and algorithm descriptors (a worked sketch follows this list).
- Heuristic selection APIs for finding suitable GEMM algorithms under workspace, layout, precision, and performance constraints.
- Support for mixed- and narrow-precision GEMM, including the FP8 and FP4 scaling modes described in current CUDA documentation (see the scaling fragment below).
- Epilogue options that fuse common post-GEMM operations, such as the bias-and-activation patterns common in transformers.
- Reusable operation descriptors: describe a matmul once, then execute it repeatedly for matching input shapes and types.
- Logging and heuristics-cache controls for observability and tuning (see the logging sketch below).
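As a concrete illustration of the descriptor flow, heuristic query, epilogue fusion, and descriptor reuse listed above, here is a minimal sketch of an FP16 GEMM with a fused bias-plus-ReLU epilogue. The API calls are from the public cuBLASLt C API; the helper function, error macro, and buffer ownership are illustrative assumptions, not part of the library.

```cpp
// Build (assumed): nvcc lt_gemm.cu -lcublasLt
#include <cublasLt.h>
#include <cuda_fp16.h>
#include <cstdio>
#include <cstdlib>

#define CHECK_LT(call)                                                  \
    do {                                                                \
        cublasStatus_t s_ = (call);                                     \
        if (s_ != CUBLAS_STATUS_SUCCESS) {                              \
            fprintf(stderr, "cuBLASLt error %d at line %d\n",           \
                    (int)s_, __LINE__);                                 \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// D = relu(A @ B + bias): FP16 in/out, FP32 accumulation, column-major.
// A is m x k, B is k x n, D is m x n; bias holds m FP16 values.
void lt_gemm_bias_relu(cublasLtHandle_t lt, int m, int n, int k,
                       const __half* A, const __half* B,
                       const __half* bias, __half* D,
                       void* workspace, size_t workspaceSize,
                       cudaStream_t stream)
{
    float alpha = 1.0f, beta = 0.0f;

    // 1. Matmul descriptor: compute type, scale type, fused epilogue.
    cublasLtMatmulDesc_t op;
    CHECK_LT(cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F));
    cublasLtEpilogue_t epi = CUBLASLT_EPILOGUE_RELU_BIAS;
    CHECK_LT(cublasLtMatmulDescSetAttribute(
        op, CUBLASLT_MATMUL_DESC_EPILOGUE, &epi, sizeof(epi)));
    CHECK_LT(cublasLtMatmulDescSetAttribute(
        op, CUBLASLT_MATMUL_DESC_BIAS_POINTER, &bias, sizeof(bias)));

    // 2. Matrix layout descriptors (leading dimension = rows here).
    cublasLtMatrixLayout_t aL, bL, dL;
    CHECK_LT(cublasLtMatrixLayoutCreate(&aL, CUDA_R_16F, m, k, m));
    CHECK_LT(cublasLtMatrixLayoutCreate(&bL, CUDA_R_16F, k, n, k));
    CHECK_LT(cublasLtMatrixLayoutCreate(&dL, CUDA_R_16F, m, n, m));

    // 3. Preference + heuristic query: let the library propose an
    //    algorithm that fits the caller's workspace budget.
    cublasLtMatmulPreference_t pref;
    CHECK_LT(cublasLtMatmulPreferenceCreate(&pref));
    CHECK_LT(cublasLtMatmulPreferenceSetAttribute(
        pref, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
        &workspaceSize, sizeof(workspaceSize)));
    cublasLtMatmulHeuristicResult_t heur;
    int found = 0;
    CHECK_LT(cublasLtMatmulAlgoGetHeuristic(
        lt, op, aL, bL, dL, dL, pref, /*requested=*/1, &heur, &found));
    if (found < 1) { fprintf(stderr, "no algorithm fits\n"); exit(1); }

    // 4. Execute. C and D share a layout here (in-place, beta = 0).
    CHECK_LT(cublasLtMatmul(lt, op, &alpha, A, aL, B, bL, &beta,
                            D, dL, D, dL, &heur.algo,
                            workspace, workspaceSize, stream));

    // The descriptors and heur.algo could instead be cached and reused
    // for every later call with identical shapes, types, and layouts.
    cublasLtMatmulPreferenceDestroy(pref);
    cublasLtMatrixLayoutDestroy(dL);
    cublasLtMatrixLayoutDestroy(bL);
    cublasLtMatrixLayoutDestroy(aL);
    cublasLtMatmulDescDestroy(op);
}
```

Caching the descriptors and the returned algorithm is the "plan once, execute many" pattern noted in the list above: the heuristic query is the expensive step, while repeated cublasLtMatmul calls with the same plan are cheap.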
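For the narrow-precision modes, scale factors are attached through matmul-descriptor attributes rather than separate entry points. A hedged fragment, assuming dAScale, dBScale, and dDScale are caller-managed device pointers to single FP32 scale values (the attribute names are from the cuBLASLt headers):

```cpp
#include <cublasLt.h>

// Attach per-tensor FP8 scale factors to an existing matmul descriptor.
void set_fp8_scales(cublasLtMatmulDesc_t op, const float* dAScale,
                    const float* dBScale, const float* dDScale)
{
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_A_SCALE_POINTER,
                                   &dAScale, sizeof(dAScale));
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_B_SCALE_POINTER,
                                   &dBScale, sizeof(dBScale));
    // The D scale applies when the output itself is an FP8 type.
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_D_SCALE_POINTER,
                                   &dDScale, sizeof(dDScale));
    // The A/B layouts would use an FP8 type such as CUDA_R_8F_E4M3; FP8
    // matmuls also restrict the allowed transpose/layout combinations.
}
```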
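Logging can be switched on without code changes through environment variables, or programmatically through the logger API; recent documentation also exposes heuristics-cache capacity controls. A sketch (the level value, file name, and cache capacity are illustrative):

```cpp
#include <cublasLt.h>

// Assumed invocation: CUBLASLT_LOG_LEVEL=5 CUBLASLT_LOG_FILE=lt.log ./app
// achieves the same effect as the calls below without recompiling.
void enable_lt_observability()
{
    cublasLtLoggerSetLevel(5);         // 5 = most verbose API trace
    cublasLtLoggerOpenFile("lt.log");  // route log output to a file
    // Heuristics-cache capacity control (entries, not bytes), hedged as
    // an assumption about availability in the installed cuBLAS version:
    cublasLtHeuristicsCacheSetCapacity(1024);
}
```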
NVIDIA context
cuBLASLt is one of the central low-level CUDA-X paths behind high-performance transformer and inference stacks. It sits below frameworks and compilers such as PyTorch, JAX, TensorRT, TensorRT-LLM, and Transformer-Engine, while CUTLASS provides a lower-level template route for custom kernels when library APIs are not enough.
Connections
- cuBLAS - cuBLASLt is documented as part of the cuBLAS library and complements the standard cuBLAS API.
- cuBLASXt - sibling cuBLAS API for single-node multi-GPU BLAS3 execution.
- cuBLASMp - distributed dense linear algebra path for multi-process systems.
- cuBLASDx - device-side BLAS path for fusing dense operations into CUDA kernels.
- CUTLASS - template library for custom GEMM kernels when application-specific tiling or fusion is needed.
- TensorRT and TensorRT-LLM - production inference stacks that depend on optimized GEMM and low-precision CUDA math.
- Transformer-Engine - NVIDIA's transformer precision-management layer, adjacent to the FP8/FP4-era GEMM capabilities that cuBLASLt exposes.
- PyTorch and JAX - GPU frameworks whose NVIDIA paths commonly rely on cuBLAS/cuBLASLt through backend libraries and compilers.
Source Excerpts
- NVIDIA describes cuBLASLt as a lightweight library dedicated to GEMM operations, adding flexibility in matrix data layouts, input and compute types, and in the choice of algorithmic implementations and heuristics through parameter programmability.
- The current cuBLAS 13.2 documentation lists cuBLASLt alongside the standard cuBLAS API, cuBLASXt, and cuBLASDx.