cuTENSOR
Type: Technology Tags: CUDA, NVIDIA, GPU, Tensor, Linear Algebra, Deep Learning, HPC, Math Related: cuBLAS, cuDNN, CUTLASS, cuSOLVER, nvmath-python, cuTENSORMg, cuTENSORMp, cuQuantum, cuTensorNet, cuDensityMat Sources: NVIDIA official documentation Last Updated: 2026-04-09
Summary
cuTENSOR is NVIDIA’s GPU-accelerated tensor linear algebra library, providing high-performance tensor contraction, reduction, and elementwise operations. It leverages NVIDIA Tensor Cores (including TF32, 3xTF32, and DMMA modes) and supports arbitrary tensor dimensionality, block-sparse contractions, single-process multi-GPU execution through cuTENSORMg, and multi-process distributed scaling through cuTENSORMp. It is used in deep learning, quantum chemistry, and computational physics.
Detail
Purpose
Tensor contractions are the generalization of matrix multiplication to higher-dimensional arrays — a core operation in deep learning, quantum chemistry, and physics simulation. cuTENSOR provides a highly optimized, hardware-native implementation of these operations that outperforms ad-hoc approaches and fully utilizes Tensor Core hardware on modern NVIDIA GPUs.
Key Features
- Just-in-time (JIT) compiled kernels for tensor contraction
- Plan-based multi-stage API for contraction, reduction, and elementwise ops
- Support for arbitrarily dimensional tensor descriptors
- TF32, 3xTF32, and DMMA compute type support for Tensor Cores
- Block-sparse tensor contractions for sparsity exploitation
- Expressive API enabling elementwise operation fusion
- Mixed precision with int64 extents for large tensor dimensions
- cuTENSORMg: single-process multi-GPU tensor operations.
- cuTENSORMp: multi-process distributed tensor contractions.
Use Cases
- Deep learning training and inference (tensor operations in transformers)
- Computer vision model acceleration
- Quantum chemistry simulations (e.g., CCSD, MP2)
- Computational physics (e.g., tensor network methods)
- Scientific computing with multi-dimensional arrays
Hardware Requirements
- NVIDIA GPU with CUDA support
- Tensor Core acceleration on Volta (V100) and later
- 3xTF32 and DMMA modes on Ampere (A100) and later
- cuTENSORMg and cuTENSORMp require multi-GPU and/or multi-node interconnect context.
Language Bindings
- C and C++ (primary API)
- Python (via nvmath-python and CuPy wrappers)
Connections
- cuBLAS — cuBLAS handles 2D matrix operations; cuTENSOR extends to N-D tensors
- cuDNN — cuDNN uses tensor operations internally; cuTENSOR provides the low-level primitive
- CUTLASS — CUTLASS provides GEMM templates; cuTENSOR provides higher-level tensor contraction
- nvmath-python — Python-accessible interface for cuTENSOR operations
- cuTENSORMg - single-process multi-GPU cuTENSOR support.
- cuTENSORMp - multi-process distributed cuTENSOR support.
- cuTensorNet - current cuQuantum tensor-network component built on cuTENSOR.
- cuDensityMat - current cuQuantum analog-dynamics component that lists cuTENSOR as a prerequisite.