cuTENSOR

Type: Technology
Tags: CUDA, NVIDIA, GPU, Tensor, Linear Algebra, Deep Learning, HPC, Math
Related: cuBLAS, cuDNN, CUTLASS, cuSOLVER, nvmath-python, cuTENSORMg, cuTENSORMp, cuQuantum, cuTensorNet, cuDensityMat
Sources: NVIDIA official documentation
Last Updated: 2026-04-09

Summary

cuTENSOR is NVIDIA’s GPU-accelerated tensor linear algebra library, providing high-performance tensor contraction, reduction, and elementwise operations. It leverages NVIDIA Tensor Cores (including TF32, 3xTF32, and DMMA modes) and supports arbitrary tensor dimensionality, block-sparse contractions, single-process multi-GPU execution through cuTENSORMg, and multi-process distributed scaling through cuTENSORMp. It is used in deep learning, quantum chemistry, and computational physics.

Detail

Purpose

Tensor contractions generalize matrix multiplication to higher-dimensional arrays and are a core operation in deep learning, quantum chemistry, and physics simulation. cuTENSOR provides an optimized implementation of these operations that is intended to outperform ad hoc approaches (for example, hand-written loops or reshaping tensors into matrices for GEMM calls) and that can use Tensor Core hardware on modern NVIDIA GPUs.
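To make "generalization of matrix multiplication" concrete, the following is a minimal pure-Python sketch, not cuTENSOR code. It evaluates the contraction C[a][b][c] = sum_k A[a][k][c] * B[k][b] twice: once with explicit loops, and once by flattening the free indices of A into a single row index so that an ordinary matrix product does the work. The function names and the example tensors are illustrative assumptions.

```python
def matmul(X, Y):
    """Plain matrix product: Z[i][j] = sum_k X[i][k] * Y[k][j]."""
    inner = len(Y)
    return [[sum(X[i][k] * Y[k][j] for k in range(inner))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def contract_direct(A, B):
    """C[a][b][c] = sum_k A[a][k][c] * B[k][b], written as explicit loops."""
    na, nk, nc = len(A), len(A[0]), len(A[0][0])
    nb = len(B[0])
    return [[[sum(A[a][k][c] * B[k][b] for k in range(nk))
              for c in range(nc)]
             for b in range(nb)]
            for a in range(na)]


def contract_via_matmul(A, B):
    """Same contraction, done by flattening (a, c) into one matrix row index."""
    na, nk, nc = len(A), len(A[0]), len(A[0][0])
    nb = len(B[0])
    # Permute A from (a, k, c) to (a, c, k) and flatten to an (a*c) x k matrix.
    A_mat = [[A[a][k][c] for k in range(nk)]
             for a in range(na) for c in range(nc)]
    C_mat = matmul(A_mat, B)  # shape (a*c) x b
    # Unflatten the row index and permute (a, c, b) back to (a, b, c).
    return [[[C_mat[a * nc + c][b] for c in range(nc)]
             for b in range(nb)]
            for a in range(na)]


A = [[[1, 2], [3, 4], [5, 6]],
     [[7, 8], [9, 10], [11, 12]]]   # shape (a=2, k=3, c=2)
B = [[1, 0], [0, 1], [2, 3]]        # shape (k=3, b=2)

C_direct = contract_direct(A, B)
C_matmul = contract_via_matmul(A, B)
print(C_direct == C_matmul)
```

Both routes give the same tensor. The second route needs extra permutation and reshaping passes over the data, which is the kind of overhead a dedicated contraction library aims to reduce.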

Key Features

Just-in-time (JIT) compiled kernels for tensor contraction
Plan-based multi-stage API for contraction, reduction, and elementwise ops
Support for arbitrarily dimensional tensor descriptors
TF32, 3xTF32, and DMMA compute type support for Tensor Cores
Block-sparse tensor contractions for sparsity exploitation
Expressive API enabling elementwise operation fusion
Mixed-precision computation, plus 64-bit (int64) extents for large tensor dimensions
cuTENSORMg: single-process multi-GPU tensor operations
cuTENSORMp: multi-process distributed tensor contractions
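As an illustration of why block sparsity is worth exploiting, here is a small pure-Python sketch of a block-sparse contraction in the 2-D case. It is a toy model under stated assumptions, not cuTENSOR's data layout or API: each operand is a hypothetical dictionary mapping a (block_row, block_col) pair to a dense block, and blocks that are entirely zero are simply not stored.

```python
def block_matmul(X, Y):
    """Dense product of two small blocks."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def block_add(X, Y):
    """Elementwise sum of two blocks of the same shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def block_sparse_contract(A_blocks, B_blocks):
    """C[i, j] = sum_k A[i, k] @ B[k, j], visiting only stored blocks.

    Returns the result blocks and the number of block products performed.
    """
    C_blocks = {}
    multiplies = 0
    for (i, k), a in A_blocks.items():
        for (k2, j), b in B_blocks.items():
            if k2 != k:
                continue  # contracted block indices must match
            prod = block_matmul(a, b)
            multiplies += 1
            if (i, j) in C_blocks:
                C_blocks[(i, j)] = block_add(C_blocks[(i, j)], prod)
            else:
                C_blocks[(i, j)] = prod
    return C_blocks, multiplies


# A is block-diagonal: the (0, 1) and (1, 0) blocks are zero and not stored.
A_blocks = {(0, 0): [[1, 2], [3, 4]],
            (1, 1): [[5, 6], [7, 8]]}
# B stores all four blocks of its 2 x 2 block grid.
B_blocks = {(0, 0): [[1, 0], [0, 1]],
            (0, 1): [[1, 1], [1, 1]],
            (1, 0): [[0, 1], [1, 0]],
            (1, 1): [[2, 0], [0, 2]]}

C_blocks, multiplies = block_sparse_contract(A_blocks, B_blocks)
print(multiplies)
```

With a 2 x 2 block grid, a fully dense product would perform 8 block multiplications; skipping the two absent blocks of A halves that count in this example.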

Use Cases

Deep learning training and inference (tensor operations in transformers)
Computer vision model acceleration
Quantum chemistry simulations (e.g., CCSD, MP2)
Computational physics (e.g., tensor network methods)
Scientific computing with multi-dimensional arrays

Hardware Requirements

NVIDIA GPU with CUDA support
Tensor Core acceleration on Volta (V100) and later
3xTF32 and DMMA modes on Ampere (A100) and later
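cuTENSORMg requires multiple GPUs visible to a single process; cuTENSORMp targets multi-process execution across multiple GPUs and, where applicable, multiple nodes with a suitable interconnect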
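The 3xTF32 mode listed above can be understood through the splitting idea usually associated with that name: each FP32 operand is written as a "big" TF32 part plus a "small" TF32 remainder, and three TF32 products recover most of the accuracy a single TF32 product loses. The sketch below emulates this on the CPU in pure Python. It is an assumption-laden illustration, not cuTENSOR's implementation: TF32 conversion is modeled as truncation of the low 13 mantissa bits (hardware may round instead), and the products are accumulated in Python's double precision.

```python
import struct


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_tf32(x):
    """Model TF32 by keeping sign, 8 exponent bits, and 10 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # clear the low 13 of FP32's 23 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def split_tf32(x):
    """Split an FP32 value into a TF32 'big' part and a TF32 'small' part."""
    big = to_tf32(x)
    small = to_tf32(x - big)
    return big, small


def mul_1xtf32(a, b):
    """One TF32 product: both operands lose their low mantissa bits."""
    return to_tf32(a) * to_tf32(b)


def mul_3xtf32(a, b):
    """Three TF32 products; the small*small term is dropped as negligible."""
    a_big, a_small = split_tf32(a)
    b_big, b_small = split_tf32(b)
    return a_big * b_big + (a_big * b_small + a_small * b_big)


a = to_fp32(1.2345678)
b = to_fp32(7.6543210)
exact = a * b  # product of two FP32 values is exact in double precision

err_1x = abs(mul_1xtf32(a, b) - exact)
err_3x = abs(mul_3xtf32(a, b) - exact)
print(err_1x, err_3x)
```

For these inputs the three-product result is far closer to the FP32 reference than the single TF32 product, at roughly three times the multiply cost.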

Language Bindings

C and C++ (primary API)
Python (via nvmath-python and CuPy wrappers)

Connections

cuBLAS - handles 2D matrix operations; cuTENSOR extends them to N-D tensors
cuDNN - provides deep learning primitives; cuTENSOR provides general-purpose tensor primitives at a lower level
CUTLASS - provides GEMM templates; cuTENSOR provides higher-level tensor contraction
nvmath-python - Python-accessible interface for cuTENSOR operations
cuTENSORMg - single-process multi-GPU cuTENSOR support
cuTENSORMp - multi-process distributed cuTENSOR support
cuTensorNet - current cuQuantum tensor-network component built on cuTENSOR
cuDensityMat - current cuQuantum analog-dynamics component that lists cuTENSOR as a prerequisite
