CUDA Hopper Tuning Guide

Type: Guide Tags: NVIDIA, CUDA, Hopper, performance, tuning, TMA, thread block clusters, DPX, NVLink Related: CUDA-Hopper-Compatibility-Guide, NVIDIA-Hopper-Architecture, CUDA-Best-Practices-Guide, CUDA-Programming-Guide, Nsight-Compute, NVLink, CUDA-Graphs Sources: https://docs.nvidia.com/cuda/hopper-tuning-guide/index.html Last Updated: 2026-04-29

Summary

The CUDA Hopper Tuning Guide is NVIDIA’s architecture-specific guide for optimizing CUDA applications on Hopper GPUs. It covers Hopper features such as Tensor Memory Accelerator, thread block clusters, distributed shared memory, DPX instructions, inline compression, and fourth-generation NVLink.

Detail

Purpose

Hopper introduced new performance features beyond Ampere while preserving the CUDA programming model. This tuning guide explains where developers should look after applying general CUDA best practices.

Key tuning areas

Tensor Memory Accelerator for asynchronous data movement.
Thread block clusters and distributed shared memory.
Improved FP32 throughput and Tensor Core behavior.
DPX instructions for dynamic programming workloads.
HBM3/HBM3e memory, larger L2, inline compression, and unified shared memory/L1/texture cache behavior.
Fourth-generation NVLink for multi-GPU bandwidth.

NVIDIA context

This page should be linked from Hopper, CUDA performance, Nsight, and LLM/HPC optimization discussions. Hopper is still a major deployed base for H100, H200, GH200, DGX H100/H200, and many AI Enterprise environments.

Connections

CUDA-Hopper-Compatibility-Guide - compatibility check before Hopper-specific performance tuning.
NVIDIA-Hopper-Architecture - architecture-level page for Hopper.
CUDA-Best-Practices-Guide - general performance recommendations.
CUDA-Programming-Guide - deeper reference for clusters, memory hierarchy, and synchronization.
Nsight-Compute - kernel profiler for tuning Hopper kernels.
NVLink - Hopper tuning includes fourth-generation NVLink behavior.

Source Excerpts

NVIDIA’s Hopper tuning guide highlights TMA, thread block clusters, distributed shared memory, DPX instructions, inline compression, and NVLink 4.

Resources

CUDA Hopper Tuning Guide

AIPS BOOM

Explorer

CUDA-Hopper-Tuning-Guide