TensorRT-LLM

Type: Technology Tags: CUDA, NVIDIA, GPU, LLM, Inference, Optimization, Transformer, Generative AI, CUDA-X Related: TensorRT, Triton-Inference-Server, NVIDIA-AIPerf, NVIDIA-GenAI-Perf, Triton-Performance-Analyzer, Triton-Model-Analyzer, Triton-Model-Navigator, NeMo-Export-Deploy, NeMo-Megatron-Bridge, NeMo-AutoModel, Red-Hat-AI-Factory-with-NVIDIA, FlashInfer, NVIDIA-NeMo, Megatron-LM, cuDNN, CUTLASS, NIXL Sources: NVIDIA official documentation, developer.nvidia.com/tensorrt-llm, https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html, https://docs.nvidia.com/aiperf/getting-started/profiling-with-ai-perf, https://docs.nvidia.com/nemo/export-deploy/latest/index.html, https://docs.nvidia.com/nemo/export-deploy/latest/apidocs/nemo_export/nemo_export.html, https://docs.nvidia.com/ai-enterprise/deployment/red-hat-ai-factory/latest/overview.html Last Updated: 2026-04-29

Summary

TensorRT-LLM is NVIDIA’s open-source library for optimizing and deploying large language model (LLM) inference at maximum throughput and minimum latency on NVIDIA GPUs. It provides a Python API for defining and compiling LLM inference engines, incorporating state-of-the-art optimizations including in-flight batching (continuous batching), paged KV cache, custom attention kernels (Flash Attention), tensor parallelism, pipeline parallelism, quantization (INT4, INT8, FP8), and speculative decoding. TensorRT-LLM is the backend for NVIDIA Triton’s LLM serving and the primary path to production LLM deployment on NVIDIA hardware.

Detail

Purpose

TensorRT-LLM solves the performance gap between raw LLM inference and hardware-optimal execution. Naive LLM deployment achieves 10-30% of theoretical GPU throughput; TensorRT-LLM closes this gap through aggressive kernel fusion, custom attention implementations, quantization, and dynamic batching strategies, delivering 2-8x throughput improvements over baseline PyTorch inference for production LLM serving.

Key Features

  • In-flight batching (continuous batching): dynamically combines requests of different lengths for maximum GPU utilization
  • Paged KV cache: manages KV cache memory as virtual pages to support arbitrary context lengths
  • Flash Attention and Multi-Head Latent Attention (MLA) custom CUDA kernels
  • Tensor Parallelism (TP) and Pipeline Parallelism (PP) for multi-GPU inference
  • Weight-only INT4 and INT8 quantization (AWQ, GPTQ, SmoothQuant)
  • FP8 quantization and execution on Hopper/Blackwell GPUs
  • Speculative decoding: draft model + target model for reduced decode latency
  • Medusa, EAGLE speculative decoding variants
  • LoRA adapter support for serving multiple fine-tuned variants from one base model
  • Chunked prefill for long-context generation
  • KV cache disaggregation via NIXL for prefill-decode separation
  • Supports: GPT, Llama, Mistral, Mixtral, Falcon, Nemotron, Gemma, Phi, Qwen, BLOOM, and more
  • Python Model Definition API for building custom LLM engines
  • Triton Inference Server backend for production serving
  • NeMo Export-Deploy path for exporting NeMo, AutoModel, Megatron Bridge, and Hugging Face checkpoints into TensorRT-LLM deployment flows.
  • Red-Hat-AI-Factory-with-NVIDIA lists TensorRT-LLM as one of the engines that can power NIM inference microservices in the OpenShift AI stack.

Use Cases

  • Production LLM API serving (ChatGPT-style applications)
  • RAG (Retrieval-Augmented Generation) system backends
  • Code generation and completion services
  • Multimodal LLM inference (vision-language models)
  • Multi-LoRA serving: one base model serving many fine-tuned adapters simultaneously
  • Long-context document processing (128K+ token contexts)
  • Batch offline LLM inference for data processing pipelines

Hardware Requirements

  • NVIDIA GPU with CUDA Compute Capability 7.0+ (Volta minimum for FP16)
  • A100 (Ampere): INT8, FP16 quantization; strong multi-GPU TP support
  • H100 (Hopper): FP8 quantization and execution, NVLink 4.0 for TP
  • H200/B100/B200 (Blackwell): next-generation FP8 and FP4 support
  • NVLink required for tensor parallelism (TP>1) on same node
  • InfiniBand for multi-node pipeline parallelism
  • CUDA 12.x required for full feature set

Language Bindings

  • Python (Model Definition API, tensorrt_llm package)
  • C++ (runtime and backend code)
  • REST/gRPC via Triton Inference Server integration

Connections

  • TensorRT — TensorRT-LLM is built on TensorRT’s compilation and runtime infrastructure
  • Triton-Inference-Server — Triton’s tensorrtllm backend wraps TRT-LLM for production serving
  • NVIDIA-AIPerf - current NVIDIA benchmarking tool for OpenAI-compatible LLM services backed by TensorRT-LLM or adjacent engines.
  • NVIDIA-GenAI-Perf - older Triton generative AI benchmark examples include TensorRT-LLM-backed serving.
  • Triton-Performance-Analyzer, Triton-Model-Analyzer, and Triton-Model-Navigator - Triton toolchain for measuring, tuning, and preparing served TensorRT/Triton models.
  • NeMo-Export-Deploy - current NeMo library that exports and deploys LLM/MM checkpoints through TensorRT-LLM, Triton, and related runtimes.
  • NeMo-Megatron-Bridge and NeMo-AutoModel - upstream NeMo training/checkpoint sources that can feed TensorRT-LLM deployment paths.
  • Red-Hat-AI-Factory-with-NVIDIA - OpenShift AI deployment guide where NIM may be powered by TensorRT-LLM.
  • FlashInfer — FlashInfer attention kernels integrated as an optional attention backend
  • CUTLASS — custom GEMM and attention kernels in TRT-LLM use CUTLASS templates
  • NVIDIA-NeMo — NeMo-trained LLMs are exported and deployed via TRT-LLM
  • Megatron-LM — Megatron checkpoint format supported for direct TRT-LLM conversion
  • NIXL — NIXL provides KV cache transfer for disaggregated prefill-decode serving
  • cuDNN — cuDNN fused attention used as an alternative attention backend in TRT-LLM

Resources