Triton Performance Analyzer
Type: Tool
Tags: NVIDIA, Triton, Performance Analyzer, benchmarking, inference, latency, throughput, model serving
Related: Triton-Inference-Server, Triton-Model-Analyzer, NVIDIA-AIPerf, NVIDIA-GenAI-Perf, TensorRT, TensorRT-LLM, NVIDIA-NIM, NIM-for-LLM-Benchmarking-Guide
Sources: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/README.html, https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_benchmark/perf_analyzer.html
Last Updated: 2026-04-29
Summary
Triton Performance Analyzer (perf_analyzer) is NVIDIA’s command-line tool for measuring and optimizing the inference performance of models served through Triton-Inference-Server. It helps teams measure how latency and throughput change as they vary batching, request load, model configuration, and deployment strategy.
Detail
Purpose
Triton users need a repeatable way to measure serving behavior while tuning model configuration, batching, and request shape. Performance Analyzer generates load against a running Triton server, measures latency and throughput over stabilized windows, and surfaces the throughput-latency tradeoff for a given model and hardware target.
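For a sense of the workflow, a minimal run might sweep client concurrency against a gRPC endpoint and record per-concurrency results. This is a sketch: the model name, endpoint, and output file are placeholders, while the flags are standard perf_analyzer options.

```bash
# Sweep concurrency from 1 to 8 in steps of 2, reporting p95 latency,
# against a Triton gRPC endpoint; write per-concurrency results to CSV.
perf_analyzer -m densenet_onnx \
  -u localhost:8001 -i grpc \
  --concurrency-range 1:8:2 \
  --percentile=95 \
  -f perf_results.csv
```

Each row of the CSV pairs one concurrency level with its measured throughput and latency percentiles, which is the raw material for picking an operating point.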
Current scope
- Concurrency mode, request-rate mode, and custom-interval mode for load generation (one sketch per mode follows this list).
- Time-window and count-window measurement modes for stabilization (see the measurement-mode sketch below).
- Support for standard models plus sequence, ensemble, and decoupled models (streaming and sequence sketches below).
- Auto-generated or file-specified input data and optional output verification (input-data sketch below).
- Triton HTTP/gRPC endpoint benchmarking.
- Installation and execution through Triton SDK containers or locally built binaries (container example below).
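To make the load-generation modes concrete, the sketches below show one invocation per mode; my_model and intervals.txt are placeholders.

```bash
# Concurrency mode: keep a fixed number of requests outstanding,
# sweeping from 2 to 16 concurrent requests in steps of 2.
perf_analyzer -m my_model --concurrency-range 2:16:2

# Request-rate mode: issue requests at a target rate (requests/sec),
# here swept from 100 to 500; arrivals are evenly spaced by default
# and can be randomized with --request-distribution poisson.
perf_analyzer -m my_model --request-rate-range 100:500:100 \
  --request-distribution poisson

# Custom-interval mode: replay inter-request gaps (in microseconds)
# read from a user-supplied file.
perf_analyzer -m my_model --request-intervals intervals.txt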
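The two stabilization strategies map to the --measurement-mode flag; the mode names and companion flags below match the documented options, with my_model again a placeholder.

```bash
# Time windows (the default): measure over fixed-length intervals,
# here 5000 ms per window, until results stabilize across windows.
perf_analyzer -m my_model --measurement-mode time_windows \
  --measurement-interval 5000

# Count windows: measure over windows of a fixed number of requests,
# here 500 completed requests per window.
perf_analyzer -m my_model --measurement-mode count_windows \
  --measurement-request-count 500
```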
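Decoupled models stream responses back over gRPC, so benchmarking them requires the streaming API; sequence models are exercised by emulating several concurrent sequences. A sketch, assuming hypothetical models named my_decoupled_model and my_sequence_model:

```bash
# Decoupled models must be benchmarked over gRPC with streaming enabled.
perf_analyzer -m my_decoupled_model -i grpc --streaming

# Stateful sequence models: emulate several concurrent sequences.
perf_analyzer -m my_sequence_model --num-of-sequences 8
```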
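File-specified input data uses perf_analyzer's JSON format; the tensor name INPUT0 and the values below are illustrative, not tied to any real model.

```bash
# input_data.json -- one request's worth of data for a tensor named INPUT0:
# {
#   "data": [
#     { "INPUT0": [1.0, 2.0, 3.0, 4.0] }
#   ]
# }

# Feed the file to perf_analyzer; --shape declares the dimensions
# of inputs whose shape is variable in the model configuration.
perf_analyzer -m my_model --input-data input_data.json --shape INPUT0:4
```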
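perf_analyzer ships prebuilt in the Triton SDK container, which is the usual way to run it; the release tag below is a placeholder, and the target server is assumed to already be listening on the host.

```bash
# Run perf_analyzer from the Triton SDK container against a server
# on the host network; <yy.mm> is the Triton release tag.
docker run --rm --net=host nvcr.io/nvidia/tritonserver:<yy.mm>-py3-sdk \
  perf_analyzer -m my_model -u localhost:8001 -i grpc --concurrency-range 4
```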
NVIDIA context
Performance Analyzer is the Triton-native performance tool for general model serving. For generative AI endpoints and OpenAI-compatible LLM services, NVIDIA-AIPerf is the current forward path; NVIDIA-GenAI-Perf is its legacy companion, now being phased out.
Connections
- Triton-Inference-Server - primary server measured by Performance Analyzer.
- Triton-Model-Analyzer - builds on performance measurements to search Triton model configurations and generate reports.
- NVIDIA-AIPerf - current generative AI benchmark tool for OpenAI-compatible endpoints.
- NVIDIA-GenAI-Perf - older generative AI benchmarking tool in the Triton performance family.
- TensorRT - optimized TensorRT models served through Triton can be benchmarked with Performance Analyzer.
- TensorRT-LLM - LLM serving may pair AIPerf/GenAI-Perf for generative metrics with Triton tooling for server-side tuning.
- NVIDIA-NIM and NIM-for-LLM-Benchmarking-Guide - NIM services apply the same latency and throughput measurement concepts.
Source Excerpts
- NVIDIA docs describe Performance Analyzer as a CLI that helps optimize Triton model inference performance.
- Current docs list concurrency, request-rate, and custom-interval load modes.