NVIDIA Dynamo
Type: Platform
Tags: NVIDIA, Dynamo, inference, LLM serving, Kubernetes, NIM, disaggregated serving
Related: Dynamo-Disaggregated-Serving, Dynamo-KV-Cache-Aware-Routing, Dynamo-KV-Block-Manager, Dynamo-Planner, Dynamo-Profiler, NVIDIA-NIM, Red-Hat-AI-Factory-with-NVIDIA, NVIDIA-CMX, NVIDIA-AI-Data-Platform, NVIDIA-Grove, KAI-Scheduler, TensorRT-LLM, Triton-Inference-Server, NVIDIA-AIPerf, NVIDIA-GenAI-Perf, NIXL, NVIDIA-NeMo, NVIDIA-DGX
Sources: https://docs.nvidia.com/dynamo/index.html, https://docs.nvidia.com/dynamo/latest/getting-started/introduction, https://docs.nvidia.com/dynamo/design-docs/disaggregated-serving, https://docs.nvidia.com/dynamo/latest/user-guides/kv-cache-aware-routing, https://docs.nvidia.com/dynamo/latest/components/kvbm, https://docs.nvidia.com/dynamo/latest/components/planner, https://docs.nvidia.com/dynamo/latest/components/profiler, https://docs.nvidia.com/aiperf/architecture-internals/architecture-of-ai-perf, https://docs.nvidia.com/ai-enterprise/deployment/red-hat-ai-factory/latest/overview.html, https://docs.nvidia.com/ai-enterprise/deployment/red-hat-ai-factory/latest/network-operator.html, https://www.nvidia.com/en-us/data-center/ai-storage/cmx/, https://www.nvidia.com/en-us/data-center/ai-data-platform/
Last Updated: 2026-04-29
Summary
NVIDIA Dynamo is NVIDIA's inference-serving framework for running and deploying high-performance AI inference services locally or on Kubernetes. It is closely connected to the modern NVIDIA inference stack around NVIDIA-NIM, TensorRT-LLM, Triton-Inference-Server, and disaggregated-serving components such as NIXL.
Detail
Purpose
AI factories need inference services that scale beyond a single process or a single deployment pattern. Dynamo provides a documented path for standing up that infrastructure, from CLI-based local/VM quickstarts to Kubernetes deployment guides.
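For orientation, here is a minimal client-side sketch against a locally running Dynamo frontend, which exposes an OpenAI-compatible HTTP API. The port, base URL, and model id below are assumptions chosen for illustration, not fixed values from the docs.

```python
import requests

# Hypothetical local endpoint and placeholder model id; Dynamo's frontend
# speaks the OpenAI-compatible chat-completions protocol, but the port and
# model name here are illustrative assumptions.
BASE_URL = "http://localhost:8000/v1"
MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Summarize disaggregated serving."}],
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```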
Key capabilities
- Local or VM quickstart through the Dynamo CLI.
- Dynamo-Disaggregated-Serving for separating prefill and decode workers.
- Dynamo-KV-Cache-Aware-Routing for request placement based on KV cache overlap and worker load.
- Dynamo-KV-Block-Manager for KV cache offload, sharing, and memory tiering.
- Dynamo-Planner for SLA-aware autoscaling around time to first token (TTFT), inter-token latency (ITL), and traffic changes (see the sketch after this list).
- Dynamo-Profiler for profiling model/backend/hardware combinations and generating deployment recommendations.
- Container-based install path with dependencies packaged into images.
- Kubernetes installation and quickstart path for cluster deployments.
- Designed to fit NVIDIA inference runtimes and model-serving workflows, with a focus on large language model serving, disaggregation, and operational deployment.
- Red-Hat-AI-Factory-with-NVIDIA references Dynamo with NIXL as part of distributed inference support for llm-d and large scale-out AI factory workloads.
- AIPerf-style benchmarking is relevant for measuring distributed serving latency, throughput, GPU telemetry, and server metrics around Dynamo-style deployments.
- Kubernetes multinode deployments can use NVIDIA-Grove and KAI-Scheduler for topology-aware pod placement, gang scheduling, and coordinated scaling.
- NVIDIA-CMX positions Dynamo as the serving layer that can route requests with awareness of where reusable KV cache resides in a context-memory tier.
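To make the Planner bullet above concrete, the core SLA loop can be pictured as: scale prefill capacity when TTFT drifts past target (TTFT is dominated by prefill work), and scale decode capacity when ITL does (ITL is dominated by decode contention). This is an illustrative reduction of the idea, not Dynamo-Planner's actual algorithm; the thresholds, headroom factor, and per-pool worker counts are assumptions.

```python
from dataclasses import dataclass

@dataclass
class SlaTargets:
    ttft_ms: float  # time-to-first-token target
    itl_ms: float   # inter-token-latency target

@dataclass
class Observed:
    ttft_ms: float
    itl_ms: float

def plan_replicas(obs: Observed, sla: SlaTargets,
                  prefill_workers: int, decode_workers: int,
                  headroom: float = 0.9) -> tuple[int, int]:
    """Illustrative SLA-driven scaling decision, not Planner's real logic.
    Each SLA signal maps to its own worker pool in a disaggregated setup."""
    if obs.ttft_ms > sla.ttft_ms:
        prefill_workers += 1      # prefill queueing: add prefill capacity
    elif obs.ttft_ms < headroom * sla.ttft_ms and prefill_workers > 1:
        prefill_workers -= 1      # comfortable headroom: scale in
    if obs.itl_ms > sla.itl_ms:
        decode_workers += 1       # decode contention: add decode capacity
    elif obs.itl_ms < headroom * sla.itl_ms and decode_workers > 1:
        decode_workers -= 1
    return prefill_workers, decode_workers
```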
NVIDIA context
Dynamo is part of NVIDIA’s fast-moving inference layer, bridging model-serving systems and AI factory operations. It should be tracked alongside NVIDIA-NIM, NIXL, TensorRT-LLM, and Triton-Inference-Server for current production inference architecture.
Connections
- Dynamo-Disaggregated-Serving - core Dynamo pattern for splitting prefill and decode work.
- Dynamo-KV-Cache-Aware-Routing - router mode that places requests based on cache overlap and worker load (sketched after this list).
- Dynamo-KV-Block-Manager - KVBM memory-management layer for KV cache offload and reuse.
- Dynamo-Planner - autoscaler for LLM-specific latency and throughput goals.
- Dynamo-Profiler - profiling and configuration-discovery component for Dynamo deployments.
- NVIDIA-NIM - Dynamo complements NIM-based model deployment and serving.
- Red-Hat-AI-Factory-with-NVIDIA - OpenShift AI deployment guide that references Dynamo and NIXL for distributed inference.
- NVIDIA-CMX - context memory storage layer for long-context inference and KV-cache reuse.
- NVIDIA-AI-Data-Platform - AI Data Platform lists centralized cache for distributed inference with Dynamo as a reference workflow.
- NVIDIA-Grove - Grove provides declarative Kubernetes orchestration for multi-component Dynamo inference systems.
- KAI-Scheduler - KAI handles topology-aware and gang-scheduled placement for Grove/Dynamo deployments.
- NIXL - transfer library used by disaggregated inference for high-throughput KV cache and tensor movement.
- TensorRT-LLM - optimized LLM engine commonly associated with NVIDIA inference serving.
- Triton-Inference-Server - established NVIDIA inference server used across model modalities.
- NVIDIA-AIPerf - current NVIDIA benchmark tool for load generation, latency/throughput measurement, GPU telemetry, and server metrics against inference endpoints.
- NVIDIA-GenAI-Perf - predecessor generative AI benchmarking tool, kept for legacy workflow lookup.
- NVIDIA-DGX - representative infrastructure for large inference deployments.
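As a concrete picture of the cache-overlap routing mentioned above, a router can score each worker by the prefill work it would still have to do (request blocks minus reusable cached blocks), weighted, plus its current load, and pick the cheapest. This is a deliberately simplified sketch of the technique; the block counting, weight, and load metric are assumptions, not Dynamo's implementation.

```python
from dataclasses import dataclass

@dataclass
class WorkerState:
    name: str
    cached_prefix_blocks: int  # KV blocks of this prompt already resident
    active_blocks: int         # KV blocks held by in-flight requests (load proxy)

def route(workers: list[WorkerState], request_blocks: int,
          overlap_weight: float = 1.0) -> WorkerState:
    """Pick the worker with the lowest estimated cost: remaining prefill
    blocks (request minus reusable overlap), weighted, plus current load.
    Simplified illustration only."""
    def cost(w: WorkerState) -> float:
        new_prefill = max(request_blocks - w.cached_prefix_blocks, 0)
        return overlap_weight * new_prefill + w.active_blocks
    return min(workers, key=cost)

# Example: "b" wins despite higher load because it can reuse most of the
# prompt's KV cache: cost(b) = 16 + 10 = 26 vs cost(a) = 64 + 4 = 68.
workers = [
    WorkerState("a", cached_prefix_blocks=0, active_blocks=4),
    WorkerState("b", cached_prefix_blocks=48, active_blocks=10),
]
print(route(workers, request_blocks=64).name)  # -> "b"
```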
Source Excerpts
- NVIDIA Dynamo docs describe Dynamo as a high-throughput, low-latency inference framework for distributed generative AI workloads.
- Current docs highlight disaggregated serving, KV cache-aware routing, and KV cache offloading as composable system-level performance techniques.
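The "KV cache offloading" technique named in the excerpts can be pictured as a tiered block store: hot KV blocks stay in GPU memory, and evicted blocks spill to a host-memory tier where they remain reusable instead of being recomputed. A minimal LRU sketch follows; the two-tier layout and eviction policy are illustrative assumptions, not Dynamo-KV-Block-Manager's actual design.

```python
from collections import OrderedDict

class TieredKvStore:
    """Illustrative two-tier KV block store: a bounded 'GPU' tier with LRU
    eviction that offloads blocks to a 'host' tier instead of dropping them."""

    def __init__(self, gpu_capacity: int):
        self.gpu_capacity = gpu_capacity
        self.gpu: OrderedDict[str, bytes] = OrderedDict()  # hot tier
        self.host: dict[str, bytes] = {}                   # offload tier

    def put(self, block_id: str, block: bytes) -> None:
        self.gpu[block_id] = block
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_capacity:
            victim, data = self.gpu.popitem(last=False)  # evict LRU block
            self.host[victim] = data                     # offload, don't discard

    def get(self, block_id: str) -> bytes | None:
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)               # refresh recency
            return self.gpu[block_id]
        if block_id in self.host:                        # hit in offload tier:
            self.put(block_id, self.host.pop(block_id))  # promote back to GPU
            return self.gpu[block_id]
        return None                                      # miss: must recompute
```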