Dynamo Disaggregated Serving

Type: Concept Tags: NVIDIA, Dynamo, disaggregated serving, LLM inference, prefill, decode, KV cache, NIXL, vLLM, TensorRT-LLM Related: NVIDIA-Dynamo, Dynamo-KV-Cache-Aware-Routing, Dynamo-KV-Block-Manager, Dynamo-Planner, Dynamo-Profiler, NIXL, TensorRT-LLM, vLLM, NVIDIA-Grove, KAI-Scheduler, NVIDIA-CMX, NVIDIA-AIPerf Sources: https://docs.nvidia.com/dynamo/design-docs/disaggregated-serving, https://docs.nvidia.com/dynamo/latest/getting-started/introduction Last Updated: 2026-04-29

Summary

Dynamo Disaggregated Serving is NVIDIA Dynamo’s architecture pattern for separating LLM prefill and decode work onto specialized worker pools. It lets systems scale compute-heavy prefill and memory-bound decode independently, transfer KV cache between workers with NIXL, and combine the pattern with Dynamo-KV-Cache-Aware-Routing and Dynamo-KV-Block-Manager for higher throughput and lower time to first token.

Detail

Purpose

LLM requests have two very different phases. Prefill processes the prompt and creates KV cache, while decode generates output tokens while repeatedly reading KV cache. A single worker pool can underutilize hardware because prefill and decode stress different resources. Disaggregation lets operators assign separate prefill and decode capacity and schedule each phase around its bottleneck.

Current scope

Separate prefill and decode workers for LLM serving.
Prefill workers generate KV cache; decode workers continue token generation.
KV transfer between prefill and decode workers through NIXL and backend-specific transfer metadata.
Router orchestration through Dynamo prefill/decode routing flows.
Support for vLLM, SGLang, and TensorRT-LLM style backends in current Dynamo docs.
Runtime-reconfigurable prefill/decode worker counts so workers can be added or drained as traffic changes.
Works with KV routing, KV offloading, Planner autoscaling, and Grove topology-aware placement.

NVIDIA context

Disaggregated serving is one of the three system-level performance techniques NVIDIA highlights for Dynamo alongside KV cache-aware routing and KV cache offloading. It connects NVIDIA-Dynamo to NIXL, Dynamo-KV-Block-Manager, NVIDIA-Grove, KAI-Scheduler, TensorRT-LLM, and NVIDIA-CMX.

Connections

NVIDIA-Dynamo - umbrella inference framework that implements disaggregated serving.
Dynamo-KV-Cache-Aware-Routing - routes requests based on cache overlap and worker load across aggregated or disaggregated topologies.
Dynamo-KV-Block-Manager - memory-management layer for KV cache offload and reuse.
Dynamo-Planner - autoscaler that understands prefill/decode scaling constraints.
Dynamo-Profiler - produces performance data used to choose disaggregated serving configurations.
NIXL - data transfer layer for moving KV cache between workers and memory tiers.
NVIDIA-Grove and KAI-Scheduler - Kubernetes orchestration and placement for multi-component prefill/decode deployments.
TensorRT-LLM and vLLM - inference engines that can sit underneath Dynamo disaggregation.
NVIDIA-CMX - context memory platform aligned with KV-cache reuse and long-context inference.
NVIDIA-AIPerf - benchmark tool for validating latency/throughput improvements from disaggregated deployments.

Source Excerpts

NVIDIA Dynamo docs describe prefill and decode as phases with different compute and memory characteristics.
Current docs state that high-performance disaggregation depends on efficient KV transfer.

AIPS BOOM

Explorer

Dynamo-Disaggregated-Serving

Dynamo Disaggregated Serving

Summary

Detail

Purpose

Current scope

NVIDIA context

Connections

Source Excerpts

Resources

Graph View

Table of Contents

Backlinks