NIXL (NVIDIA Inference Xfer Library)
Type: Technology
Tags: CUDA, NVIDIA, GPU, Inference, Data Transfer, Networking, LLM, KV Cache, CUDA-X
Related: NVIDIA-Dynamo, Dynamo-Disaggregated-Serving, Dynamo-KV-Block-Manager, Dynamo-KV-Cache-Aware-Routing, Red-Hat-AI-Factory-with-NVIDIA, TensorRT-LLM, Triton-Inference-Server, GPU-Direct-Storage, GPUDirect-RDMA, DOCA-GPUNetIO, DOCA-RDMA, NCCL, NVSHMEM
Sources: NVIDIA official documentation, developer.nvidia.com/nixl, https://docs.nvidia.com/dynamo/design-docs/disaggregated-serving, https://docs.nvidia.com/dynamo/latest/components/kvbm, https://docs.nvidia.com/ai-enterprise/deployment/red-hat-ai-factory/latest/network-operator.html
Last Updated: 2026-04-29
Summary
NIXL (NVIDIA Inference Xfer Library) is a high-performance data transfer library purpose-built for GPU inference workloads, particularly large language model (LLM) serving. It provides a unified API for efficient data movement between GPUs, CPUs, storage, and network endpoints, optimized for the KV cache transfer patterns required in disaggregated LLM inference architectures (prefill-decode disaggregation). NIXL abstracts multiple transport backends (UCX, NVLink, InfiniBand, NVMe) behind a single interface to maximize inference throughput.
Detail
Purpose
NIXL addresses the data movement bottleneck in modern LLM serving, where prefill computation and decode computation can be disaggregated across different GPU pools for efficiency. Moving large KV (key-value) caches between prefill servers and decode servers, or between GPU memory and high-capacity CPU/NVMe storage, requires high-bandwidth, low-latency transfers that general-purpose communication libraries are not specifically optimized for.
Key Features
- Unified transfer API supporting multiple backends: UCX (RDMA/InfiniBand), NVLink, PCIe, NVMe-oF
- GPUDirect RDMA: NIC-to-GPU transfers bypassing the CPU for minimum latency
- KV cache-optimized transfer patterns for disaggregated LLM inference
- GPU-initiated transfers: GPU kernels can trigger data movement without CPU involvement
- Listed in current DOCA GPUNetIO documentation as an application that uses GPUNetIO for inference transfer paths
- Memory registration and buffer management for pooled transfer efficiency
- Asynchronous, stream-ordered operations for overlapping compute and communication (see the sketch after this list)
- Integration with TensorRT-LLM for KV cache disaggregation
- Integration with NVIDIA-Dynamo for disaggregated serving, with the Dynamo-KV-Block-Manager for tiered KV cache memory, and for KV cache movement between prefill and decode workers.
- NVIDIA’s Red-Hat-AI-Factory-with-NVIDIA guide references NIXL as the transfer library used by llm-d and Dynamo-oriented distributed inference paths.
- Support for CPU offload (GPU-to-DRAM) for extended KV cache capacity
- Support for NVMe offload via GPUDirect Storage for very large KV cache pools
- Support for multi-GPU and multi-node topologies
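The registration-plus-asynchronous-transfer flow behind several of the features above can be sketched against NIXL's Python bindings. This is a minimal, illustrative sketch: the agent and descriptor calls (nixl_agent, get_reg_descs, register_memory, get_agent_metadata) follow NIXL's published Python examples, but backend configuration options and exact signatures vary across releases, and the transfer calls shown in comments (initialize_xfer, transfer, check_xfer_state) should be confirmed against the installed version.

```python
# Illustrative NIXL agent/descriptor flow for a decode worker pulling KV cache
# blocks from a remote prefill worker (names follow NIXL's Python examples;
# exact signatures may differ between releases).
import torch
from nixl._api import nixl_agent, nixl_agent_config

# One agent per process; UCX provides RDMA/InfiniBand transport where available.
agent = nixl_agent("decode_worker_0", nixl_agent_config(backends=["UCX"]))

# Register the GPU buffers that will receive KV cache blocks so the backend can
# pin them and expose them for RDMA access.
kv_blocks = [torch.zeros(2, 32, 128, dtype=torch.float16, device="cuda") for _ in range(4)]
reg_descs = agent.get_reg_descs(kv_blocks)
agent.register_memory(reg_descs)

# Agent metadata (identity plus registered regions) is exchanged out of band,
# typically through the serving framework's control plane.
local_meta = agent.get_agent_metadata()
# remote_name = agent.add_remote_agent(prefill_meta)  # metadata received from the prefill worker

# Post an asynchronous READ of the remote KV blocks, then poll or overlap with
# compute until it completes (shown as comments because the exact signatures
# should be checked against the installed NIXL version):
# handle = agent.initialize_xfer("READ", agent.get_xfer_descs(kv_blocks),
#                                remote_xfer_descs, remote_name, b"kv_ready")
# agent.transfer(handle)
# while agent.check_xfer_state(handle) == "PROC":
#     pass  # decode compute can proceed here while the transfer is in flight

agent.deregister_memory(reg_descs)
```

The design point the sketch illustrates is that buffers are registered once and reused across many transfers, while transfers themselves are posted asynchronously so decode compute can overlap with in-flight KV cache movement.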
Use Cases
- Disaggregated LLM inference: separate prefill and decode GPU pools
- KV cache offload to CPU/NVMe for extended context length support (see the offload sketch after this list)
- High-throughput LLM serving with large batch sizes
- Multi-model serving with shared KV cache across model replicas
- GPU-to-GPU data streaming for inference pipeline stages
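For the CPU/NVMe offload use case, the same agent can register buffers in different memory tiers and move blocks between them. The sketch below, again hedged against NIXL's Python bindings, stages a KV block from GPU memory into pinned host DRAM; the memory tier is inferred from each tensor's device, and the local (same-agent) transfer target shown in the comment is an assumption rather than confirmed API behavior.

```python
# Hedged sketch: staging a cold KV cache block from GPU memory (VRAM) into host
# DRAM through one NIXL agent. Descriptor calls mirror NIXL's Python examples;
# the loopback transfer target in the comment is an assumption.
import torch
from nixl._api import nixl_agent, nixl_agent_config

agent = nixl_agent("kv_offload_worker", nixl_agent_config(backends=["UCX"]))

gpu_block = torch.randn(2, 32, 128, dtype=torch.float16, device="cuda")  # hot block in VRAM
host_block = torch.empty_like(gpu_block, device="cpu").pin_memory()      # DRAM staging buffer

# Register both tiers; the bindings infer VRAM vs. DRAM from each tensor's device.
gpu_descs = agent.get_reg_descs([gpu_block])
host_descs = agent.get_reg_descs([host_block])
agent.register_memory(gpu_descs)
agent.register_memory(host_descs)

# A WRITE from the GPU block into the DRAM block frees VRAM for active requests;
# the reverse READ restores the block when its sequence becomes active again.
# handle = agent.initialize_xfer("WRITE", agent.get_xfer_descs([gpu_block]),
#                                agent.get_xfer_descs([host_block]),
#                                "kv_offload_worker")  # local target: assumption
```

NVMe offload follows the same pattern with storage-backed descriptors and GPUDirect Storage underneath, with a KV block manager (e.g. Dynamo's KVBM) deciding which blocks move to which tier.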
Hardware Requirements
- NVIDIA GPU with CUDA Compute Capability 7.0+ (Volta minimum)
- H100/A100 with NVLink or InfiniBand for full bandwidth utilization
- RDMA-capable NICs (ConnectX series) for GPUDirect RDMA
- NVMe SSDs with GPUDirect Storage for NVMe offload
- CUDA 11.8 or higher
Language Bindings
- C++ (primary library API)
- Python bindings for integration testing and application use
Connections
- NVIDIA-Dynamo - Dynamo uses NIXL as a data transfer foundation for disaggregated serving and KV cache movement.
- Dynamo-Disaggregated-Serving - NIXL moves KV cache between prefill and decode workers.
- Dynamo-KV-Block-Manager - KVBM uses NIXL as the transfer/storage layer for KV cache tiers.
- Dynamo-KV-Cache-Aware-Routing - routing and cache locality rely on fast transfer and reuse paths.
- Red-Hat-AI-Factory-with-NVIDIA - OpenShift AI deployment guide that names NIXL in the distributed inference networking context.
- TensorRT-LLM - NIXL is the primary data transfer layer for TRT-LLM disaggregated inference.
- Triton-Inference-Server - Triton LLM serving uses NIXL for KV cache transfers in disaggregated deployments.
- GPU-Direct-Storage - NIXL leverages GDS for NVMe-to-GPU KV cache offload.
- GPUDirect-RDMA - NIXL uses GPU-aware network transfer paths for inter-node inference data movement.
- DOCA-GPUNetIO - NVIDIA's GPUNetIO docs list NIXL as an application using GPU networking functions.
- DOCA-RDMA - provides RDMA application primitives related to NIXL's UCX/RDMA backends.
- NCCL - NIXL complements NCCL; NCCL handles collective training communication, while NIXL handles point-to-point inference data movement.
- NVSHMEM - NVSHMEM provides symmetric memory for HPC workloads; NIXL targets inference-specific transfer patterns.