NVIDIA AI Cluster Runtime

Type: Tool Tags: NVIDIA, AI Cluster Runtime, AICR, Kubernetes, GPU Operator, validated stack, runtime recipe, AI cloud Related: NVIDIA-Cloud-Accelerator-NCX, NVIDIA-GPU-Operator, NVIDIA-Network-Operator, NVIDIA-Cloud-Native-Technologies, NVIDIA-Run-ai, NVIDIA-NVSentinel, NVIDIA-Project-GPUd Sources: https://docs.nvidia.com/ncx/index.html; https://github.com/NVIDIA/aicr Last Updated: 2026-04-29

Summary

NVIDIA AI Cluster Runtime (AICR) is an open NVIDIA project for reproducible GPU-accelerated Kubernetes cluster runtime recipes. NCX describes it as a canonical, continuously validated definition of the NVIDIA-accelerated Kubernetes runtime. The public repository explains that AICR captures known-good combinations of drivers, operators, kernels, and system configurations and publishes them as version-locked recipes for Helm, Argo CD, and other deployment frameworks.

Detail

AICR addresses a practical AI infrastructure problem: GPU Kubernetes clusters fail in subtle ways when kernel versions, drivers, container runtimes, Kubernetes versions, and operators drift out of validated combinations. AICR packages those combinations into recipes so operators can generate, query, and validate consistent deployments.

The public README describes AICR recipes as optimized for specific hardware, cloud, OS, and workload intent; validated through automated constraint and compatibility checks; and reproducible from the same inputs. Example workflows include taking a cluster snapshot, generating a recipe for service/accelerator/OS/intent/platform combinations, querying hydrated configuration values, and validating a recipe against a snapshot.

Connections

Source Excerpts

  • “AI Cluster Runtime provides a canonical, continuously validated definition.”
  • “captures known-good combinations of drivers, operators, kernels, and system configurations”