NVIDIA DGX Systems
Type: Platform Tags: NVIDIA, hardware, HPC, AI supercomputer, DGX, data center, training, infrastructure Related: NVIDIA-Base-Command, NVIDIA-Base-Command-Manager, NVIDIA-Bright-Cluster-Manager, NVIDIA-BaseOS, NVIDIA-DGX-Cloud, NVIDIA-DGX-SuperPOD, NVIDIA-DGX-BasePOD, NVIDIA-DGX-BasePOD-B200-H200-H100-RA, NVIDIA-DGX-B200, NVIDIA-DGX-SuperPOD-B200-RA, NVIDIA-GB200-NVL72, NVIDIA-DGX-SuperPOD-GB200-RA, NVIDIA-DGX-B300, NVIDIA-DGX-SuperPOD-B300-Spectrum-4-Ethernet-RA, NVIDIA-DGX-SuperPOD-B300-Quantum-X800-InfiniBand-RA, NVIDIA-DGX-Spark, NVIDIA-DGX-Station, NVIDIA-DGX-Quantum, NVIDIA-DGX-Enterprise-Support, NVIDIA-GB300-NVL72, NVIDIA-Certified-Systems, NVIDIA-Data-Center-CPUs, NVIDIA-Cloud-Accelerator-NCX, NVIDIA-Blackwell-Architecture, NVIDIA-Vera-Rubin, NVIDIA-Vera-Rubin-POD, NVIDIA-Hopper-Architecture, NVLink, NCCL, NVIDIA-MIG, NVIDIA-GPU-Operator, NVIDIA-Optimized-Frameworks, NVIDIA-Resiliency-Extension, NVIDIA-AI-Enterprise, NVIDIA-Enterprise-Licensing-Guide Sources: NVIDIA official documentation (live fetch attempted 2026-04-10; updated from https://www.nvidia.com/en-us/data-center/dgx-b200/, https://docs.nvidia.com/dgx-superpod/reference-architecture-scalable-infrastructure-b200/latest/index.html, https://docs.nvidia.com/dgx-superpod/reference-architecture-scalable-infrastructure-gb200/latest/index.html, https://www.nvidia.com/en-us/data-center/dgx-b300/, https://www.nvidia.com/en-us/data-center/gb300-nvl72/, https://docs.nvidia.com/dgx-superpod/reference-architecture/scalable-infrastructure-b300/latest/index.html, https://docs.nvidia.com/dgx-superpod/reference-architecture/scalable-infrastructure-b300-xdr/latest/index.html, https://www.nvidia.com/en-us/products/workstations/dgx-spark/, https://www.nvidia.com/en-us/products/workstations/dgx-station/, https://docs.nvidia.com/dgx-basepod/index.html, https://www.nvidia.com/en-us/data-center/dgx-support/, https://docs.nvidia.com/deeplearning/frameworks/index.html) Last Updated: 2026-05-09
Summary
NVIDIA DGX systems are purpose-built AI supercomputers and infrastructure platforms integrating NVIDIA GPUs, NVLink interconnects, high-bandwidth memory, networking, DGX OS/BaseOS, and NVIDIA AI software into validated systems for AI training, inference, and development. The DGX family now spans personal AI systems such as NVIDIA-DGX-Spark, deskside systems such as NVIDIA-DGX-Station, data center systems such as NVIDIA-DGX-B200 and NVIDIA-DGX-B300, rack-scale systems such as NVIDIA-GB200-NVL72 and NVIDIA-GB300-NVL72, enterprise reference architectures such as NVIDIA-DGX-BasePOD and NVIDIA-DGX-SuperPOD, and cloud delivery through NVIDIA-DGX-Cloud.
Detail
Purpose
Training large foundation models (LLMs, multi-modal models, scientific AI) at scale requires not just powerful GPUs, but tightly integrated GPU-to-GPU communication fabric, validated software stacks, and production-grade reliability. Assembling these components independently is complex and time-consuming. DGX systems provide a validated, out-of-the-box AI computing platform where all components (GPUs, NVLink, NVSwitch, InfiniBand, storage, software) are integrated, tested, and supported by NVIDIA — reducing time-to-training and operational risk.
Key Features
Current DGX Systems and platforms (as of 2026):
- DGX Spark: compact GB10 Grace Blackwell desktop AI computer for local model development, fine-tuning, inference, data science, edge prototyping, and local agent work
- DGX Station: GB300 Grace Blackwell Ultra deskside AI supercomputer with 748 GB coherent memory, NVLink-C2C, ConnectX-8 networking, MIG partitioning, and optional RTX PRO GPU support
- DGX H100: 8× H100 SXM5 (80 GB HBM3) GPUs; 640 GB total GPU memory; 4th-gen NVLink + NVSwitch for all-to-all 900 GB/s GPU bandwidth; 2× ConnectX-7 InfiniBand for multi-node scaling; 10 kW power
- DGX H200: 8× H200 SXM5 (141 GB HBM3e) GPUs; 1.1 TB total GPU memory — optimized for LLM inference and large-model training that benefits from bigger memory footprint
- DGX B200: 8× Blackwell GPUs; 1,440 GB total HBM3e memory; 14.4 TB/s aggregate NVLink bandwidth; ConnectX-7 networking and BlueField-3 DPUs; Blackwell DGX platform for AI factory develop-to-deploy pipelines
- DGX B300: current Blackwell Ultra DGX generation; see NVIDIA-DGX-B300 for the system and NVIDIA-GB300-NVL72 for rack-scale guidance
- GB200 NVL72: rack-scale system with 72 Blackwell GPUs and 36 Grace CPUs connected via NVLink 5; designed as a single, liquid-cooled AI supercomputer unit with 130 TB/s rack-scale NVLink bandwidth
- DGX Station A100/H100: earlier workstation-class generations for small-team or on-premises development
- DGX SuperPOD: Multi-rack clusters of DGX nodes connected via an InfiniBand NDR fabric; scales from tens to thousands of nodes; used for pre-training frontier models; an “AI data center in a box”
- DGX BasePOD: prescriptive DGX reference architecture for enterprise AI infrastructure below SuperPOD scale; current RA covers DGX B200, H200, and H100 with NDR400 InfiniBand
- DGX Quantum: DGX-branded quantum-classical computing architecture; NVIDIA's current product pages redirect toward NVIDIA-NVQLink
- DGX Cloud: NVIDIA-managed DGX infrastructure on Oracle Cloud, Azure, GCP, and AWS; per-node/per-hour rental of full DGX pods; includes NVIDIA AI Enterprise software
- DGX Enterprise Support: support, infrastructure services, and training layer for DGX systems, BasePOD, and SuperPOD
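The headline totals quoted in the list above follow from the per-GPU specs by simple multiplication. A quick sanity check in Python, using the per-GPU figures implied by the system totals above (80 GB HBM3/0.9 TB/s NVLink for H100, 141 GB HBM3e for H200, 180 GB HBM3e/1.8 TB/s NVLink 5 for B200):

```python
# Sanity-check the memory and NVLink totals quoted for each DGX system.
# Inputs are per-GPU figures; the system totals are simple multiplication.

def totals(gpus: int, hbm_gb: int, nvlink_tb_s: float):
    """Return (total GPU memory in GB, aggregate NVLink bandwidth in TB/s)."""
    return gpus * hbm_gb, round(gpus * nvlink_tb_s, 1)

# DGX H100: 8 x 80 GB HBM3, 0.9 TB/s NVLink per GPU
print(totals(8, 80, 0.9))    # (640, 7.2)
# DGX H200: 8 x 141 GB HBM3e -> 1128 GB, i.e. the ~1.1 TB quoted above
print(totals(8, 141, 0.9))   # (1128, 7.2)
# DGX B200: 8 x 180 GB HBM3e, 1.8 TB/s NVLink 5 per GPU
print(totals(8, 180, 1.8))   # (1440, 14.4)
# GB200 NVL72: 72 GPUs x 1.8 TB/s -> 129.6 TB/s, quoted as ~130 TB/s
print(round(72 * 1.8, 1))    # 129.6
```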
Key System Capabilities:
- NVLink/NVSwitch Fabric: All 8 GPUs in a DGX node are fully connected via NVSwitch, enabling any-to-any GPU communication at line rate — critical for tensor parallelism in LLM training
- NVIDIA AI Enterprise Bundle: Hopper-generation DGX systems ship with Base Command Manager (cluster management), NGC access, and NVIDIA AI Enterprise software as standard; per the NVIDIA-Enterprise-Licensing-Guide, Blackwell-generation DGX systems require separately purchased AI Enterprise licenses
- Validated Storage Integration: Certified with VAST Data, WekaFS, DDN EXAScaler, and NetApp for high-throughput model checkpoint storage
- Validated AI factory ecosystem: DGX deployments connect to NVIDIA-Certified-Systems, NVIDIA-Bright-Cluster-Manager, NVIDIA-Data-Center-CPUs, and NVIDIA-Cloud-Accelerator-NCX guidance for broader data center infrastructure.
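The NVLink/NVSwitch point above matters because collectives such as all-reduce (the operation behind gradient averaging in data-parallel training) saturate the GPU-to-GPU fabric. One classic implementation of all-reduce is the ring algorithm: in 2×(N−1) steps each rank sends one chunk to its neighbor, first accumulating partial sums (reduce-scatter), then circulating the results (all-gather). The sketch below models only the data movement in plain Python; it is an illustration of the algorithm family NCCL draws on, not NCCL itself.

```python
# Ring all-reduce sketch: N "ranks" each contribute a vector and all
# end up holding the element-wise sum.

def ring_allreduce(buffers):
    """Sum-all-reduce equal-length vectors (one per rank) over a ring."""
    n = len(buffers)
    data = [list(b) for b in buffers]
    csz = len(data[0]) // n            # one chunk per rank
    # Phase 1: reduce-scatter. After n-1 steps, rank r holds the fully
    # reduced chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n) for r in range(n)]
        msgs = [(r, c, data[r][c * csz:(c + 1) * csz]) for r, c in sends]
        for r, c, chunk in msgs:       # snapshot first: sends are simultaneous
            dst = (r + 1) % n
            for i, v in enumerate(chunk):
                data[dst][c * csz + i] += v
    # Phase 2: all-gather. Circulate each reduced chunk around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        msgs = [(r, c, data[r][c * csz:(c + 1) * csz]) for r, c in sends]
        for r, c, chunk in msgs:
            data[(r + 1) % n][c * csz:(c + 1) * csz] = chunk
    return data

# Four "ranks" each contribute a gradient vector; all end with the sum.
grads = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(ring_allreduce(grads)[0])   # [28, 32, 36, 40] on every rank
```

Each rank sends and receives the same volume per step, which is why ring collectives benefit directly from the symmetric any-to-any bandwidth an NVSwitch fabric provides.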
Use Cases
- Pre-training LLMs and multimodal foundation models at scale (GPT-4 class, Llama family, Nemotron)
- Large-scale scientific AI: climate modeling, molecular dynamics, drug discovery simulation
- High-throughput LLM inference serving at enterprise scale using DGX H200 or GB200 NVL72
- AI research labs requiring dense GPU compute without public cloud cost/latency concerns
- Enterprise “AI factory” deployment: dedicated on-premises AI infrastructure under DGX SuperPOD architecture
- Edge-to-cloud AI development: DGX Station for local development, DGX SuperPOD for production training
Hardware Requirements / Compatibility
- DGX H100: 2× Intel Xeon Platinum CPUs; 2 TB DDR5 RAM; 30 TB NVMe SSD; Ubuntu 22.04 + DGX OS
- DGX B200: 2× Intel Xeon Platinum CPUs; HBM3e GPU memory; NVLink 5 + NVSwitch 4
- Power: 10–14.3 kW per DGX node; requires 3-phase power; DGX B200 nodes are air-cooled, while GB200/GB300 NVL72 rack-scale systems require liquid cooling
- Networking: 8× ConnectX-7 (400 Gb/s InfiniBand NDR or 400GbE) network cards per node for inter-node scaling
- OS: DGX OS (Ubuntu-based, customized); Base Command Manager provides cluster provisioning and workload management (Slurm or Kubernetes) for SuperPOD
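The per-node figures above support quick capacity planning. A back-of-envelope sketch using the numbers quoted in this list (8× 400 Gb/s NICs, up to ~14.3 kW per node); these are illustrative defaults, not a substitute for a site survey:

```python
# Back-of-envelope planning from per-node DGX figures.

def internode_bandwidth_gb_s(nics: int = 8, gbps_per_nic: int = 400) -> float:
    """Aggregate inter-node bandwidth per node in gigabytes/s (8 bits/byte)."""
    return nics * gbps_per_nic / 8

def cluster_power_kw(nodes: int, kw_per_node: float = 14.3) -> float:
    """Nameplate power draw for a cluster of DGX nodes, in kW."""
    return round(nodes * kw_per_node, 1)

# 8 x 400 Gb/s = 3.2 Tb/s = 400 GB/s of inter-node bandwidth per node
print(internode_bandwidth_gb_s())   # 400.0
# Power budget for a 32-system scalable unit (SuperPOD building block)
print(cluster_power_kw(32))         # 457.6
```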
Language Bindings / APIs
- DGX is a hardware platform; software APIs are those of the installed frameworks:
- CUDA, cuDNN, NCCL — GPU programming and communication
- NGC CLI — container and model management, including NVIDIA-Optimized-Frameworks images
- Base Command CLI (ngc bc) — job scheduling and cluster management
- DCGM REST API — GPU health and telemetry
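Monitoring scripts typically consume the GPU health and telemetry exposed by the DCGM layer as structured data. A minimal sketch of that pattern in Python; the payload shape and field names here are illustrative assumptions, not the actual DCGM schema:

```python
# Parse a telemetry payload and flag unhealthy GPUs. The sample below
# is fabricated for illustration; real DCGM output has its own schema.
import json

SAMPLE = json.dumps({
    "gpus": [
        {"id": 0, "temp_c": 64, "power_w": 690, "ecc_errors": 0},
        {"id": 1, "temp_c": 91, "power_w": 701, "ecc_errors": 3},
    ]
})

def unhealthy_gpus(payload: str, max_temp_c: int = 85) -> list:
    """Return ids of GPUs over the temperature limit or reporting ECC errors."""
    gpus = json.loads(payload)["gpus"]
    return [g["id"] for g in gpus
            if g["temp_c"] > max_temp_c or g["ecc_errors"] > 0]

print(unhealthy_gpus(SAMPLE))   # [1]
```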
Connections
- NVIDIA-Base-Command — Base Command Platform is the MLOps software layer for DGX SuperPOD and DGX Cloud
- NVIDIA-Base-Command-Manager — Base Command Manager handles cluster management and infrastructure operations for AI data centers
- NVIDIA-Bright-Cluster-Manager — Bright Cluster Manager is the HPC/AI cluster-management lineage adjacent to Base Command Manager
- NVIDIA-BaseOS — BaseOS/DGX OS provides the validated operating system layer for DGX deployments
- NVIDIA-DGX-Cloud — cloud-accessible counterpart to on-prem DGX infrastructure
- NVIDIA-DGX-SuperPOD — scale-out DGX cluster architecture for AI factories and large training runs
- NVIDIA-DGX-BasePOD and NVIDIA-DGX-BasePOD-B200-H200-H100-RA — prescriptive enterprise DGX reference architecture for building AI infrastructure
- NVIDIA-DGX-B200 — Blackwell-generation DGX system and AI factory foundation
- NVIDIA-DGX-SuperPOD-B200-RA — DGX B200 SuperPOD reference architecture with 32-system scalable units
- NVIDIA-GB200-NVL72 and NVIDIA-DGX-SuperPOD-GB200-RA — rack-scale Grace Blackwell SuperPOD architecture
- NVIDIA-DGX-B300 — current Blackwell Ultra DGX system page
- NVIDIA-DGX-SuperPOD-B300-Spectrum-4-Ethernet-RA and NVIDIA-DGX-SuperPOD-B300-Quantum-X800-InfiniBand-RA — current DGX B300 SuperPOD reference architecture documents
- NVIDIA-DGX-Spark — compact personal Grace Blackwell AI computer for local development and agent work
- NVIDIA-DGX-Station — GB300 Grace Blackwell Ultra deskside AI supercomputer
- NVIDIA-DGX-Quantum — DGX Quantum architecture page, which now points toward NVQLink
- NVIDIA-DGX-Enterprise-Support — support and services layer for DGX systems, BasePOD, and SuperPOD
- NVIDIA-GB300-NVL72 — rack-scale Blackwell Ultra NVL72 system adjacent to DGX B300 deployments
- NVIDIA-Certified-Systems — certified partner systems extend validated NVIDIA infrastructure beyond DGX-branded platforms
- NVIDIA-Data-Center-CPUs — NVIDIA data center CPUs pair Grace with GPU systems in GH200, GB200, and rack-scale AI factory designs
- NVIDIA-Cloud-Accelerator-NCX — NCX describes cloud partner accelerator infrastructure for NVIDIA AI workloads
- NVIDIA-Blackwell-Architecture — DGX B200 and GB200 NVL72 are the flagship DGX systems for Blackwell architecture
- NVIDIA-Vera-Rubin and NVIDIA-Vera-Rubin-POD — next-generation platform and POD-scale AI factory architecture after Blackwell
- NVIDIA-Hopper-Architecture — DGX H100 and H200 are the Hopper-generation DGX systems
- NVLink — NVLink/NVSwitch fabric is the defining interconnect technology within every DGX node
- NCCL — NCCL handles GPU-to-GPU communication for distributed training across DGX nodes
- NVIDIA-MIG — partitions supported DGX GPUs for isolated multi-tenant workloads
- NVIDIA-GPU-Operator — GPU Operator provisions GPU drivers and runtime components within Kubernetes clusters on DGX SuperPOD nodes
- NVIDIA-Optimized-Frameworks — DGX systems commonly run NVIDIA deep learning framework containers from NGC
- NVIDIA-Resiliency-Extension — long-running DGX training jobs can use NVRx for job-level fault tolerance and restart patterns
- NVIDIA-AI-Enterprise — AI Enterprise software included with DGX systems for production AI workloads
- NVIDIA-Enterprise-Licensing-Guide — explains DGX software bundle treatment for Hopper systems and separate AI Enterprise licensing for Blackwell DGX systems