NVIDIA NIM (Inference Microservices)

Type: Platform Tags: NVIDIA, inference, microservices, LLM, AI, REST API, containers, production deployment, OpenAI-compatible Related: NIM-for-Large-Language-Models, NIM-for-LLM-Benchmarking-Guide, NVIDIA-AIPerf, NVIDIA-GenAI-Perf, NVIDIA-NIM-Operator, NVIDIA-NIM-on-GKE, NVIDIA-NIM-on-WSL2, Red-Hat-AI-Factory-with-NVIDIA, NeMo-Retriever-Embedding-NIM, Llama-Nemotron-Embed-1B-v2, Llama-Nemotron-Embed-VL-1B-v2, NeMo-Retriever-Reranking-NIM, Llama-Nemotron-Rerank-1B-v2, Llama-Nemotron-Rerank-VL-1B-v2, NIM-for-Image-OCR, NIM-for-Object-Detection, NIM-for-NV-CLIP, NIM-for-Cosmos-WFM, NIM-for-Cosmos-Embed1, NIM-for-Earth-2-CorrDiff, NIM-for-Earth-2-FourCastNet, NIM-for-DoMINO-Automotive-Aero, NIM-for-Vision-Language-Models, Ising-Calibration-1-35B-A3B, Nemotron-3-Nano, Nemotron-3-Super, Nemotron-3-Nano-Omni, Nemotron-Parse, NIM-for-Visual-Generative-AI, NVIDIA-Speech-NIM-Microservices, NVIDIA-ASR-NIM, Nemotron-ASR-Streaming, Nemotron-3-VoiceChat, NVIDIA-TTS-NIM, NVIDIA-NMT-NIM, NVIDIA-Background-Noise-Removal-NIM, NIM-for-Maxine-Studio-Voice, NIM-for-Maxine-Audio2Face-2D, NIM-for-Maxine-Eye-Contact, NIM-for-Maxine-Active-Speaker-Detection, NIM-for-Audio2Face-3D, NVIDIA-NemoGuard-NIMs, Nemotron-3-Content-Safety, Nemotron-Content-Safety-Reasoning-4B-Experimental-NIM, Llama-3.1-Nemotron-Safety-Guard-8B-NIM, Llama-3.1-NemoGuard-8B-TopicControl-NIM, Llama-3.1-NemoGuard-8B-ContentSafety-NIM, NVIDIA-NemoGuard-JailbreakDetect-NIM, NIM-for-Multimodal-Safety, NIM-for-MAISI, NIM-for-VISTA-3D, NIM-for-AlphaFold2, NIM-for-AlphaFold2-Multimer, NIM-for-OpenFold2, NIM-for-OpenFold3, NIM-for-Boltz2, NIM-for-Evo-2, NIM-for-MSA-Search, NIM-for-ProteinMPNN, NIM-for-RFdiffusion, NIM-for-MolMIM, NIM-for-GenMol, NIM-for-DiffDock, NIM-for-ALCHEMI-Batched-Geometry-Relaxation, NIM-for-ALCHEMI-Batched-Molecular-Dynamics, NVIDIA-AI-Enterprise, NGC, NVIDIA-NGC-Catalog, NVIDIA-AI-Blueprints, NVIDIA-RAG-Blueprint, NVIDIA-AI-Q-Blueprint, NVIDIA-Data-Flywheel-Blueprint, 
NVIDIA-Video-Search-and-Summarization-Blueprint, NVIDIA-Tokkio-Digital-Human-Blueprint, NVIDIA-AI-Data-Platform, NVIDIA-API-Documentation, LLM-Inference-Quick-Start-Recipes, NVIDIA-Brev, NVIDIA-Cloud-Accelerator-NCX, TensorRT-LLM, Triton-Inference-Server, NVIDIA-NeMo, NeMo-Platform, NeMo-Data-Designer, NeMo-Customizer, NeMo-Evaluator, NeMo-Auditor, NeMo-AutoModel, NeMo-RL, NeMo-Megatron-Bridge, NeMo-Export-Deploy, NeMo-Retriever, NVIDIA-BioNeMo, NVIDIA-Cosmos, Earth-2, PhysicsNeMo, NVLM, NVIDIA-Riva, NVIDIA-Maxine, NVIDIA-ACE, NVIDIA-Clara, NVIDIA-MONAI-Toolkit, NVIDIA-Dynamo, NVIDIA-CMX, NIXL, Nemotron, Nsight-Copilot, ComputeEval Sources: https://docs.nvidia.com/nim/index.html, https://docs.nvidia.com/nim-operator/latest/index.html, https://docs.nvidia.com/nim/large-language-models/latest/about-nim-llm/overview.html, https://docs.nvidia.com/nim/benchmarking/llm/latest/overview.html, https://docs.nvidia.com/nim/benchmarking/llm/latest/step-by-step.html, https://docs.nvidia.com/aiperf/welcome-to-ai-perf-documentation, https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/overview.html, https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/overview.html, https://docs.nvidia.com/nim/ingestion/image-ocr/latest/overview.html, https://docs.nvidia.com/nim/ingestion/object-detection/latest/overview.html, https://docs.nvidia.com/nim/nvclip/latest/introduction.html, https://docs.nvidia.com/nim/cosmos/latest/introduction.html, https://docs.nvidia.com/nim/cosmos-embed1/latest/introduction.html, https://docs.nvidia.com/nim/earth-2/corrdiff/latest/overview.html, https://docs.nvidia.com/nim/earth-2/fourcastnet/latest/overview.html, https://docs.nvidia.com/nim/physicsnemo/domino-automotive-aero/latest/overview.html, https://docs.nvidia.com/nim/vision-language-models/latest/introduction.html, https://docs.nvidia.com/nim/vision-language-models/latest/release-notes.html, 
https://docs.nvidia.com/nim/vision-language-models/latest/examples/nemotron-3-nano-omni-30b-a3b-reasoning/api.html, https://docs.nvidia.com/nim/vision-language-models/latest/examples/nemotron-parse/api.html, https://docs.nvidia.com/nim/visual-genai/latest/overview.html, https://docs.nvidia.com/nim/speech/latest/index.html, https://docs.nvidia.com/nim/speech/latest/asr/index.html, https://docs.nvidia.com/nim/speech/latest/tts/index.html, https://docs.nvidia.com/nim/speech/latest/nmt/index.html, https://docs.nvidia.com/nim/maxine/bnr/latest/overview.html, https://docs.nvidia.com/nim/maxine/studio-voice/latest/overview.html, https://docs.nvidia.com/nim/maxine/audio2face-2d/latest/overview.html, https://docs.nvidia.com/nim/maxine/eye-contact/latest/overview.html, https://docs.nvidia.com/nim/maxine/active-speaker-detection/latest/overview.html, https://docs.nvidia.com/nim/digital-human/a2f-3d/latest/index.html, https://docs.nvidia.com/rag/latest/, https://docs.nvidia.com/vss/latest/, https://docs.nvidia.com/ace/tokkio/latest/overview/overview.html, https://docs.nvidia.com/nim/llama-3-1-nemotron-safety-guard-8b/latest/index.html, https://docs.nvidia.com/nim/llama-3-1-nemoguard-8b-topiccontrol/latest/index.html, https://docs.nvidia.com/nim/llama-3-1-nemoguard-8b-contentsafety/latest/index.html, https://docs.nvidia.com/nim/nemoguard-jailbreakdetect/latest/index.html, https://docs.nvidia.com/nim/multimodal-safety/latest/overview.html, https://docs.nvidia.com/nim/medical/maisi/latest/overview.html, https://docs.nvidia.com/nim/medical/vista3d/latest/overview.html, https://docs.nvidia.com/nim/bionemo/alphafold2/latest/overview.html, https://docs.nvidia.com/nim/bionemo/openfold3/latest/overview.html, https://docs.nvidia.com/nim/bionemo/boltz2/latest/overview.html, https://docs.nvidia.com/nim/bionemo/evo2/latest/overview.html, https://docs.nvidia.com/nim/bionemo/msa-search/latest/overview.html, https://docs.nvidia.com/nim/bionemo/proteinmpnn/latest/overview.html, 
https://docs.nvidia.com/nim/bionemo/rfdiffusion/latest/overview.html, https://docs.nvidia.com/nim/bionemo/molmim/latest/overview.html, https://docs.nvidia.com/nim/bionemo/genmol/latest/overview.html, https://docs.nvidia.com/nim/bionemo/diffdock/latest/overview.html, https://docs.nvidia.com/nim/alchemi/alchemi-bgr/latest/overview.html, https://docs.nvidia.com/nim/alchemi/alchemi-bmd/latest/overview.html, https://build.nvidia.com/models, https://build.nvidia.com/blueprints, https://build.nvidia.com/nvidia/nemotron-3-super-120b-a12b/modelcard, https://build.nvidia.com/nvidia/nemotron-3-nano-30b-a3b/modelcard, https://docs.nvidia.com/nemo/microservices/latest/index.html, https://docs.nvidia.com/nemo/microservices/latest/data-designer/index.html, https://docs.nvidia.com/nemo/microservices/latest/customizer/index.html, https://docs.nvidia.com/nemo/microservices/latest/evaluator/index.html, https://docs.nvidia.com/nemo/microservices/latest/audit/index.html, https://docs.nvidia.com/nemo/automodel/latest/index.html, https://docs.nvidia.com/nemo/rl/latest/about/overview.html, https://docs.nvidia.com/nemo/megatron-bridge/latest/index.html, https://docs.nvidia.com/nemo/export-deploy/latest/index.html, https://docs.nvidia.com/ai-enterprise/deployment/red-hat-ai-factory/latest/deploy-ai-workloads-nim-operator.html Last Updated: 2026-04-29

Summary

NVIDIA NIM (NVIDIA Inference Microservices) is NVIDIA’s containerized inference microservice layer for deploying foundation models across cloud, data center, and self-hosted GPU infrastructure. Current NVIDIA docs position NIM as part of NVIDIA-AI-Enterprise: production-grade runtimes, ongoing security updates, hosted API access via build.nvidia.com, and integration with the broader NVIDIA-NeMo agent lifecycle stack.

Detail

Purpose

NIM packages model-specific inference runtimes, APIs, containers, and deployment guidance so teams can move from model selection to production serving without rebuilding the entire inference stack. The same model capability may appear as a hosted build.nvidia.com API, a downloadable NIM, an NGC artifact, or a Kubernetes deployment.
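For LLM NIMs, the served API is OpenAI-compatible, so the same request body works against a hosted build.nvidia.com endpoint or a self-hosted container. A minimal sketch of assembling such a request; the base URL, model ID, and `build_chat_request` helper are illustrative assumptions, not part of any NIM SDK:

```python
import json

# Assumed local endpoint for a self-hosted NIM container; a hosted
# build.nvidia.com endpoint would use a different base URL and an API key.
NIM_BASE_URL = "http://localhost:8000/v1"

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completions payload for a NIM endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,
    }

# Model ID is illustrative; use the ID the deployed NIM actually serves.
payload = build_chat_request("meta/llama-3.1-8b-instruct",
                             "Summarize NIM in one sentence.")

# To send it against a running NIM (requires the `requests` package):
#   requests.post(f"{NIM_BASE_URL}/chat/completions", json=payload, timeout=60)
print(json.dumps(payload, indent=2))
```

Because the schema matches the OpenAI Chat Completions API, existing OpenAI client libraries can be pointed at a NIM by overriding the base URL.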

Current scope

Representative model families

NVIDIA context

NIM is the practical deployment boundary between NVIDIA’s model catalog and production applications. It links model development in NVIDIA-NeMo, inference optimization in TensorRT-LLM, serving in Triton-Inference-Server, catalog distribution in NGC, and enterprise support in NVIDIA-AI-Enterprise.
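As a concrete sketch of that boundary, the common self-hosted path pulls a NIM container from NGC and serves it on a GPU host. The image tag, cache path, and port below are representative assumptions; consult the specific microservice’s documentation for its actual values:

```shell
# Authenticate to NGC, then run an LLM NIM container (image tag illustrative).
export NGC_API_KEY=<your-key>
echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin

docker run --rm --gpus all \
  -e NGC_API_KEY \
  -v "$HOME/.cache/nim:/opt/nim/.cache" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# Once the service reports healthy, the OpenAI-compatible API is available:
curl -s http://localhost:8000/v1/models
```

The same model is also reachable without any local deployment through the hosted build.nvidia.com API, which is how the catalog-to-production handoff typically starts.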

Connections

Source Excerpts

  • NVIDIA NIM docs describe NIM microservices as part of NVIDIA AI Enterprise for deploying foundation models on cloud or data center infrastructure.
  • build.nvidia.com lists NVIDIA-published models, downloadable artifacts, free endpoints, and NIM API experiences.

Resources