NVLM (Vision-Language Model)

Type: Model Tags: NVIDIA, VLM, Vision-Language, Multimodal, LLM, Computer Vision Related: NVIDIA-NeMo, Nemotron, TensorRT-LLM, NIM-for-Vision-Language-Models, NIM-for-NV-CLIP, NVIDIA-NIM, NVIDIA-EAGLE Sources: NVIDIA official documentation, https://docs.nvidia.com/nim/vision-language-models/latest/introduction.html, https://docs.nvidia.com/nim/nvclip/latest/introduction.html Last Updated: 2026-04-29

Summary

NVLM 1.0 is NVIDIA’s family of open-source frontier-class multimodal large language models (MLLMs) that achieve performance competitive with GPT-4V and Claude 3.5 Sonnet on vision-language benchmarks. NVLM introduces a novel dual-path architecture (NVLM-D) combining decoder-only cross-attention and an NVLM-H hybrid design, allowing the model to excel at both image understanding and text-only tasks without the regression typical in multimodal fine-tuning. Released in September 2024, NVLM-D 72B is available on Hugging Face.

Detail

Purpose

Most multimodal LLMs suffer from a trade-off: adding vision capability degrades text-only performance. NVLM addresses this by designing an architecture that preserves LLM text capabilities while adding high-quality visual understanding, enabling a single model to replace both a specialist LLM and a separate vision model in enterprise pipelines.

Key Features

NVLM-D (decoder-only): cross-attention between visual tokens and text decoder; strong OCR, document understanding
NVLM-H (hybrid): combines cross-attention and interleaved visual tokens; best of both architectures
NVLM-X (cross-attention only): efficient variant for deployment
Model sizes: NVLM-D 72B (primary), additional sizes planned
Dynamic high-resolution image tiling: handles arbitrary aspect ratios and resolutions
Strong benchmark performance: top results on MMBench, OCRBench, MathVista, DocVQA, AI2D
Text-only performance preserved: scores competitive with Qwen-72B on pure NLP tasks
Open weights under NVIDIA Open Model License

Use Cases

Document understanding and OCR (invoices, contracts, forms, PDFs)
Visual question answering for enterprise data
Chart and table interpretation
Scientific paper and figure analysis
Image-grounded customer support and AI assistants
Multimodal RAG (retrieval-augmented generation) systems

Hardware Requirements / Compatibility

NVLM-D 72B: multi-GPU (4x or 8x H100/A100 80GB) with tensor parallelism
Smaller variants: single A100 80GB
TensorRT-LLM optimization for optimized inference
Available through NIM-style production deployment paths; see NIM-for-Vision-Language-Models for the current NVIDIA VLM serving docs surface

Language Bindings / APIs

Python (Hugging Face Transformers, NeMo framework)
NVIDIA NIM REST API
vLLM backend support
Available on Hugging Face (nvidia/NVLM-D-72B)

Connections

NVIDIA-NeMo — NeMo framework used for NVLM training and fine-tuning
Nemotron — shares LLM backbone lineage with Nemotron model family
TensorRT-LLM — NVLM inference optimized via TensorRT-LLM
NVIDIA-EAGLE — EAGLE is another NVIDIA VLM; NVLM focuses on frontier scale
NIM-for-Vision-Language-Models - current NVIDIA NIM docs surface for VLM deployment, OpenAI-compatible APIs, observability, and model support.
NIM-for-NV-CLIP - multimodal embedding NIM for image/text retrieval workflows that complement VLM reasoning.
NVIDIA-NIM — umbrella inference microservices layer for deploying VLMs

AIPS BOOM

Explorer

NVLM

NVLM (Vision-Language Model)

Summary

Detail

Purpose

Key Features

Use Cases

Hardware Requirements / Compatibility

Language Bindings / APIs

Connections

Resources

Graph View

Table of Contents

Backlinks