NVLM (Vision-Language Model)
Type: Model Tags: NVIDIA, VLM, Vision-Language, Multimodal, LLM, Computer Vision Related: NVIDIA-NeMo, Nemotron, TensorRT-LLM, NIM-for-Vision-Language-Models, NIM-for-NV-CLIP, NVIDIA-NIM, NVIDIA-EAGLE Sources: NVIDIA official documentation, https://docs.nvidia.com/nim/vision-language-models/latest/introduction.html, https://docs.nvidia.com/nim/nvclip/latest/introduction.html Last Updated: 2026-04-29
Summary
NVLM 1.0 is NVIDIA’s family of open frontier-class multimodal large language models (MLLMs) that achieve performance competitive with leading proprietary models such as GPT-4o and Claude 3.5 Sonnet on vision-language benchmarks. NVLM comes in three architectural variants — NVLM-D (decoder-only), NVLM-X (cross-attention), and NVLM-H (a hybrid of the two) — designed to add high-quality visual understanding without the text-only regression typical of multimodal fine-tuning. Released in September 2024, NVLM-D 72B is available on Hugging Face.
Detail
Purpose
Most multimodal LLMs suffer from a trade-off: adding vision capability degrades text-only performance. NVLM addresses this by designing an architecture that preserves LLM text capabilities while adding high-quality visual understanding, enabling a single model to replace both a specialist LLM and a separate vision model in enterprise pipelines.
Key Features
- NVLM-D (decoder-only): image tokens are concatenated with text tokens and processed by the LLM's self-attention layers; strongest at OCR and unified multimodal reasoning
- NVLM-X (cross-attention): gated cross-attention layers attend to image tokens instead of unrolling them into the decoder sequence; more compute-efficient for high-resolution images
- NVLM-H (hybrid): feeds thumbnail tokens through the decoder while cross-attending to high-resolution tiles; combines the reasoning strength of NVLM-D with the efficiency of NVLM-X
- Model sizes: NVLM-D 72B (primary), additional sizes planned
- Dynamic high-resolution image tiling: handles arbitrary aspect ratios and resolutions (see the sketch after this list)
- Strong benchmark performance: top results on MMBench, OCRBench, MathVista, DocVQA, AI2D
- Text-only performance preserved: matches or improves on its Qwen2-72B-Instruct backbone on pure NLP tasks, notably math and coding benchmarks
- Open weights released on Hugging Face under CC BY-NC 4.0
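The tiling idea can be pictured in a few lines: pick the tile grid whose aspect ratio best matches the input image, crop fixed-size tiles, and append a downscaled thumbnail for global context. The sketch below is a minimal illustration of that scheme, not the released preprocessing code; the 448-pixel tile size, the tile budget, and the grid-selection heuristic are assumptions.

```python
# Minimal sketch of dynamic high-resolution tiling (assumed parameters).
from PIL import Image

TILE = 448       # assumed tile edge length
MAX_TILES = 6    # assumed per-image tile budget

def pick_grid(w: int, h: int, max_tiles: int = MAX_TILES) -> tuple[int, int]:
    """Choose the (cols, rows) grid whose aspect ratio best matches the image."""
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows > max_tiles:
                continue
            diff = abs((cols / rows) - (w / h))
            if diff < best_diff:
                best, best_diff = (cols, rows), diff
    return best

def tile_image(img: Image.Image) -> list[Image.Image]:
    cols, rows = pick_grid(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows) for c in range(cols)
    ]
    # A global thumbnail gives the model coarse full-image context.
    tiles.append(img.resize((TILE, TILE)))
    return tiles
```

In the paper, each tile is encoded separately by the vision encoder, and text-based 1-D tile tags tell the decoder how the tiles are laid out.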
Use Cases
- Document understanding and OCR (invoices, contracts, forms, PDFs)
- Visual question answering for enterprise data
- Chart and table interpretation
- Scientific paper and figure analysis
- Image-grounded customer support and AI assistants
- Multimodal RAG (retrieval-augmented generation) systems
Hardware Requirements / Compatibility
- NVLM-D 72B: multi-GPU (4x or 8x H100/A100 80GB) with tensor parallelism (see the loading sketch after this list)
- Smaller variants: single A100 80GB
- TensorRT-LLM support for optimized inference
- Deployable through NIM-style production paths; see NIM-for-Vision-Language-Models for the current NVIDIA VLM serving documentation
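A minimal multi-GPU loading sketch with Hugging Face Transformers, assuming enough aggregate GPU memory to shard the bfloat16 checkpoint; the model card demonstrates a hand-built device map, and `device_map="auto"` below is a simplification of that.

```python
# Sketch: shard NVLM-D-72B across all visible GPUs with Transformers.
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "nvidia/NVLM-D-72B",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,  # the checkpoint ships custom modeling code
    device_map="auto",       # let Accelerate place layers across GPUs
).eval()

tokenizer = AutoTokenizer.from_pretrained(
    "nvidia/NVLM-D-72B", trust_remote_code=True
)
```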
Language Bindings / APIs
- Python (Hugging Face Transformers, NeMo framework)
- NVIDIA NIM REST API (OpenAI-compatible; example after this list)
- vLLM backend support
- Available on Hugging Face (nvidia/NVLM-D-72B)
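For the NIM path, requests go through an OpenAI-compatible endpoint. A hedged sketch, assuming a NIM container running locally on port 8000; the model name is a placeholder, so query the real one with `client.models.list()`.

```python
# Sketch: query a NIM-deployed VLM via its OpenAI-compatible chat API.
# The base URL and model name are deployment-specific assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="nvidia/nvlm-d-72b",  # placeholder; check client.models.list()
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the total on this invoice?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/invoice.png"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```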
Connections
- NVIDIA-NeMo — NeMo framework used for NVLM training and fine-tuning
- Nemotron — related NVIDIA LLM model family (NVLM-D 72B itself builds on a Qwen2-72B-Instruct text backbone)
- TensorRT-LLM — NVLM inference optimized via TensorRT-LLM
- NVIDIA-EAGLE — EAGLE is another NVIDIA VLM; NVLM focuses on frontier scale
- NIM-for-Vision-Language-Models — current NVIDIA NIM docs surface for VLM deployment, OpenAI-compatible APIs, observability, and model support
- NIM-for-NV-CLIP — multimodal embedding NIM for image/text retrieval workflows that complement VLM reasoning
- NVIDIA-NIM — umbrella inference microservices layer for deploying VLMs