Llama Nemotron Rerank VL 1B v2

Type: NVIDIA model Tags: NVIDIA, Llama Nemotron, NeMo Retriever, reranking, multimodal retrieval, VLM, RAG, document retrieval, TensorRT, Build NVIDIA Related: NeMo-Retriever, NeMo-Retriever-Reranking-NIM, NeMo-Retriever-Embedding-NIM, Llama-Nemotron-Embed-1B-v2, Llama-Nemotron-Rerank-1B-v2, Llama-Nemotron-Embed-VL-1B-v2, NVIDIA-NIM, NIM-for-Vision-Language-Models, NIM-for-NV-CLIP, NVIDIA-EAGLE, Nemotron, Nemotron-Parse, NVIDIA-RAG-Blueprint, NVIDIA-AI-Q-Blueprint, NVIDIA-AI-Data-Platform, TensorRT, Triton-Inference-Server Sources: https://build.nvidia.com/nvidia/llama-nemotron-rerank-vl-1b-v2/modelcard, https://build.nvidia.com/nvidia/llama-nemotron-rerank-vl-1b-v2, https://huggingface.co/nvidia/llama-nemotron-rerank-vl-1b-v2 Last Updated: 2026-04-29

Summary

Llama Nemotron Rerank VL 1B v2 is an NVIDIA multimodal reranking model for question-answer retrieval over document pages, slides, images, text, and image+text inputs. It scores query/document pairs with raw logits so retrieval pipelines can reorder candidate pages before generation.

Detail

The Build NVIDIA model card describes the model as a multimodal question-answering retrieval reranker. It is designed for vision RAG and document retrieval pipelines where an embedding model first returns candidate pages and the reranker then performs deeper query/document interaction to produce a final ranking.

The model can process document pages as images, text, or image+text combinations. NVIDIA calls out screenshots of document pages or slides as expected image inputs, including pages with text, tables, charts, and infographics. This makes it adjacent to Nemotron-Parse, NIM-for-Image-OCR, and document-ingestion workflows that convert enterprise PDFs and slides into retrievable visual/text context.

Architecturally, the model card describes a transformer cross-encoder with about 1.7B parameters. It uses an Eagle VLM architecture with a SigLIP 2 400M vision encoder and the Llama-Nemotron-Rerank-1B-v2 language model. The model card also notes NVIDIA Eagle 2 tiling ideas and Nemoretriever-Parse as related innovations.

Inputs and Outputs

  • Inputs: image and text, including image-only, text-only, or image+text candidate documents paired with a text query.
  • Image expectations: RGB document/page/slide screenshots, including visual documents with tables, charts, and infographics.
  • Context behavior: fine-tuned with max_input_tiles=4 and 2048-token context; evaluated with max_input_tiles=6 and up to 10240-token context for image+text.
  • Outputs: list of float logits, one per query/document pair; callers may convert logits into probabilities with sigmoid if desired.

Deployment and Evaluation

The model card lists TensorRT as the runtime engine, Linux as the supported OS, and NVIDIA Ampere, Hopper, Lovelace, and Blackwell as supported hardware microarchitectures. It is also usable from Hugging Face Transformers with transformers>=4.56.0 and optional FlashAttention.

NVIDIA reports evaluation on ViDoRe V1, V2, V3 and internal visual retrieval datasets, plus text retrieval benchmarks such as BEIR, MIRACL, MLQA, and MLDR. The model card evaluates it in an embedding-plus-reranking pipeline with the companion Llama-Nemotron-Embed-VL-1B-v2 embedding model.

Connections

Resources