Llama Nemotron Rerank 1B v2

Type: NVIDIA model Tags: NVIDIA, Llama Nemotron, NeMo Retriever, reranking, text retrieval, multilingual retrieval, cross-lingual retrieval, RAG, TensorRT, Build NVIDIA Related: NeMo-Retriever, NeMo-Retriever-Reranking-NIM, Llama-Nemotron-Embed-1B-v2, Llama-Nemotron-Embed-VL-1B-v2, Llama-Nemotron-Rerank-VL-1B-v2, NeMo-Retriever-Embedding-NIM, NVIDIA-NIM, Nemotron, NVIDIA-RAG-Blueprint, NVIDIA-AI-Q-Blueprint, NVIDIA-AI-Data-Platform, TensorRT, Triton-Inference-Server Sources: https://build.nvidia.com/nvidia/llama-nemotron-rerank-1b-v2/modelcard, https://build.nvidia.com/nvidia/llama-nemotron-rerank-1b-v2, https://huggingface.co/nvidia/llama-nemotron-rerank-1b-v2 Last Updated: 2026-04-29

Summary

Llama Nemotron Rerank 1B v2 is an NVIDIA text reranking model for multilingual and cross-lingual question-answer retrieval. It consumes query/passage pairs and returns relevance logits so retrieval pipelines can reorder candidates from dense embeddings, lexical search, or hybrid retrieval before generation.

Detail

The Build NVIDIA model card describes llama-nemotron-rerank-1b-v2 as optimized to score how relevant a document or passage is to a query. It supports long text up to 8192 tokens and was evaluated across the same 26-language multilingual/cross-lingual retrieval scope as Llama-Nemotron-Embed-1B-v2.

Architecturally, the model card describes a transformer encoder fine-tuned from meta-llama/Llama-3.2-1B. Unlike a bi-encoder embedding model, this reranker acts as a cross-encoder over query/passage pairs, letting the model use token-level interaction before producing a score. NVIDIA positions it as a candidate-ordering stage rather than a model to run over an entire corpus.

In an NVIDIA retrieval pipeline, Llama-Nemotron-Embed-1B-v2 or another dense/sparse retriever returns candidate passages first. Llama Nemotron Rerank 1B v2 then rescales and reorders those candidates, improving the context that downstream NVIDIA-NIM LLMs or agent workflows receive.

Inputs and Outputs

  • Inputs: pairs of texts, typically a query and candidate passage.
  • Retrieval role: rerank candidate passages after embedding, sparse, or hybrid retrieval.
  • Context behavior: evaluated up to 8192 tokens; longer passages should be chunked or truncated.
  • Outputs: float logits/scores, one score per passage; callers may apply sigmoid if they need probabilities.

Deployment and Evaluation

The Build NVIDIA model card lists TensorRT as the runtime/acceleration engine, Linux as the operating system, and Ampere, Hopper, and Lovelace as supported NVIDIA hardware microarchitectures. The Hugging Face model card also shows Transformers and vLLM paths, including a /rerank serving pattern with a prompt template that formats query/document pairs.

NVIDIA reports evaluation as part of a pipeline with an embedding retrieval model across NQ, HotpotQA, FiQA, TechQA, MIRACL multilingual retrieval, MLQA cross-lingual retrieval, and MLDR long-document retrieval. The published results compare embedding-only retrieval with embedding-plus-reranking pipelines and show the intended pairing with Llama-Nemotron-Embed-1B-v2.

Connections

Resources