NVIDIA Canary

Type: Model
Tags: NVIDIA, ASR, Speech Recognition, NeMo, Multilingual, Translation, Audio AI
Related: NVIDIA-NeMo, NVIDIA-Riva, NVIDIA-Speech-NIM-Microservices, NVIDIA-ASR-NIM, Parakeet-ASR, NVIDIA-NIM, TensorRT
Sources: NVIDIA official documentation, https://docs.nvidia.com/nim/speech/latest/asr/index.html
Last Updated: 2026-04-29

Summary

NVIDIA Canary is a multilingual automatic speech recognition (ASR) and speech translation model built with the NeMo framework. Unlike Parakeet, which is English-focused, Canary transcribes English, Spanish, German, and French, and performs speech-to-text translation between English and each of the other three languages, in both directions. It pairs a FastConformer encoder with a Transformer decoder, is trained with a joint ASR and translation objective, and is reported to achieve state-of-the-art results on multilingual ASR benchmarks.

Detail

Purpose

Global applications require ASR that goes beyond English. Canary addresses the multilingual gap by providing a single model capable of transcribing and translating across four major languages, making it suitable for international call centers, multilingual media, global meeting transcription, and cross-language accessibility applications.

Key Features

Multilingual support: English, Spanish, German, French (transcription and translation)
Speech-to-text translation: take speech in one language and output text in another
Encoder-decoder architecture: FastConformer encoder + Transformer decoder
Multi-task training: joint ASR and speech translation objective
Model sizes: Canary-1B (primary release)
Reported state-of-the-art results on multilingual benchmarks: word error rate (WER) for transcription (e.g. MLS) and translation quality on FLEURS and CoVoST-2
Punctuation and capitalization (PnC) in output by default
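Open-weight release on Hugging Face (Canary-1B is published under the non-commercial CC-BY-NC-4.0 license; check the model card for the terms of any other variant)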
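The WER figure cited in the feature list is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal, library-free illustration of that formula; the function name and the sample sentences are hypothetical, and real benchmark scoring usually applies text normalization (casing, punctuation) first, which matters for a model that emits PnC.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only, not benchmark data.
perfect = word_error_rate("the cat sat", "the cat sat")
two_errors = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Here the second call has one substitution and one deletion against six reference words, so the result is 2/6.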

Use Cases

Multilingual meeting transcription and captioning
Cross-language subtitle generation for international media
International call center analytics and compliance
Real-time speech translation for conferencing
Multilingual voice search and voice command systems
Academic research on multilingual speech processing

Hardware Requirements / Compatibility

Single A10G / A100 / H100 GPU for real-time inference
Runs efficiently on T4 for non-real-time batch workloads
TensorRT optimization available via NVIDIA Riva
Deployable on NVIDIA Jetson for on-device edge inference
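The distinction between real-time and batch workloads in the list above is commonly expressed as a real-time factor (RTF): processing time divided by audio duration, where a value below 1 means the system keeps up with incoming audio. The sketch below only shows the arithmetic; the timings are hypothetical placeholders, not measurements of Canary on any GPU.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio transcribed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def keeps_up_with_live_audio(rtf: float) -> bool:
    """An RTF below 1.0 means audio is processed faster than it arrives."""
    return rtf < 1.0


# Hypothetical timings chosen only to illustrate the two regimes.
fast_rtf = real_time_factor(processing_seconds=3.0, audio_seconds=60.0)
slow_rtf = real_time_factor(processing_seconds=90.0, audio_seconds=60.0)
```

A deployment that misses the real-time threshold can still be useful for offline batch jobs, where throughput matters more than latency.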

Language Bindings / APIs

Python (NVIDIA NeMo: nemo.collections.asr, EncDecMultiTaskModel)
NVIDIA Riva SDK for production deployment
Hugging Face (nvidia/canary-1b)
NVIDIA-ASR-NIM containers, as described in the current Speech NIM documentation
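When Canary is driven through NeMo, the task (transcription or translation), the language pair, and the PnC setting are typically supplied per utterance in a JSON-lines manifest. The sketch below builds such a manifest with the standard library only. The field names (taskname, source_lang, target_lang, pnc, answer) are recalled from the Canary-1B model card and should be checked against the NeMo version in use; the helper names, file names, and the English-pivot restriction are assumptions made for illustration.

```python
import json

# Languages named in this note; re-check against the model card.
SUPPORTED_LANGS = ("en", "es", "de", "fr")


def canary_manifest_entry(audio_filepath, duration_s, source_lang, target_lang, pnc=True):
    """Build one manifest record; the task is inferred from the language pair."""
    for lang in (source_lang, target_lang):
        if lang not in SUPPORTED_LANGS:
            raise ValueError(f"unsupported language: {lang}")
    if source_lang != target_lang and "en" not in (source_lang, target_lang):
        # Assumption: Canary-1B translates only to or from English.
        raise ValueError("translation pairs must include English")
    return {
        "audio_filepath": audio_filepath,
        "duration": duration_s,
        "taskname": "asr" if source_lang == target_lang else "s2t_translation",
        "source_lang": source_lang,
        "target_lang": target_lang,
        "pnc": "yes" if pnc else "no",
        "answer": "na",  # placeholder value, as recalled from the model card
    }


def to_jsonl(entries):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(entry) for entry in entries)


# Hypothetical audio files: one English transcription, one German-to-English translation.
entries = [
    canary_manifest_entry("meeting_en.wav", 12.5, "en", "en"),
    canary_manifest_entry("call_de.wav", 8.0, "de", "en", pnc=False),
]
manifest_text = to_jsonl(entries)
```

The resulting text would be written to a file and its path handed to the model's transcribe call in NeMo; the exact call signature varies between NeMo releases, so it is left out here.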

Connections

NVIDIA-NeMo: Canary is developed and trained within the NeMo speech collection
NVIDIA-Riva: Riva provides the production deployment runtime for Canary in streaming mode
NVIDIA-Speech-NIM-Microservices and NVIDIA-ASR-NIM: the current Speech NIM documentation lists Canary 1B for offline transcription and bidirectional translation
Parakeet-ASR: Parakeet is English-focused; Canary extends to multilingual use cases
NVIDIA-NIM: Canary is available via NIM microservices
TensorRT: ASR NIM deployment uses TensorRT/Triton inference acceleration
