Parakeet ASR

Type: Model
Tags: NVIDIA, ASR, Speech Recognition, NeMo, CTC, RNN-T, Audio AI
Related: NVIDIA-NeMo, NVIDIA-Riva, NVIDIA-Speech-NIM-Microservices, NVIDIA-ASR-NIM, NVIDIA-Canary, TensorRT, NVIDIA-NIM
Sources: NVIDIA official documentation, https://docs.nvidia.com/nim/speech/latest/asr/index.html
Last Updated: 2026-04-29

Summary

Parakeet is NVIDIA’s family of automatic speech recognition (ASR) models, developed by the NVIDIA NeMo team in collaboration with Suno.ai and built on the NeMo framework. The models report top-tier word error rates (WER) on standard English ASR benchmarks and are optimized for both real-time and batch transcription. They pair a FastConformer encoder with a CTC, RNN-T, or TDT decoder, and are available as open-weight models on Hugging Face.
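
Because WER is the headline metric for these models, a minimal sketch of how it is usually computed may help when comparing transcripts. The function name and the normalization (lowercasing and whitespace splitting) are assumptions made for illustration; published benchmark numbers typically apply more elaborate text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words (Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, word_error_rate("the cat sat on the mat", "the cat sat on mat") counts one deletion against six reference words.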

Detail

Purpose

High-accuracy, low-latency automatic speech recognition is a critical building block for voice assistants, meeting transcription, call center analytics, medical dictation, and accessibility tools. Parakeet addresses the need for a production-quality open-weight English ASR model that outperforms prior state-of-the-art systems while remaining efficient enough for real-time deployment.

Key Features

  • FastConformer-based encoder: efficient hybrid CNN-Transformer architecture for long-context audio
  • Multiple decoder types: CTC (Parakeet-CTC), RNN-T (Parakeet-RNNT), and TDT, the Token-and-Duration Transducer (Parakeet-TDT)
  • Model sizes: 0.6B and 1.1B parameters, for example Parakeet-CTC-0.6B, Parakeet-TDT-1.1B, and Parakeet-RNNT-1.1B
  • Word-level timestamps: accurate start/end times for each recognized word
  • Reported best-in-class WER on the LibriSpeech, MLS, VoxPopuli, Earnings21, and CHiME-6 benchmarks
  • Trained on 64,000+ hours of English speech data
  • Support for long-form audio transcription (hours-long audio with chunking)
  • Open-weight release on Hugging Face (nvidia/parakeet-*)
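
The long-form chunking and word-level timestamp features above interact: when audio is split into windows, each window's timestamps are relative to that window and have to be shifted back onto the full recording. The sketch below illustrates the idea only. The 30-second window, 5-second overlap, and the Word, plan_chunks, and to_global_time names are hypothetical and are not taken from NeMo or Riva. Merging duplicate words in the overlap regions is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def plan_chunks(duration_s: float, chunk_s: float = 30.0,
                overlap_s: float = 5.0) -> List[Tuple[float, float]]:
    """Return (start, end) windows that cover [0, duration_s] with overlap."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    chunks = []
    start = 0.0
    step = chunk_s - overlap_s
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        if end >= duration_s:
            break  # the last window reached the end of the recording
        start += step
    return chunks


def to_global_time(words: List[Word], chunk_start: float) -> List[Word]:
    """Shift chunk-relative word timestamps onto the full recording's timeline."""
    return [Word(w.text, w.start + chunk_start, w.end + chunk_start) for w in words]
```

With these defaults, a 70-second recording is split into the windows 0 to 30, 25 to 55, and 50 to 70 seconds, and a word recognized at 0.5 s inside the second window lands at 25.5 s in the full recording.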

Use Cases

  • Real-time meeting transcription and captioning
  • Call center and contact center analytics
  • Voice assistant backend for English-language commands
  • Medical dictation and clinical documentation
  • Subtitle and caption generation for video content
  • Accessibility tools for deaf and hard-of-hearing users
  • Podcast and audio search indexing

Hardware Requirements / Compatibility

  • Parakeet-CTC-0.6B: single GPU (T4, A10, A100) or CPU for small batches
  • Parakeet 1.1B variants: a single A10G, A100, or H100 is recommended for real-time workloads
  • TensorRT-optimized via NVIDIA Riva for production deployment
  • Runs on NVIDIA Jetson AGX Orin for edge deployment
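
Whether a given GPU is adequate for real-time use is commonly judged by the real-time factor (RTF): processing time divided by audio duration, where values below 1 mean transcription runs faster than the audio plays. The helper below is a hypothetical measurement sketch, not part of any NVIDIA tooling. It times a single call to whatever transcription function is passed in.

```python
import time
from typing import Callable


def real_time_factor(transcribe: Callable[[], object], audio_duration_s: float) -> float:
    """Return processing_time / audio_duration for one call to `transcribe`."""
    if audio_duration_s <= 0:
        raise ValueError("audio_duration_s must be positive")
    started = time.perf_counter()
    transcribe()  # any zero-argument callable that runs the model on the audio
    elapsed = time.perf_counter() - started
    return elapsed / audio_duration_s
```

A single timed call is noisy. In practice one would warm the model up first and average over several files.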

Language Bindings / APIs

  • Python (NVIDIA NeMo: nemo.collections.asr)
  • NVIDIA Riva SDK (gRPC streaming API)
  • Hugging Face Transformers (AutoModelForCTC, pipeline)
  • NVIDIA-ASR-NIM microservices for containerized deployment, as described in the current Speech NIM documentation
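
As a starting point for the Python binding, the following sketch loads a checkpoint through the NeMo ASR collection and transcribes a list of files. It assumes NeMo is installed and that the audio is 16 kHz mono WAV. The checkpoint name nvidia/parakeet-ctc-0.6b is one example of the nvidia/parakeet-* family, and the helper names are hypothetical. The shape of what transcribe returns has varied across NeMo releases, so the normalization step is a defensive assumption rather than documented behavior.

```python
from typing import Any, List


def load_parakeet(model_name: str = "nvidia/parakeet-ctc-0.6b") -> Any:
    """Download (or load from cache) a Parakeet checkpoint through NeMo."""
    # Imported lazily so transcribe_files can be used without NeMo installed.
    import nemo.collections.asr as nemo_asr

    return nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)


def transcribe_files(model: Any, paths: List[str]) -> List[str]:
    """Call model.transcribe and normalize the result to plain strings."""
    results = model.transcribe(paths)
    # Assumption: some releases return (best_hypotheses, all_hypotheses)
    # for transducer models, so keep only the best hypotheses.
    if isinstance(results, tuple):
        results = results[0]
    # Assumption: items are either plain strings or objects with a .text field.
    return [r if isinstance(r, str) else r.text for r in results]
```

Typical use would be transcribe_files(load_parakeet(), ["meeting.wav"]), where meeting.wav is a placeholder path.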

Connections

  • NVIDIA-NeMo — Parakeet is trained, fine-tuned, and served via the NeMo ASR collection
  • NVIDIA-Riva — Riva provides production streaming ASR deployment for Parakeet models
  • NVIDIA-Speech-NIM-Microservices and NVIDIA-ASR-NIM - The current Speech NIM documentation lists Parakeet CTC, TDT, and RNNT models as ASR NIM options
  • NVIDIA-Canary — Canary is NVIDIA’s multilingual complement to Parakeet’s English-focused ASR
  • TensorRT — Parakeet ASR NIM deployment uses TensorRT/Triton inference acceleration.
  • NVIDIA-NIM — Available as NIM containers for one-click ASR deployment

Resources