NVIDIA Riva

Type: Platform
Tags: NVIDIA, speech AI, ASR, TTS, NLP, real-time, conversational AI, speech-to-text, text-to-speech
Related: NVIDIA-NIM, NVIDIA-Speech-NIM-Microservices, NVIDIA-ASR-NIM, Nemotron-ASR-Streaming, Nemotron-3-VoiceChat, NVIDIA-TTS-NIM, NVIDIA-NMT-NIM, NVIDIA-Background-Noise-Removal-NIM, NIM-for-Maxine-Studio-Voice, NIM-for-Audio2Face-3D, NVIDIA-Tokkio-Digital-Human-Blueprint, NVIDIA-NeMo, NVIDIA-AI-Enterprise, Triton-Inference-Server, NGC, NVIDIA-Maxine, NVIDIA-Audio-Effects-SDK
Sources: NVIDIA official documentation (live fetch attempted 2026-04-10; updated from https://docs.nvidia.com/ace/tokkio/latest/overview/architecture.html, https://docs.nvidia.com/maxine/afx/latest/index.html, https://docs.nvidia.com/nim/speech/latest/index.html, https://docs.nvidia.com/nim/speech/latest/asr/index.html, https://docs.nvidia.com/nim/speech/latest/tts/index.html, https://docs.nvidia.com/nim/speech/latest/nmt/index.html, https://docs.nvidia.com/nim/maxine/studio-voice/latest/overview.html, https://docs.nvidia.com/nim/digital-human/a2f-3d/latest/index.html, https://build.nvidia.com/nvidia/nemotron-voicechat/modelcard)
Last Updated: 2026-04-29

Summary

NVIDIA Riva is a GPU-accelerated SDK and speech AI platform for building real-time, production-grade speech and conversational AI applications. NIM-specific speech deployment documentation has moved under NVIDIA-Speech-NIM-Microservices, with separate NVIDIA-ASR-NIM, NVIDIA-TTS-NIM, and NVIDIA-NMT-NIM pages covering the currently maintained microservice surface.

Detail

Purpose

Building speech AI from scratch requires assembling acoustic models, language models, punctuation restoration, inverse text normalization, and speaker diarization — all requiring different expertise and GPU optimization. Riva packages this complexity into a turnkey SDK: plug in audio, receive text (ASR), or plug in text, receive audio (TTS), with enterprise-grade latency, accuracy, and customizability. Riva powers the speech layer in NVIDIA’s conversational AI ecosystem, integrating with NeMo for model customization and Triton for serving.

Key Features

  • Automatic Speech Recognition (ASR): High-accuracy streaming and offline ASR with support for 60+ languages; includes punctuation restoration, inverse text normalization (ITN), and speaker diarization
  • Text-to-Speech (TTS): Neural TTS with natural-sounding voices; customizable voice cloning and expressive synthesis using FastPitch or RadTTS acoustic models paired with the HiFi-GAN vocoder
  • Neural Machine Translation (NMT): Real-time translation between 30+ language pairs with domain-adaptive fine-tuning capability
  • NLP Tasks: Named entity recognition (NER), intent/slot classification, and question answering pipelines
  • Parakeet ASR Models: NVIDIA’s state-of-the-art English ASR model family (Parakeet-TDT, Parakeet-CTC, Parakeet-RNNT); competitive with Whisper on accuracy at lower latency
  • Canary Models: Multi-lingual ASR and speech translation models for 4–50+ languages
  • Custom Vocabulary & Acoustic Adaptation: Domain-specific vocabulary injection (medical, legal, financial terminology) without full model retraining; acoustic model adaptation to new speakers or noise conditions
  • Streaming & Offline Modes: Real-time streaming ASR with configurable chunk sizes for low-latency applications; batch offline mode for maximum accuracy
  • Current Speech NIM docs: NVIDIA-Speech-NIM-Microservices is the maintained docs surface for Riva-lineage ASR, TTS, and NMT NIM deployments.
  • NIM Packaging: ASR, TTS, and NMT capabilities are deployable as NIM microservices via Docker or Helm with gRPC/HTTP APIs.
  • Nemotron ASR: Nemotron-ASR-Streaming is the model-specific current English streaming ASR page for the Riva/Speech NIM path.
  • Nemotron VoiceChat: Nemotron-3-VoiceChat is NVIDIA’s full-duplex speech-to-speech model, adjacent to Riva/Speech NIM pipelines that would otherwise chain ASR, LLM, and TTS.
  • Hardware-Optimized: TensorRT-compiled models; optimized for Ampere and Hopper GPU Tensor Cores
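The streaming/offline trade-off above largely comes down to chunk size: smaller chunks cut latency but give the model less acoustic context per step. A back-of-envelope helper for sizing raw PCM chunks (plain Python, no Riva dependency; 16 kHz mono 16-bit PCM is a common ASR input assumption here, not a Riva requirement):

```python
def chunk_size_bytes(sample_rate_hz: int, chunk_ms: int,
                     bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Bytes of raw PCM audio contained in one streaming chunk."""
    return sample_rate_hz * chunk_ms // 1000 * bytes_per_sample * channels

# 100 ms of 16 kHz mono 16-bit PCM -> 3200 bytes per chunk (low latency)
print(chunk_size_bytes(16000, 100))   # 3200
# 800 ms chunks trade latency for more acoustic context
print(chunk_size_bytes(16000, 800))   # 25600
```

The same arithmetic applies whether chunks are pushed over gRPC, WebSocket, or HTTP.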

Use Cases

  • Call center automation: real-time ASR transcription + intent classification for agent assist and self-service IVR
  • Voice assistant and chatbot voice interfaces with low-latency speech I/O
  • Digital-human speech pipelines such as NVIDIA-Tokkio-Digital-Human-Blueprint
  • Real-time meeting transcription and captioning for accessibility
  • Multilingual customer service with NMT for real-time translation in contact centers
  • Medical transcription: clinical-domain ASR with medical vocabulary and speaker diarization for multi-speaker clinical notes
  • Edge voice AI on NVIDIA Jetson for offline/on-device speech applications in robotics and embedded systems
  • Video subtitle generation and localization workflows

Hardware Requirements / Compatibility

  • GPU: NVIDIA T4, A10, A30, A100, H100 (data center); RTX A-series (workstation); Jetson AGX Orin (edge)
  • Minimum GPU Memory: 8 GB VRAM for ASR-only deployments; 16+ GB for full ASR + TTS stacks
  • CUDA: 11.8 or later; TensorRT 8.x+ for compiled inference
  • OS: Linux (Ubuntu 18.04/20.04/22.04); distributed as Docker containers
  • Kubernetes: Helm charts available; integrates with GPU Operator for auto-provisioning
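The VRAM minimums above can be turned into a trivial pre-deployment sanity check. A sketch, where the 8 GB and 16 GB thresholds come straight from the list above and everything else is illustrative:

```python
def services_that_fit(vram_gb: float) -> list[str]:
    """Which deployment shapes the listed VRAM minimums allow.

    Thresholds follow the hardware notes above: 8 GB for ASR-only,
    16+ GB for a full ASR + TTS stack. Real sizing also depends on
    model choice, batch size, and concurrency.
    """
    fits = []
    if vram_gb >= 8:
        fits.append("ASR-only")
    if vram_gb >= 16:
        fits.append("ASR + TTS")
    return fits

print(services_that_fit(8))    # ['ASR-only']
print(services_that_fit(24))   # ['ASR-only', 'ASR + TTS']
```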

Language Bindings / APIs

  • Python gRPC Client SDK: riva-python-clients package for streaming and batch ASR/TTS from Python applications
  • gRPC API: High-performance streaming protocol ideal for real-time speech applications (lower overhead than HTTP)
  • REST/HTTP API: Available for non-streaming use cases; compatible with standard HTTP clients
  • NIM REST API: OpenAI-compatible speech endpoints (/v1/audio/transcriptions, /v1/audio/speech) for drop-in compatibility
  • C++ Client SDK: Low-latency C++ client for embedded or high-performance applications
  • WebSocket: Browser-compatible streaming via WebSocket proxy for web applications
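As a sketch of the OpenAI-compatible NIM surface listed above, the following builds (but does not send) a POST to the /v1/audio/speech endpoint using only the Python standard library. The host/port and JSON field names are illustrative assumptions in the OpenAI audio API shape; check the Speech NIM docs for the exact parameters your deployment accepts, and note that /v1/audio/transcriptions would instead take multipart/form-data with an audio file:

```python
import json
import urllib.request

# Assumed local NIM endpoint; adjust host/port for your deployment.
BASE_URL = "http://localhost:9000"

def build_speech_request(text: str, voice: str = "default") -> urllib.request.Request:
    """Construct (without sending) a TTS request against the
    OpenAI-compatible /v1/audio/speech endpoint."""
    payload = json.dumps({"input": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/audio/speech",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_speech_request("Hello from Riva")
print(req.get_method(), req.full_url)   # POST http://localhost:9000/v1/audio/speech
```

Sending the request (urllib.request.urlopen(req)) would return synthesized audio bytes from a live deployment.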

Connections

Resources