NVIDIA Avatar Cloud Engine (ACE)

Type: Platform Tags: NVIDIA, ACE, NPC, Avatar, Digital Human, Speech AI, Animation, Gaming, Simulation Related: NVIDIA-Tokkio-Digital-Human-Blueprint, NVIDIA-Riva, NVIDIA-Speech-NIM-Microservices, NVIDIA-ASR-NIM, NVIDIA-TTS-NIM, NIM-for-Audio2Face-3D, NIM-for-Maxine-Audio2Face-2D, NVIDIA-Maxine, NVIDIA-Omniverse, NVIDIA-NIM, NVIDIA-RAG-Blueprint, NVIDIA-NeMo Sources: NVIDIA official documentation, https://docs.nvidia.com/ace/tokkio/latest/overview/overview.html, https://docs.nvidia.com/ace/tokkio/latest/overview/architecture.html, https://docs.nvidia.com/nim/speech/latest/index.html, https://docs.nvidia.com/nim/digital-human/a2f-3d/latest/index.html, https://docs.nvidia.com/ace/audio2face-3d-microservice/latest/text/getting-started/overview.html Last Updated: 2026-04-29

Summary

NVIDIA Avatar Cloud Engine (ACE) is a suite of AI microservices that enable developers to create interactive, intelligent, and lifelike digital humans and non-player characters (NPCs) for games, simulations, and enterprise applications. ACE combines speech AI (ASR and TTS via Riva), large language models for conversational intelligence, facial animation via Audio2Face, and RAG-powered knowledge bases to create NPCs that can hold natural conversations, express emotions, and respond intelligently to player or user input. ACE powers NVIDIA’s “digital human” demos and is integrated with Unreal Engine and Unity.

Detail

Purpose

Traditional game NPCs use scripted dialog trees that feel rigid and limited. ACE enables truly conversational, emotionally expressive NPCs powered by real AI — characters that understand natural speech, maintain context across a conversation, retrieve knowledge from game lore, and animate believably in real time. This applies equally to enterprise digital humans (virtual agents, customer service avatars, training simulations).

Key Features

ASR microservice: real-time speech-to-text via Riva Parakeet models
TTS microservice: expressive neural text-to-speech via Riva
NVIDIA-Tokkio-Digital-Human-Blueprint is the current NVIDIA-authored digital-human reference workflow that assembles ACE Controller, RAG/LLM knowledge sources, speech, animation, and rendering services.
Current NVIDIA-Speech-NIM-Microservices docs split ASR, TTS, and NMT into deployable NIM containers that can feed ACE-style digital human pipelines
LLM microservice: NIM-hosted LLM for conversational intelligence and intent understanding
Audio2Face: real-time facial animation driven by audio/phoneme input, including NIM-for-Audio2Face-3D for ARKit blendshape output and NIM-for-Maxine-Audio2Face-2D for portrait animation.
RAG pipeline: retrieval from lore/knowledge databases to keep NPCs factually grounded
Emotion and personality control via SteerLM-style conditioning
Low-latency pipeline: designed for real-time interactive use (<200ms end-to-end)
Plugin integrations for Unreal Engine 5 and Unity
Cloud-hosted microservices and on-device deployment options

Use Cases

AI-powered NPCs in AAA games and interactive experiences
Virtual customer service agents and digital front-desk assistants
Healthcare patient education avatars
Training simulations with interactive instructor characters
Virtual try-on and retail assistant avatars
Theme park and entertainment interactive characters

Hardware Requirements / Compatibility

Cloud: NVIDIA A100/H100 GPU-backed microservices (NIM)
On-device: RTX 4080/4090 for local deployment (RTX AI PC)
Game engine: Unreal Engine 5.1+ and Unity 2022+
Supported OS: Windows 11, Linux (for cloud services)

Language Bindings / APIs

NVIDIA NIM REST API (OpenAI-compatible for LLM component)
Riva gRPC API (ASR/TTS)
Unreal Engine ACE Plugin
Unity ACE Plugin
Python SDK for pipeline orchestration

Connections

NVIDIA-Tokkio-Digital-Human-Blueprint - current Digital Human blueprint built from ACE-style services.
NVIDIA-Riva — ACE uses Riva for ASR and TTS in the speech pipeline
NVIDIA-Speech-NIM-Microservices, NVIDIA-ASR-NIM, and NVIDIA-TTS-NIM - current speech NIM docs for deployable ASR/TTS services.
NIM-for-Audio2Face-3D - current Digital Human NIM for speech-to-facial animation and emotion-driven expressions.
NIM-for-Maxine-Audio2Face-2D - Maxine NIM for 2D portrait animation from speech audio.
NVIDIA-Maxine — Maxine handles video/audio enhancement; ACE uses Maxine for avatar video quality
NVIDIA-Omniverse — ACE avatars rendered in Omniverse for USD-based digital human pipelines
NVIDIA-NIM — all ACE microservices deploy as NIM containers
NVIDIA-RAG-Blueprint - RAG workflow that can provide grounded knowledge for digital human responses.
NVIDIA-NeMo — NeMo models (LLM + Riva) underpin ACE conversational AI

AIPS BOOM

Explorer

NVIDIA-ACE

NVIDIA Avatar Cloud Engine (ACE)

Summary

Detail

Purpose

Key Features

Use Cases

Hardware Requirements / Compatibility

Language Bindings / APIs

Connections

Resources

Graph View

Table of Contents

Backlinks