NIM for Maxine Audio2Face-2D

Type: Microservice Tags: NVIDIA, NIM, Maxine, Audio2Face-2D, A2F-2D, facial animation, portrait animation, digital human, CUDA, TensorRT, Triton Related: NVIDIA-NIM, NVIDIA-Maxine, NVIDIA-AI-for-Media-SDKs, NVIDIA-ACE, NIM-for-Audio2Face-3D, NVIDIA-Augmented-Reality-SDK, NVIDIA-Omniverse, NVIDIA-AI-Enterprise, TensorRT, Triton-Inference-Server, NVIDIA-CUDA, NGC Sources: https://docs.nvidia.com/nim/maxine/audio2face-2d/latest/overview.html, https://docs.nvidia.com/nim/maxine/audio2face-2d/latest/support-matrix.html Last Updated: 2026-04-29

Summary

NVIDIA Audio2Face-2D NIM is a Maxine NIM microservice for animating a portrait photo from driving speech audio. Current NVIDIA docs describe it as estimating mouth-motion landmarks from audio, encoding those movements into latent representations, and generating a synchronized facial animation with CUDA, TensorRT, and Triton acceleration.

Detail

Purpose

Audio2Face-2D NIM gives applications a deployable portrait-animation service for digital humans, media experiences, video messaging, and avatar workflows where the output is a 2D animated image rather than a full 3D character rig.

Current scope

  • Uses input audio plus a portrait image to generate mouth-synchronized facial animation.
  • Extracts MFCC audio features and runs LSTM-based landmark estimation for mouth articulation.
  • Supports additional facial motion controls such as blinking, gaze, mouth movement, and head pose as part of the animation pipeline.
  • Uses an encoder-decoder architecture with recurrent, convolutional, and generative model components.
  • Model ID in the current support matrix is maxine-audio2face-2d.
  • Current optimized configurations list FP16 profiles for V100, T4, A2, A10, A16, A40, L4, L40, and RTX 50-series GPUs.
  • Requires GPUs with Tensor Cores and NVENC/NVDEC support; current docs explicitly exclude A100/H100-class products for this NIM because they lack required media encode/decode hardware.

NVIDIA context

This page is the 2D portrait-animation NIM counterpart to NIM-for-Audio2Face-3D. It belongs in the NVIDIA-Maxine and NVIDIA-AI-for-Media-SDKs graph for media enhancement, and in the NVIDIA-ACE graph when a digital-human workflow can use a generated 2D avatar surface.

Connections

Source Excerpts

  • NVIDIA docs describe Audio2Face-2D NIM as creating facial animation from a portrait photo and speech audio.
  • The support matrix lists maxine-audio2face-2d as an NVIDIA-published model with CUDA, TensorRT, Triton, and DeepStream components.

Resources