NIM for Maxine Active Speaker Detection

Type: Microservice
Tags: NVIDIA, NIM, Maxine, active speaker detection, video analytics, diarization, AR, CUDA, TensorRT, Triton, DeepStream, NVDEC
Related: NVIDIA-NIM, NVIDIA-Maxine, NVIDIA-AI-for-Media-SDKs, NVIDIA-Augmented-Reality-SDK, NIM-for-Maxine-Eye-Contact, NVIDIA-DeepStream, NVIDIA-Video-Codec-SDK, NVIDIA-AI-Enterprise, TensorRT, Triton-Inference-Server, NVIDIA-CUDA, NGC
Sources: https://docs.nvidia.com/nim/maxine/active-speaker-detection/latest/overview.html, https://docs.nvidia.com/nim/maxine/active-speaker-detection/latest/support-matrix.html
Last Updated: 2026-04-29

Summary

NVIDIA Active Speaker Detection NIM is a Maxine NIM microservice for detecting and identifying active speakers in video by combining visual analysis with diarized audio data. Current NVIDIA docs describe per-frame outputs such as bounding boxes, speaker identifiers, active-speaking state, and face-detection confidence scores.

Detail

Purpose

Active Speaker Detection NIM helps meeting, media, telepresence, and video-understanding systems determine who is speaking at each point in a video stream. This is useful for speaker-aware editing, analytics, avatar interaction, accessibility, and meeting intelligence.

Current scope

  • Accepts video, audio, and diarization data inputs.
  • Uses GPU-accelerated video decoding through GStreamer/NVDEC.
  • Extracts frame-accurate audio aligned with decoded video frames.
  • Uses diarization timelines to build per-frame speaker masks.
  • Runs inference through an NVIDIA AR SDK backend on Triton.
  • Returns per-frame bounding boxes, speaker IDs, active-speaker state, and face-detection confidence scores.
  • Supports streaming mode for chunked/streamable inputs and transactional mode for complete file processing.
  • Model ID in the current support matrix is active-speaker-detection.
  • Current optimized configurations list FP16 profiles for T4, A2, A10, A16, A40, L4, L40, L40S, B40, NVIDIA RTX PRO 6000 Blackwell Server Edition, RTX 4090, RTX 5090, and RTX 5080.
  • Depends on NVDEC for hardware video decode, so GPUs without NVDEC support may be unsuitable.
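
The diarization step in the list above (timelines turned into per-frame speaker masks) can be illustrated with a minimal sketch. This is not the NIM's actual implementation or data format; the Segment type, the build_frame_masks function, and the choice to sample each frame at its midpoint time are all assumptions made for illustration. It only shows the general idea of mapping speaker time ranges onto a fixed frame rate.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Segment:
    """Hypothetical diarization segment: one speaker talking over [start, end) seconds."""
    start: float
    end: float
    speaker: str


def build_frame_masks(segments: List[Segment], fps: float, num_frames: int) -> Dict[str, List[bool]]:
    """Return one boolean mask per speaker, with one entry per video frame.

    Assumption: frame i is considered to cover the instant (i + 0.5) / fps,
    i.e. the midpoint of its display interval. A real pipeline would more
    likely use decoder presentation timestamps instead.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")

    # Start with every speaker marked silent on every frame.
    masks: Dict[str, List[bool]] = {seg.speaker: [False] * num_frames for seg in segments}

    for i in range(num_frames):
        t = (i + 0.5) / fps  # midpoint time of frame i, in seconds
        for seg in segments:
            if seg.start <= t < seg.end:
                masks[seg.speaker][i] = True
    return masks


if __name__ == "__main__":
    timeline = [Segment(0.0, 0.3, "A"), Segment(0.2, 0.5, "B")]
    for speaker, mask in build_frame_masks(timeline, fps=10, num_frames=6).items():
        print(speaker, mask)
```

Overlapping segments are kept as-is, so two speakers can be marked active on the same frame. How the service itself resolves overlaps or frame-boundary cases is not stated in the source excerpts.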

NVIDIA context

Active Speaker Detection NIM packages, as a NIM microservice, a meeting/video intelligence capability that was already present in NVIDIA-Augmented-Reality-SDK. It connects Maxine media AI with NVIDIA-DeepStream, Triton-Inference-Server, and GPU media decode.

Connections

Source Excerpts

  • NVIDIA docs describe Active Speaker Detection NIM as detecting and identifying active speakers from visual and diarized audio data.
  • The architecture section lists CUDA, TensorRT, Triton, and an NVIDIA AR SDK backend for inference.

Resources