NVIDIA NVSentinel

Type: Tool Tags: NVIDIA, NVSentinel, Kubernetes, GPU monitoring, fault remediation, GPU health, DCGM, operations Related: NVIDIA-Cloud-Accelerator-NCX, NVIDIA-GPU-Operator, NVIDIA-DCGM, NVIDIA-Fleet-Intelligence, NVIDIA-Project-GPUd, NVIDIA-Resiliency-Extension, NVIDIA-AI-Cluster-Runtime Sources: https://docs.nvidia.com/ncx/index.html; https://github.com/NVIDIA/NVSentinel Last Updated: 2026-04-29

Summary

NVIDIA NVSentinel is an open-source, Kubernetes-native GPU fault detection and remediation system. NCX describes it as GPU monitoring and fault remediation for Kubernetes; the public repository describes automatic detection, classification, and remediation of hardware and software faults in GPU nodes. It monitors GPU health, system logs, and cloud provider maintenance events, then can cordon faulty nodes, drain workloads, and trigger break-fix workflows.

Detail

NVSentinel is part of NVIDIA’s operations layer for keeping GPU fleets productive. It requires Kubernetes, Helm, and NVIDIA GPU Operator, which includes DCGM for GPU monitoring. Its features include real-time detection of GPU, NVSwitch, and system-level failures; automated remediation; pluggable health monitors; high availability; event-driven processing; persistent event storage; coordinated workload eviction; and metadata enrichment.

This page should answer questions about Kubernetes-native GPU break/fix automation. It should not be conflated with NVIDIA-DCGM, which provides diagnostics and telemetry, or NVIDIA-Fleet-Intelligence, which is a managed fleet health service. NVSentinel is specifically about automated fault handling inside Kubernetes.

Connections

Source Excerpts

  • “GPU Fault Detection and Remediation for Kubernetes”
  • “cordoning faulty nodes, draining workloads, and triggering break-fix workflows”