NVPL (NVIDIA Performance Libraries)

Type: Technology
Tags: CUDA, NVIDIA, GPU, CPU, HPC, Math Libraries, Arm, Grace, CUDA-X
Related: NVIDIA-HPC-SDK, NVIDIA-HPC-Compilers, cuBLAS, cuFFT, cuSOLVER, cuSPARSE, nvmath-python
Sources: NVIDIA official documentation, developer.nvidia.com/nvpl
Last Updated: 2026-04-09

Summary

NVPL (NVIDIA Performance Libraries) is a collection of high-performance CPU math libraries optimized specifically for the NVIDIA Grace CPU (Arm Neoverse V2 architecture) and the Grace Hopper / Grace Blackwell Superchips. NVPL provides CPU-side counterparts to the CUDA-X GPU libraries — NVPL BLAS, NVPL LAPACK, NVPL FFT, NVPL RAND, NVPL ScaLAPACK, and NVPL Sparse — enabling applications on Grace-based systems to achieve maximum CPU performance while interoperating seamlessly with their GPU-accelerated CUDA-X equivalents.

Detail

Purpose

NVPL addresses the need for CPU math libraries specifically tuned for NVIDIA's Arm-based Grace CPU, which powers the Grace Hopper and Grace Blackwell Superchips. Existing x86-optimized libraries such as Intel MKL do not run on Grace; NVPL provides equivalent functionality with performance optimized for the Arm Neoverse V2 microarchitecture, its SVE (Scalable Vector Extension) SIMD units, and the NVLink-C2C CPU-GPU interconnect used in the Grace superchips.

Key Features

  • NVPL BLAS: CPU BLAS (Level 1, 2, 3) for Grace — SGEMM, DGEMM, ZGEMM with Neoverse V2 + SVE optimization
  • NVPL LAPACK: CPU dense linear algebra (eigensolvers, SVD, LU, QR, Cholesky) for Grace
  • NVPL FFT: CPU FFT library matching cuFFT API patterns, optimized for Grace cache hierarchy
  • NVPL RAND: CPU random number generation matching cuRAND host API
  • NVPL ScaLAPACK: distributed CPU dense linear algebra via MPI for Grace-based clusters
  • NVPL Sparse: CPU sparse linear algebra matching cuSPARSE patterns
  • OpenBLAS- and LAPACK-compatible interfaces for drop-in replacement
  • Optimized for SVE2 vector instructions and Neoverse V2 microarchitecture
  • Thread-parallel: OpenMP-backed multithreaded implementations
  • Interoperable with CUDA-X counterparts via unified memory on Grace Hopper NVLink-C2C
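As a reference for the Level 3 routines listed above: DGEMM computes C ← αAB + βC in double precision. A minimal sketch of that contract in pure NumPy (no NVPL required; on a Grace system where NumPy is built against NVPL, the `@` matmul below would itself dispatch to NVPL BLAS):

```python
import numpy as np

# DGEMM semantics: C <- alpha * A @ B + beta * C (double precision).
# Pure NumPy illustration of the standard BLAS contract that NVPL BLAS
# implements behind the DGEMM / cblas_dgemm interfaces.
rng = np.random.default_rng(0)
m, k, n = 4, 3, 5
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))
C = rng.standard_normal((m, n))
alpha, beta = 2.0, 0.5

expected = alpha * (A @ B) + beta * C

# Naive triple loop spelling out exactly what the routine computes.
C_ref = beta * C.copy()
for i in range(m):
    for j in range(n):
        for p in range(k):
            C_ref[i, j] += alpha * A[i, p] * B[p, j]

assert np.allclose(C_ref, expected)
```

The loop form is only there to make the update rule explicit; the whole point of NVPL BLAS is that the optimized library call replaces it with SVE-vectorized, OpenMP-threaded kernels.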

Use Cases

  • HPC applications on NVIDIA Grace Hopper Superchip CPU partition
  • CPU fallback implementations that mirror GPU CUDA-X behavior
  • Mixed CPU-GPU workflows on Grace Hopper where some steps are CPU-resident
  • Porting Intel MKL-dependent HPC codes to NVIDIA Grace platforms
  • Climate and weather models that have CPU-resident compute phases
  • Linear algebra heavy scientific codes (quantum chemistry, FEM)

Hardware Requirements

  • Primarily designed for NVIDIA Grace CPU (Arm Neoverse V2)
  • Grace Hopper Superchip (GH200): Grace + H100 connected via NVLink-C2C
  • Grace Blackwell Superchip (GB200): Grace + two B200 GPUs connected via NVLink-C2C
  • Can run on other Arm Neoverse systems with reduced optimization
  • CUDA not required (CPU-only library, but works alongside CUDA on Grace Hopper)

Language Bindings

  • Fortran (BLAS/LAPACK standard Fortran interfaces)
  • C (CBLAS and LAPACKE C interfaces)
  • C++ (via C interfaces)
  • Python (via SciPy/NumPy configured with NVPL as BLAS/LAPACK backend)
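For the Python path, no NVPL-specific code is needed: when NumPy/SciPy are built with NVPL as the BLAS/LAPACK backend, standard calls dispatch to NVPL transparently. A hedged sketch (runs against whatever BLAS/LAPACK your NumPy build links; only on an NVPL-backed build does it exercise NVPL):

```python
import numpy as np

# numpy.linalg.solve is backed by the LAPACK driver dgesv; on a Grace
# system with NumPy built against NVPL, this resolves to NVPL LAPACK.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)      # LAPACK dgesv under the hood; x == [2.0, 3.0]
assert np.allclose(A @ x, b)

# Print which BLAS/LAPACK build this NumPy was linked against.
np.show_config()
```

On an NVPL-backed build, `np.show_config()` is the quickest way to confirm the backend that `solve`, `@`, and friends are actually using.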

Connections

  • cuBLAS — NVPL BLAS is the CPU-side companion to cuBLAS on Grace Hopper systems
  • NVIDIA-HPC-SDK — the HPC SDK documentation hub lists NVPL alongside the CUDA math libraries
  • NVIDIA-HPC-Compilers — compiler stack used to build the CPU/GPU math workflows of HPC applications
  • cuFFT — NVPL FFT mirrors the cuFFT API for CPU-side transforms on Grace
  • cuSOLVER — NVPL LAPACK provides the CPU equivalents of cuSOLVER dense solvers
  • cuSPARSE — NVPL Sparse provides CPU sparse linear algebra complementing cuSPARSE
  • nvmath-python — Python math library whose CPU (host) APIs can be paired with NVPL backends alongside its CUDA-X GPU backends

Resources