CUB

Type: Technology Tags: CUDA, NVIDIA, GPU, Parallel Primitives, C++, Templates, Low-Level, HPC Related: Thrust, CUTLASS, cuBLAS, cuDF Sources: NVIDIA official documentation, GitHub Last Updated: 2026-04-09

Summary

CUB is NVIDIA’s C++ template library providing reusable, optimized cooperative primitives for CUDA programming across all levels of the GPU hierarchy — thread, warp, block, and device-wide. It is now maintained as part of the unified NVIDIA CCCL (C++ Core Compute Libraries) repository alongside Thrust and libcu++. CUB is included in the CUDA Toolkit and NVIDIA HPC SDK, and serves as the foundation upon which Thrust and many other GPU libraries are built.

Detail

Purpose

Efficient GPU programming requires careful management of data across threads, warps, and thread blocks. CUB provides reusable, architecture-aware implementations of common parallel patterns (sort, scan, reduce, histogram) at each level of the GPU hierarchy, so developers don’t need to re-implement these primitives from scratch. It is the “assembly language” of parallel GPU algorithms — Thrust sits above CUB, and CUB sits above raw CUDA kernels.

Key Features

Device-wide primitives: sort, prefix scan, reduction, histograms (compatible with CUDA dynamic parallelism)
Block-level collective operations: data loading/storing, sorting, scanning, reducing across thread blocks
Warp-level primitives: prefix scans and reductions at warp granularity
Utility components: PTX intrinsics, device introspection, memory allocators, texture-caching iterators
Architecture-aware and safe implementations (optimized per GPU generation)
Now part of NVIDIA CCCL (unified with Thrust and libcu++)
Included in CUDA Toolkit 12.0+ (CUB 2.0.1)
Compiler support: NVCC 11.0+, GCC 5+, Clang 7+, MSVC 2019+

Use Cases

Building custom GPU library kernels requiring efficient parallel reductions/scans
Performance-critical CUDA kernel development
Implementing histogram and sort operations at scale
Foundation for higher-level libraries (Thrust, cuDF, cuML)
Research into novel GPU parallel algorithms

Hardware Requirements

NVIDIA GPU with CUDA support
CUDA Toolkit 11.0+; CUDA 12.0 recommended
All modern NVIDIA GPU architectures (Volta through Blackwell)

Language Bindings

C++ (template library — primary interface)

Connections

Thrust — Thrust is built on CUB; CUB provides the underlying block/device primitives Thrust uses
CUTLASS — CUTLASS and CUB are complementary low-level GPU libraries in the CCCL ecosystem
cuDF — RAPIDS cuDF uses CUB device-wide primitives for sort, scan, and reduction operations
cuBLAS — cuBLAS uses CUB-like primitives internally for GPU memory management

AIPS BOOM

Explorer

CUB

CUB

Summary

Detail

Purpose

Key Features

Use Cases

Hardware Requirements

Language Bindings

Connections

Resources

Graph View

Table of Contents

Backlinks