Microbenchmarking NVIDIA's Blackwell Architecture: An In-Depth Architectural Analysis
By: Aaron Jarmusch, Sunita Chandrasekaran
Potential Business Impact:
Makes computers learn and work much faster.
As GPU architectures rapidly evolve to meet the growing demands of exascale computing and machine learning, the performance implications of architectural innovations remain poorly understood across diverse workloads. NVIDIA's Blackwell (B200) generation introduces significant architectural advances, including 5th-generation tensor cores, tensor memory (TMEM), a decompression engine (DE), and a dual-chip design; however, systematic methodologies for quantifying these improvements lag behind hardware development cycles. We contribute an open-source microbenchmark suite that offers practical insights into optimizing workloads to fully utilize the rich feature set of modern GPU architectures. This work aims to enable application developers to make informed architectural decisions and to guide future GPU design directions. We study Blackwell GPUs and compare them to the H200 generation with regard to the memory subsystem, the tensor core pipeline, and floating-point precisions (FP32, FP16, FP8, FP6, FP4). Our systematic evaluation of dense/sparse GEMM, transformer inference, and training workloads demonstrates that B200's tensor core enhancements achieve 1.56x higher mixed-precision throughput and 42% better energy efficiency than H200. Our memory analysis reveals a 58% reduction in memory access latency on cache misses, fundamentally changing optimal algorithm design strategies.
Similar Papers
Dissecting the NVIDIA Hopper Architecture through Microbenchmarking and Multiple Level Analysis
Distributed, Parallel, and Cluster Computing
Makes computers do math much faster for AI.
Analyzing Modern NVIDIA GPU cores
Hardware Architecture
Makes computer graphics run much faster.
Striking the Balance: GEMM Performance Optimization Across Generations of Ryzen AI NPUs
Hardware Architecture
Makes AI run much faster on new computer chips.