A Configurable Mixed-Precision Fused Dot Product Unit for GPGPU Tensor Computation
By: Nikhil Rout, Blaise Tine
Potential Business Impact:
Accelerates deep learning training and inference on GPGPUs by fusing low-precision floating-point and integer arithmetic into a single dot product unit.
Efficient mixed-precision matrix multiply-accumulate (MMA) operations are critical for accelerating deep learning workloads on GPGPUs. However, existing open-source RTL implementations of inner dot products rely on discrete arithmetic units, leading to suboptimal throughput and poor resource utilization. To address these challenges, we propose a scalable mixed-precision dot product unit that integrates floating-point and integer arithmetic pipelines within a single fused architecture, implemented as part of the Tensor Core Unit extension of the open-source RISC-V-based Vortex GPGPU. Our design supports low-precision multiplication in FP16, BF16, FP8, BF8, INT8, and UINT4 formats, with higher-precision accumulation in FP32 or INT32, and provides an extensible framework for adding and evaluating other custom representations in the future. Experimental results demonstrate a 4-cycle operation latency at a 306.6 MHz clock frequency on the AMD Xilinx Alveo U55C FPGA, delivering an ideal filled-pipeline throughput of 9.812 GFLOPS in a 4-thread-per-warp configuration.
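To make the low-precision-multiply, higher-precision-accumulate split concrete, the C++ sketch below is a behavioral reference model, not the RTL itself, of one hypothetical lane configuration: four FP16 operand pairs are multiplied and summed into an FP32 accumulator (D = A.B + C). The 4-element width, the names fp16_to_fp32 and dot4_fp16_fp32, and the per-product FP32 rounding order are illustrative assumptions; the actual fused unit is implemented in RTL inside Vortex's Tensor Core Unit extension and may order and round the partial products differently.

    // Behavioral reference model (illustrative sketch only) of a 4-element
    // mixed-precision dot product: FP16 multiply, FP32 accumulate.
    // The 4-element width and function names are assumptions of this sketch.
    #include <cstdint>
    #include <cmath>
    #include <cstdio>

    // Decode an IEEE-754 binary16 value (raw bits) into a binary32 float.
    static float fp16_to_fp32(uint16_t h) {
        uint32_t sign = (h >> 15) & 0x1;
        uint32_t exp  = (h >> 10) & 0x1F;
        uint32_t frac = h & 0x3FF;
        float value;
        if (exp == 0) {              // zero or subnormal: frac * 2^-24
            value = std::ldexp(static_cast<float>(frac), -24);
        } else if (exp == 0x1F) {    // infinity or NaN
            value = frac ? NAN : INFINITY;
        } else {                     // normal: (1024 + frac) * 2^(exp - 25)
            value = std::ldexp(static_cast<float>(frac | 0x400),
                               static_cast<int>(exp) - 25);
        }
        return sign ? -value : value;
    }

    // Dot product of four FP16 pairs, accumulated in FP32 onto 'c':
    // D = a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3] + c.
    static float dot4_fp16_fp32(const uint16_t a[4], const uint16_t b[4], float c) {
        float acc = c;
        for (int i = 0; i < 4; ++i)
            acc += fp16_to_fp32(a[i]) * fp16_to_fp32(b[i]);
        return acc;
    }

    int main() {
        // a = {1.0, 2.0, 0.5, -1.5}, b = {1.0, 1.0, 2.0, 2.0} as FP16 bit patterns.
        const uint16_t a[4] = {0x3C00, 0x4000, 0x3800, 0xBE00};
        const uint16_t b[4] = {0x3C00, 0x3C00, 0x4000, 0x4000};
        // 1 + 2 + 1 - 3 + 1 = 2.0
        std::printf("D = %f\n", dot4_fp16_fp32(a, b, 1.0f));
        return 0;
    }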