A Configurable Mixed-Precision Fused Dot Product Unit for GPGPU Tensor Computation
By: Nikhil Rout, Blaise Tine
Potential Business Impact:
Accelerates deep learning training and inference on GPGPUs by fusing low-precision floating-point and integer arithmetic into a single dot product unit.
Efficient mixed-precision matrix multiply-accumulate (MMA) operations are critical for accelerating deep learning workloads on GPGPUs. However, existing open-source RTL implementations of inner dot products rely on discrete arithmetic units, leading to suboptimal throughput and poor resource utilization. To address these challenges, we propose a scalable mixed-precision dot product unit that integrates floating-point and integer arithmetic pipelines within a single fused architecture, implemented as part of the Tensor Core Unit extension of the open-source RISC-V-based Vortex GPGPU. Our design supports low-precision multiplication in FP16, BF16, FP8, BF8, INT8, and UINT4 formats, with higher-precision accumulation in FP32 or INT32, and provides an extensible framework for adding and evaluating other custom representations in the future. Experimental results demonstrate a 4-cycle operation latency at a 306.6 MHz clock frequency on the AMD Xilinx Alveo U55C FPGA, delivering an ideal filled-pipeline throughput of 9.812 GFLOPS in a 4-thread-per-warp configuration.
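To make the low-precision-multiply, higher-precision-accumulate split concrete, the C++ sketch below is a behavioral reference model, not the RTL itself, of one hypothetical lane configuration: four FP16 operand pairs are multiplied and summed into an FP32 accumulator (D = A.B + C). The 4-element width, the names fp16_to_fp32 and dot4_fp16_fp32, and the per-product FP32 rounding order are illustrative assumptions; the actual fused unit is implemented in RTL inside Vortex's Tensor Core Unit extension and may order and round the partial products differently.

    // Behavioral reference model (illustrative sketch only) of a 4-element
    // mixed-precision dot product: FP16 multiply, FP32 accumulate.
    // The 4-element width and function names are assumptions of this sketch.
    #include <cstdint>
    #include <cmath>
    #include <cstdio>

    // Decode an IEEE-754 binary16 value (raw bits) into a binary32 float.
    static float fp16_to_fp32(uint16_t h) {
        uint32_t sign = (h >> 15) & 0x1;
        uint32_t exp  = (h >> 10) & 0x1F;
        uint32_t frac = h & 0x3FF;
        float value;
        if (exp == 0) {              // zero or subnormal: frac * 2^-24
            value = std::ldexp(static_cast<float>(frac), -24);
        } else if (exp == 0x1F) {    // infinity or NaN
            value = frac ? NAN : INFINITY;
        } else {                     // normal: (1024 + frac) * 2^(exp - 25)
            value = std::ldexp(static_cast<float>(frac | 0x400),
                               static_cast<int>(exp) - 25);
        }
        return sign ? -value : value;
    }

    // Dot product of four FP16 pairs, accumulated in FP32 onto 'c':
    // D = a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3] + c.
    static float dot4_fp16_fp32(const uint16_t a[4], const uint16_t b[4], float c) {
        float acc = c;
        for (int i = 0; i < 4; ++i)
            acc += fp16_to_fp32(a[i]) * fp16_to_fp32(b[i]);
        return acc;
    }

    int main() {
        // a = {1.0, 2.0, 0.5, -1.5}, b = {1.0, 1.0, 2.0, 2.0} as FP16 bit patterns.
        const uint16_t a[4] = {0x3C00, 0x4000, 0x3800, 0xBE00};
        const uint16_t b[4] = {0x3C00, 0x3C00, 0x4000, 0x4000};
        // 1 + 2 + 1 - 3 + 1 = 2.0
        std::printf("D = %f\n", dot4_fp16_fp32(a, b, 1.0f));
        return 0;
    }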