High-Performance and Power-Efficient Emulation of Matrix Multiplication using INT8 Matrix Engines
By: Yuki Uchino, Katsuhisa Ozaki, Toshiyuki Imamura
Potential Business Impact:
Makes AI learn faster and use less power.
Recent architectures integrate high-performance and power-efficient matrix engines. These engines demonstrate remarkable performance in low-precision matrix multiplication, which is crucial in deep learning. Several techniques have been proposed to emulate single- and double-precision general matrix-matrix multiplication (SGEMM and DGEMM, respectively) by leveraging such low-precision matrix engines. In this study, we present emulation methods that significantly outperform conventional approaches. On a GH200 Grace Hopper Superchip, the proposed DGEMM emulation achieves a 1.4x speedup and a 43% improvement in power efficiency compared to native DGEMM for sufficiently large problems. The proposed SGEMM emulation achieves a 3.0x speedup and a 154% improvement in power efficiency compared to native SGEMM for sufficiently large problems. Furthermore, compared to conventional emulation methods, the proposed emulation achieves more than 2x higher performance and superior power efficiency.
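The abstract does not spell out the emulation scheme, but the general idea behind this line of work (e.g., Ozaki-scheme variants) is to split each floating-point matrix into a sum of small-integer "digit" slices, multiply the slices on the integer engine with wide-integer accumulation, and recombine the partial products with the appropriate power-of-two scales. The following is a minimal numpy sketch of that idea, not the paper's actual algorithm: the 6-bit digit width, per-matrix power-of-two scaling, and slice-truncation rule are all illustrative assumptions, and the `int64` matmul merely stands in for an INT8 engine's integer accumulation.

```python
import numpy as np

def split_to_int8_slices(M, num_slices):
    """Split M into int8 digit slices so that
    M ~= 2**e * sum_k S[k] * 2**(-6*(k+1)).
    A single per-matrix exponent is used here for simplicity;
    practical schemes scale per row/column."""
    e = int(np.ceil(np.log2(np.max(np.abs(M)))))  # |M / 2**e| <= 1
    R = M / 2.0**e
    slices = []
    for _ in range(num_slices):
        S = np.floor(R * 2**6)          # 6-bit digit, safely within int8
        slices.append(S.astype(np.int8))
        R = R * 2**6 - S                # remaining (unrepresented) part
    return e, slices

def emulated_gemm(A, B, num_slices=5):
    """Emulate a floating-point GEMM with integer slice products."""
    eA, SA = split_to_int8_slices(A, num_slices)
    eB, SB = split_to_int8_slices(B, num_slices)
    C = np.zeros((A.shape[0], B.shape[1]))
    for i in range(num_slices):
        for j in range(num_slices - i):  # drop terms below target precision
            # int8 x int8 product with wide accumulation, as on an
            # INT8 matrix engine (int64 here stands in for the hardware).
            P = SA[i].astype(np.int64) @ SB[j].astype(np.int64)
            C += P * 2.0**(-6 * (i + j + 2))
    return C * 2.0**(eA + eB)
```

With 5 slices of 6-bit digits this recovers roughly single-precision accuracy; more slices (and more slice-pair products) buy more precision, which is the performance/accuracy trade-off such emulation methods tune.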
Similar Papers
Emulation of Complex Matrix Multiplication based on the Chinese Remainder Theorem
Distributed, Parallel, and Cluster Computing
Makes computers do math faster and use less power.
Leveraging Hardware-Aware Computation in Mixed-Precision Matrix Multiply: A Tile-Centric Approach
Distributed, Parallel, and Cluster Computing
Makes computers solve problems faster and use less power.