
Emulation of Complex Matrix Multiplication based on the Chinese Remainder Theorem

Published: December 9, 2025 | arXiv ID: 2512.08321v1

By: Yuki Uchino, Qianxiang Ma, Toshiyuki Imamura and more

Potential Business Impact:

Makes computers do math faster and use less power.

Business Areas:

Quantum Computing Science and Engineering

Modern computing architectures feature low-precision matrix multiplication units that achieve substantially higher throughput than their high-precision counterparts. Motivated by this architectural trend, the emulation of high-precision matrix multiplication using low-precision hardware has attracted significant interest in the high-performance computing community. Ozaki, Uchino, and Imamura introduced the Ozaki-II scheme as a general framework for emulating matrix multiplication. Building on this framework, Uchino, Ozaki, and Imamura developed high-performance and power-efficient techniques for emulating single- and double-precision real matrix multiplication on INT8 matrix engines. Extending this line of research, the present study proposes high-performance emulation methods for single- and double-precision complex matrix multiplication on INT8 matrix engines, based on the Ozaki-II scheme. On an NVIDIA B200 GPU, the proposed methods achieve speedups of 4.0x to 5.6x and 4.4x to 6.5x over the native single- and double-precision complex matrix multiplication routines from cuBLAS, respectively, for sufficiently large problem sizes. When lower accuracy than that of the standard routine is acceptable, the proposed methods can operate at even higher speed. Conversely, with only a modest increase in computation time, they can also deliver higher accuracy than the standard routines. These properties suggest that the proposed approach has the potential to serve as a default algorithm across a wide range of applications.
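To make the Chinese Remainder Theorem idea in the title concrete, here is a minimal sketch in pure Python. It is not the authors' implementation. It assumes the inputs have already been converted to integer matrices and leaves out the floating-point scaling step, which the abstract does not describe. The moduli, the function names, and the use of four real products per modulus are all illustrative assumptions. Each modular product involves only residues below 256, which is the kind of small-integer work an INT8 matrix engine could carry out. The exact integer result is then recovered from the residues.

```python
from math import prod

# Illustrative pairwise-coprime moduli (an assumption, not taken from the paper).
# 251 is prime, 253 = 11*23, 255 = 3*5*17, 256 = 2**8.
MODULI = [251, 253, 255, 256]

def matmul_mod(A, B, m):
    """Integer matrix product with inputs and result reduced modulo m."""
    n, k, p = len(A), len(B), len(B[0])
    return [[sum((A[i][t] % m) * (B[t][j] % m) for t in range(k)) % m
             for j in range(p)] for i in range(n)]

def crt_symmetric(residues, moduli):
    """Recover the unique integer in [-M/2, M/2) with the given residues."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m) is the modular inverse
    x %= M
    return x - M if x >= (M + 1) // 2 else x

def complex_matmul_crt(Ar, Ai, Br, Bi, moduli=MODULI):
    """Compute (Ar + i*Ai) @ (Br + i*Bi) for integer matrices via CRT.

    The result is exact as long as every entry of the true product
    lies in [-M/2, M/2), where M is the product of the moduli.
    """
    n, p = len(Ar), len(Br[0])
    re_res, im_res = [], []
    for m in moduli:
        rr, ii = matmul_mod(Ar, Br, m), matmul_mod(Ai, Bi, m)
        ri, ir = matmul_mod(Ar, Bi, m), matmul_mod(Ai, Br, m)
        # Real part: Ar*Br - Ai*Bi; imaginary part: Ar*Bi + Ai*Br (mod m).
        re_res.append([[(rr[i][j] - ii[i][j]) % m for j in range(p)] for i in range(n)])
        im_res.append([[(ri[i][j] + ir[i][j]) % m for j in range(p)] for i in range(n)])
    Cr = [[crt_symmetric([R[i][j] for R in re_res], moduli) for j in range(p)]
          for i in range(n)]
    Ci = [[crt_symmetric([R[i][j] for R in im_res], moduli) for j in range(p)]
          for i in range(n)]
    return Cr, Ci

# Small example: A = Ar + i*Ai and B = Br + i*Bi with integer entries.
Ar, Ai = [[3, -2], [5, 7]], [[4, 1], [-6, 0]]
Br, Bi = [[1, 2], [-4, 0]], [[-1, 3], [2, -5]]
Cr, Ci = complex_matmul_crt(Ar, Ai, Br, Bi)
```

With more moduli, or larger ones, the representable range M grows, so more bits of the exact product survive. That is one plausible reading of the accuracy versus speed trade-off the abstract mentions, though the abstract does not spell out the mechanism.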

Ozaki Scheme II: A GEMM-oriented emulation of floating-point matrix multiplication using an integer modular technique

Mathematical Software

Makes computers do math faster and more accurately.

10 Apr 2025

92%

High-Performance and Power-Efficient Emulation of Matrix Multiplication using INT8 Matrix Engines

Distributed, Parallel, and Cluster Computing

Makes AI learn faster and use less power.

6 Aug 2025

92%



Country of Origin

🇫🇷 🇯🇵 France, Japan

Page Count

11 pages
