A Flexible Instruction Set Architecture for Efficient GEMMs
By: Alexandre de Limas Santana, Adrià Armejach, Francesc Martinez, and more
Potential Business Impact:
Makes CPUs faster at the matrix math behind AI.
GEneral Matrix Multiplications (GEMMs) are recurrent in high-performance computing and deep learning workloads. Typically, high-end CPUs accelerate GEMM workloads with Single-Instruction Multiple-Data (SIMD) or vector Instruction Set Architectures (ISAs). Because these ISAs face significant issues when running GEMM workloads, particularly on small, tall, or skinny matrices, matrix ISAs have been proposed and implemented by major hardware vendors in recent years. Although these matrix ISAs deliver higher throughput on GEMMs than their SIMD/vector counterparts, they are rigid solutions that cannot adapt dynamically to application-specific aspects such as the data format. This paper demonstrates that state-of-the-art matrix ISAs deliver suboptimal performance when running the most commonly used convolution and transformer models. It then proposes the Matrix Tile Extension (MTE), the first matrix ISA that completely decouples the instruction set architecture from the microarchitecture and seamlessly interacts with existing vector ISAs. MTE incurs minimal implementation overhead since it only requires a few additional instructions and a 64-bit Control Status Register (CSR) to keep its state. Specifically, MTE can i) vectorize GEMMs across all three dimensions M, N, and K; ii) leverage the capacity of the existing vector register file; and iii) decouple the tile shape from the underlying microarchitecture. MTE achieves speed-ups of 1.35x over the best state-of-the-art matrix ISA.
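The abstract sketches MTE's programming model: the tile shape (M, N, K) lives in a 64-bit CSR rather than being baked into the opcodes, so software can pick shapes that fit the problem while the microarchitecture decides how to execute them. The C sketch below illustrates that idea under stated assumptions: the `mte_*` names are hypothetical stand-ins emulated as plain functions, not the paper's actual instructions.

```c
/* Hypothetical illustration of a CSR-configured matrix-tile ISA.
 * The mte_* "instructions" below are software stand-ins invented for
 * this sketch; the real MTE encoding is defined in the paper. */
#include <stddef.h>
#include <stdio.h>

typedef struct { size_t m, n, k; } mte_state_t; /* stand-in for the 64-bit CSR */
static mte_state_t mte_csr;

/* Configure the tile shape once; the shape is state, not opcode bits. */
static void mte_settile(size_t m, size_t n, size_t k) {
    mte_csr = (mte_state_t){ m, n, k };
}

/* Tile multiply-accumulate: C[m][n] += A[m][k] * B[k][n], with the
 * shape taken from the CSR. Hardware would map this onto the vector
 * register file; here it is emulated with plain loops. */
static void mte_macc(float *c, const float *a, const float *b,
                     size_t lda, size_t ldb, size_t ldc) {
    for (size_t i = 0; i < mte_csr.m; i++)
        for (size_t j = 0; j < mte_csr.n; j++)
            for (size_t p = 0; p < mte_csr.k; p++)
                c[i * ldc + j] += a[i * lda + p] * b[p * ldb + j];
}

int main(void) {
    enum { M = 4, N = 4, K = 8 };
    float a[M][K], b[K][N], c[M][N] = {{0}};
    for (size_t i = 0; i < M; i++)
        for (size_t p = 0; p < K; p++) a[i][p] = 1.0f;
    for (size_t p = 0; p < K; p++)
        for (size_t j = 0; j < N; j++) b[p][j] = 2.0f;

    mte_settile(M, N, K);  /* a tall or skinny shape would work the same way */
    mte_macc(&c[0][0], &a[0][0], &b[0][0], K, N, N);

    printf("c[0][0] = %.1f\n", c[0][0]); /* 16.0 = K * 1.0 * 2.0 */
    return 0;
}
```

Because the shape is program state rather than instruction encoding, the same binary can run unchanged on implementations with different tile capacities, which is the ISA/microarchitecture decoupling the abstract describes.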
Similar Papers
High-Performance and Power-Efficient Emulation of Matrix Multiplication using INT8 Matrix Engines
Distributed, Parallel, and Cluster Computing
Makes AI learn faster and use less power.
Exploring the Performance Improvement of Tensor Processing Engines through Transformation in the Bit-weight Dimension of MACs
Hardware Architecture
Makes AI calculations faster and use less power.