A Flexible Instruction Set Architecture for Efficient GEMMs
By: Alexandre de Limas Santana, Adrià Armejach, Francesc Martinez, and more
Potential Business Impact:
Makes CPUs faster at the matrix math behind AI.
GEneral Matrix Multiplications (GEMMs) are recurrent in high-performance computing and deep learning workloads. Typically, high-end CPUs accelerate GEMM workloads with Single-Instruction Multiple-Data (SIMD) or vector Instruction Set Architectures (ISAs). Because these ISAs face significant issues when running GEMM workloads, particularly on small, tall, or skinny matrices, matrix ISAs have been proposed and implemented by major hardware vendors in recent years. Although these matrix ISAs deliver higher throughput on GEMMs than their SIMD/vector counterparts, they are rigid solutions that cannot adapt dynamically to application-specific aspects such as the data format. This paper demonstrates that state-of-the-art matrix ISAs deliver suboptimal performance when running the most commonly used convolution and transformer models. It then proposes the Matrix Tile Extension (MTE), the first matrix ISA that completely decouples the instruction set architecture from the microarchitecture and seamlessly interacts with existing vector ISAs. MTE incurs minimal implementation overhead since it only requires a few additional instructions and a 64-bit Control Status Register (CSR) to keep its state. Specifically, MTE can i) vectorize GEMMs across all three dimensions M, N, and K; ii) leverage the capacity of the existing vector register file; and iii) decouple the tile shape from the underlying microarchitecture. MTE achieves speed-ups of 1.35x over the best state-of-the-art matrix ISA.
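The abstract sketches MTE's programming model: the tile shape (M, N, K) lives in a 64-bit CSR rather than being baked into the opcodes, so software can pick shapes that fit the problem while the microarchitecture decides how to execute them. The C sketch below illustrates that idea under stated assumptions: the `mte_*` names are hypothetical stand-ins emulated as plain functions, not the paper's actual instructions.

```c
/* Hypothetical illustration of a CSR-configured matrix-tile ISA.
 * The mte_* "instructions" below are software stand-ins invented for
 * this sketch; the real MTE encoding is defined in the paper. */
#include <stddef.h>
#include <stdio.h>

typedef struct { size_t m, n, k; } mte_state_t; /* stand-in for the 64-bit CSR */
static mte_state_t mte_csr;

/* Configure the tile shape once; the shape is state, not opcode bits. */
static void mte_settile(size_t m, size_t n, size_t k) {
    mte_csr = (mte_state_t){ m, n, k };
}

/* Tile multiply-accumulate: C[m][n] += A[m][k] * B[k][n], with the
 * shape taken from the CSR. Hardware would map this onto the vector
 * register file; here it is emulated with plain loops. */
static void mte_macc(float *c, const float *a, const float *b,
                     size_t lda, size_t ldb, size_t ldc) {
    for (size_t i = 0; i < mte_csr.m; i++)
        for (size_t j = 0; j < mte_csr.n; j++)
            for (size_t p = 0; p < mte_csr.k; p++)
                c[i * ldc + j] += a[i * lda + p] * b[p * ldb + j];
}

int main(void) {
    enum { M = 4, N = 4, K = 8 };
    float a[M][K], b[K][N], c[M][N] = {{0}};
    for (size_t i = 0; i < M; i++)
        for (size_t p = 0; p < K; p++) a[i][p] = 1.0f;
    for (size_t p = 0; p < K; p++)
        for (size_t j = 0; j < N; j++) b[p][j] = 2.0f;

    mte_settile(M, N, K);  /* a tall or skinny shape would work the same way */
    mte_macc(&c[0][0], &a[0][0], &b[0][0], K, N, N);

    printf("c[0][0] = %.1f\n", c[0][0]); /* 16.0 = K * 1.0 * 2.0 */
    return 0;
}
```

Because the shape is program state rather than instruction encoding, the same binary can run unchanged on implementations with different tile capacities, which is the ISA/microarchitecture decoupling the abstract describes.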
Similar Papers
High-Performance and Power-Efficient Emulation of Matrix Multiplication using INT8 Matrix Engines
Distributed, Parallel, and Cluster Computing
Makes AI learn faster and use less power.
Exploring the Performance Improvement of Tensor Processing Engines through Transformation in the Bit-weight Dimension of MACs
Hardware Architecture
Makes AI calculations faster and use less power.