Demystifying ARM SME to Optimize General Matrix Multiplications
By: Chencheng Deng, Weiling Yang, Jianbin Fang, et al.
General Matrix Multiplication (GEMM) is a critical kernel in high-performance computing and deep learning. While modern architectures introduce dedicated matrix hardware such as ARM's Scalable Matrix Extension (SME), existing linear algebra libraries fail to fully exploit its potential, particularly for large matrices. This paper presents MpGEMM, an open-source library that leverages key architectural features of SME to optimize GEMM across multiple precisions. Through a systematic characterization of SME, we derive optimization guidelines that inform our design. MpGEMM employs cache-aware partitioning, efficient data packing with on-the-fly transposition, and specialized micro-kernels that utilize multi-vector loads and all available tile registers. Evaluated on an Apple M4 Pro with real-world workloads from DeepSeek and LLaMA, MpGEMM achieves an average speedup of 1.23x over the vendor-optimized Apple Accelerate library and significantly outperforms other open-source alternatives.
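As background for the micro-kernel design the abstract refers to, the sketch below shows, in plain scalar C, the outer-product formulation of GEMM that SME accelerates: SME's FMOPA instruction performs one rank-1 update per instruction, accumulating a vector-by-vector outer product into a ZA tile register. This is a minimal illustration only; the kernel name, the `TS` tile size, and the packing layout here are assumptions for the example, not MpGEMM's actual code.

```c
/* Illustrative sketch (not MpGEMM's code): the outer-product dataflow of
 * GEMM that SME's FMOPA instruction implements in hardware. On SME, each
 * rank-1 update below is a single instruction accumulating into a ZA
 * tile; TS stands in for the hardware tile dimension. */
#include <stdio.h>

#define TS 4 /* stand-in for the tile dimension (e.g., SVL/32 for fp32) */

/* C (TS x TS) += sum over k of a_k * b_k^T, where a_k is column k of a
 * packed A panel and b_k is row k of a packed B panel. Each k step is
 * one rank-1 update -- the unit of work of SME's FMOPA. */
static void outer_product_kernel(const float *a, const float *b, int K,
                                 float c[TS][TS]) {
    for (int k = 0; k < K; ++k) {
        for (int i = 0; i < TS; ++i) {
            for (int j = 0; j < TS; ++j) {
                /* On SME this whole i/j double loop collapses into one
                 * FMOPA, with c held in a ZA tile register. */
                c[i][j] += a[k * TS + i] * b[k * TS + j];
            }
        }
    }
}

int main(void) {
    /* A packed column-major A panel (K columns of height TS) and a packed
     * row-major B panel (K rows of width TS), as a GEMM packing step
     * would lay them out for contiguous vector loads. */
    enum { K = 3 };
    float a[K * TS], b[K * TS], c[TS][TS] = {{0}};
    for (int i = 0; i < K * TS; ++i) { a[i] = (float)i; b[i] = 1.0f; }

    outer_product_kernel(a, b, K, c);

    for (int i = 0; i < TS; ++i) {
        for (int j = 0; j < TS; ++j) printf("%6.1f ", c[i][j]);
        printf("\n");
    }
    return 0;
}
```

The packing layout matters: because both panels are stored so that each k step reads contiguous memory, a real SME kernel can fetch them with multi-vector loads, which is one of the abstract's stated optimizations.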