A Machine Learning Approach Towards Runtime Optimisation of Matrix Multiplication
By: Yufan Xia, Marco De La Pierre, Amanda S. Barnard and more
The GEneral Matrix Multiplication (GEMM) is one of the essential algorithms in scientific computing. Single-thread GEMM implementations are well optimised with techniques such as blocking and autotuning. However, due to the complexity of modern multi-core shared-memory systems, it is challenging to determine the number of threads that minimises multi-thread GEMM runtime. We present a proof-of-concept approach to building an Architecture and Data-Structure Aware Linear Algebra (ADSALA) software library that uses machine learning to optimise the runtime performance of BLAS routines. More specifically, our method trains a machine learning model on collected benchmark data and uses it on the fly to select the optimal number of threads for a given GEMM task. Test results on two different HPC node architectures, one based on a two-socket Intel Cascade Lake system and the other on a two-socket AMD Zen 3 system, showed a 25 to 40 per cent speedup over traditional BLAS GEMM implementations for GEMM tasks with memory usage within 100 MB.
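
The abstract describes predicting, with a trained model, which thread count will minimise the runtime of a given GEMM call and then running the call with that many threads. The sketch below illustrates that idea only; it is not the ADSALA implementation. It assumes numpy (linked against a multi-threaded BLAS), scikit-learn, and threadpoolctl are available, and it uses square matrices, a fixed list of candidate thread counts, and a random-forest regressor purely as illustrative stand-ins for whatever features and model the authors actually used.

```python
"""Minimal sketch of ML-based thread-count selection for GEMM (illustrative only)."""
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from threadpoolctl import threadpool_limits

CANDIDATE_THREADS = [1, 2, 4, 8]       # hypothetical thread counts to choose between
TRAIN_SIZES = [128, 256, 512, 1024]    # hypothetical square GEMM sizes used for training


def time_gemm(n, threads, repeats=3):
    """Time an n x n double-precision GEMM with the BLAS thread pool capped at `threads`."""
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    best = float("inf")
    with threadpool_limits(limits=threads, user_api="blas"):
        for _ in range(repeats):
            t0 = time.perf_counter()
            c = a @ b  # dispatched to the underlying BLAS dgemm
            best = min(best, time.perf_counter() - t0)
    return best


# Offline phase: collect (size, threads) -> runtime samples and fit a regressor.
X, y = [], []
for n in TRAIN_SIZES:
    for t in CANDIDATE_THREADS:
        X.append([n, t])
        y.append(time_gemm(n, t))
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)


def best_thread_count(n):
    """Online phase: predict runtime for each candidate thread count and take the argmin."""
    preds = model.predict([[n, t] for t in CANDIDATE_THREADS])
    return CANDIDATE_THREADS[int(np.argmin(preds))]


if __name__ == "__main__":
    n = 700  # a problem size not seen during training
    t = best_thread_count(n)
    print(f"predicted best thread count for {n}x{n} GEMM: {t}")
    with threadpool_limits(limits=t, user_api="blas"):
        result = np.random.rand(n, n) @ np.random.rand(n, n)
```

In the paper's setting the model is trained per node architecture from collected benchmark data and applied at runtime inside the library; here a small offline timing loop plays that role on a much reduced scale.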