LOw-cOst yet High-Performant Sparse Matrix-Matrix Multiplication on Arm SME Architectures
By: Kelun Lei, Hailong Yang, Kaige Zhang and more
Potential Business Impact:
Makes computer math problems run much faster.
Sparse matrix-dense matrix multiplication (SpMM) is a critical kernel in both scientific computing and emerging graph learning workloads. The recent Armv9 architecture introduces the Scalable Matrix Extension (SME), enabling tile-based matrix operations with high throughput. However, effectively exploiting both SME and traditional SIMD resources for unstructured sparse workloads remains an open challenge. To address this, we propose LOOPS, a hybrid execution framework that combines a row-wise CSR part with a vector-wise BCSR part in a single layout, enabling cooperative use of vector instructions (NEON) and SME resources. LOOPS supports multi-precision SpMM across FP64, FP32, and FP16 via an adaptive two-level parallelization scheme guided by a lightweight performance model. Experimental results on the entire SuiteSparse collection on an Apple M4 Pro CPU show that LOOPS achieves average speedups of 9.93$\times$ (FP32)/14.4$\times$ (FP64) over the CPU baseline TACO and 71.3$\times$ (FP32)/54.8$\times$ (FP64) over Armadillo. A comparison of LOOPS running on the same CPU with two GPU methods (cuSPARSE, Magicube) executed on an NVIDIA A100 GPU shows average speedups for LOOPS between 19.8$\times$ and 33.5$\times$, depending on the precision. Notably, LOOPS delivers significantly better energy efficiency than the GPU codes on the A100 GPU.
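To make the hybrid layout idea concrete, here is a minimal pure-Python sketch of splitting a sparse matrix into a CSR part and a part made of small column-vector blocks, then computing SpMM as the sum of the two. It is an illustration only, not the LOOPS implementation: the function names, the block height `v`, and the `threshold` rule for deciding which nonzeros go into blocks are all assumptions made for this example, and the abstract does not specify how LOOPS partitions the matrix.

```python
def spmm_csr(indptr, indices, data, B, C):
    """Accumulate A @ B into C, where A is given in CSR form."""
    n_cols_b = len(B[0])
    for i in range(len(indptr) - 1):
        for p in range(indptr[i], indptr[i + 1]):
            a, k = data[p], indices[p]
            for j in range(n_cols_b):
                C[i][j] += a * B[k][j]


def split_hybrid(A, v=2, threshold=2):
    """Split a dense-list matrix A into a CSR part and v-by-1 column blocks.

    Hypothetical rule (an assumption, not taken from the paper): within each
    panel of v consecutive rows, a column whose v-by-1 slice holds at least
    `threshold` nonzeros becomes a block; other nonzeros stay in CSR.
    """
    n_rows, n_cols = len(A), len(A[0])
    rows_csr = [[] for _ in range(n_rows)]  # per-row (col, value) pairs
    blocks = []                             # (first_row, col, values)
    for r0 in range(0, n_rows, v):
        panel = range(r0, min(r0 + v, n_rows))
        for c in range(n_cols):
            vals = [A[r][c] for r in panel]
            if sum(x != 0 for x in vals) >= threshold:
                blocks.append((r0, c, vals))
            else:
                for r, x in zip(panel, vals):
                    if x != 0:
                        rows_csr[r].append((c, x))
    # Flatten the per-row lists into the usual CSR arrays.
    indptr, indices, data = [0], [], []
    for row in rows_csr:
        for c, x in row:
            indices.append(c)
            data.append(x)
        indptr.append(len(indices))
    return (indptr, indices, data), blocks


def spmm_blocks(blocks, B, C):
    """Accumulate the block part: each v-by-1 block times one row of B."""
    n_cols_b = len(B[0])
    for r0, c, vals in blocks:
        for dr, a in enumerate(vals):
            for j in range(n_cols_b):
                C[r0 + dr][j] += a * B[c][j]


def spmm_hybrid(A, B, v=2, threshold=2):
    """Compute A @ B as (CSR part) @ B + (block part) @ B."""
    csr, blocks = split_hybrid(A, v, threshold)
    C = [[0.0] * len(B[0]) for _ in range(len(A))]
    spmm_csr(*csr, B, C)
    spmm_blocks(blocks, B, C)
    return C


A = [[1, 0, 2],
     [3, 0, 0],
     [0, 4, 0]]
B = [[1, 2],
     [3, 4],
     [5, 6]]
C = spmm_hybrid(A, B)
```

In this sketch each block contributes the outer product of a v-by-1 column slice with one row of B, which is the kind of operation a tile-based matrix unit is suited to, while the leftover scattered nonzeros follow the ordinary row-wise CSR loop. Which part would map to SME and which to NEON in LOOPS is described in the paper itself, not here.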
Similar Papers
Toward Efficient SpMV in Sparse LLMs via Block Extraction and Compressed Storage
Distributed, Parallel, and Cluster Computing
Makes AI models run much faster and smaller.
A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs
Programming Languages
Makes AI learn faster on computers.
NM-SpMM: Accelerating Matrix Multiplication Using N:M Sparsity with GPGPU
Distributed, Parallel, and Cluster Computing
Makes smart computer programs run much faster.