Accelerating Sparse Ternary GEMM for Quantized LLM Inference on Apple Silicon
By: Baraq Lipshitz, Alessio Melone, Charalampos Maraziaris, and more
Potential Business Impact:
Speeds up the core matrix multiplications behind quantized LLM inference on Apple Silicon, making on-device AI run faster.
Sparse Ternary General Matrix-Matrix Multiplication (GEMM) remains under-optimized in existing libraries for Apple Silicon CPUs. We present a Sparse Ternary GEMM kernel optimized specifically for Apple's M-series processors. We propose a set of architecture-aware optimizations, including a novel blocked and interleaved sparse data format to improve memory locality, strategies to increase Instruction-Level Parallelism (ILP), and NEON-based Single Instruction Multiple Data (SIMD) vectorization to exploit data-level parallelism. Our scalar implementation achieves up to a 5.98x performance increase over a traditional Ternary Compressed Sparse Column (TCSC) baseline for large matrices with 50% ternary nonzero values (sparsity), reaching up to 50.2% of the processor's theoretical peak performance, and remains stable across varying sparsity levels. Our vectorized implementation delivers up to a 5.59x performance increase for large matrices with 25% sparsity, and remains stable across varying sparsity levels.
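To make the TCSC baseline concrete, below is a minimal C sketch of a ternary sparse GEMM in a compressed-sparse-column style: for each column of the ternary weight matrix, the row indices of +1 and -1 entries are stored in separate lists, so the inner product reduces to additions and subtractions of activations with no multiplications. The struct and function names (tcsc_t, tcsc_gemm, pos_rows, neg_rows) are illustrative assumptions, not the paper's actual data format, and this shows only the scalar baseline, not the blocked/interleaved layout or NEON vectorization described above.

```c
/*
 * Hypothetical TCSC-style ternary sparse GEMM baseline.
 * W is a K x N ternary matrix with entries in {-1, 0, +1}, stored column-wise
 * as two index lists per column: rows holding +1 and rows holding -1.
 * Computes Y (M x N) += X (M x K) * W, with X and Y in row-major layout.
 */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int32_t  n_cols;      /* number of columns N of W                   */
    int32_t *pos_start;   /* size N+1: offsets into pos_rows per column */
    int32_t *pos_rows;    /* row indices where W[row][col] == +1        */
    int32_t *neg_start;   /* size N+1: offsets into neg_rows per column */
    int32_t *neg_rows;    /* row indices where W[row][col] == -1        */
} tcsc_t;

static void tcsc_gemm(const float *X, const tcsc_t *W, float *Y,
                      int M, int K, int N)
{
    for (int i = 0; i < M; ++i) {
        const float *x = X + (size_t)i * (size_t)K;  /* i-th activation row */
        float *y = Y + (size_t)i * (size_t)N;        /* i-th output row     */
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            /* +1 entries: add the matching activations */
            for (int32_t p = W->pos_start[j]; p < W->pos_start[j + 1]; ++p)
                acc += x[W->pos_rows[p]];
            /* -1 entries: subtract the matching activations */
            for (int32_t p = W->neg_start[j]; p < W->neg_start[j + 1]; ++p)
                acc -= x[W->neg_rows[p]];
            y[j] += acc;
        }
    }
}
```

The optimizations in the abstract target the weaknesses of this baseline: the irregular, column-at-a-time index accesses limit memory locality and expose a single dependent accumulator, whereas a blocked and interleaved format with multiple independent accumulators and NEON lanes can keep the M-series cores busy.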