Accelerating Sparse Ternary GEMM for Quantized ML on Apple Silicon
By: Baraq Lipshitz, Alessio Melone, Charalampos Maraziaris, and others
Potential Business Impact:
Makes quantized AI models run much faster on Apple computers.
Sparse Ternary General Matrix-Matrix Multiplication (GEMM) remains under-optimized in existing libraries for Apple Silicon CPUs. We present a Sparse Ternary GEMM kernel optimized specifically for Apple's M-series processors. We propose a set of architecture-aware optimizations, including a novel blocked and interleaved sparse data format to improve memory locality, strategies to increase Instruction-Level Parallelism (ILP), and NEON-based Single Instruction Multiple Data (SIMD) vectorization to exploit data-level parallelism. Our scalar implementation achieves up to a 5.98x performance increase over a traditional Ternary Compressed Sparse Column (TCSC) baseline for large matrices with 50% nonzero ternary values (hereafter, sparsity), reaching up to 50.2% of the processor's theoretical peak performance, and remains stable across varying sparsity levels. Our vectorized implementation delivers up to a 5.59x performance increase for large matrices with 25% sparsity, and likewise remains stable across varying sparsity levels.
Similar Papers
Accelerating Sparse Ternary GEMM for Quantized LLM inference on Apple Silicon
Performance
Makes Apple computers run math problems much faster.
Accelerating Sparse Matrix-Matrix Multiplication on GPUs with Processing Near HBMs
Distributed, Parallel, and Cluster Computing
Makes computers solve hard math problems much faster.
Demystifying ARM SME to Optimize General Matrix Multiplications
Distributed, Parallel, and Cluster Computing
Makes computer math much faster for AI.