MACKO: Sparse Matrix-Vector Multiplication for Low Sparsity
By: Vladimír Macko, Vladimír Boža
Potential Business Impact:
Makes AI models use less memory and run faster.
Sparse Matrix-Vector Multiplication (SpMV) is a fundamental operation in the inference of sparse Large Language Models (LLMs). Because existing SpMV methods perform poorly at the low, unstructured sparsity (30-90%) commonly observed in pruned LLMs, unstructured pruning has so far delivered only limited memory reduction and speedup. We propose MACKO-SpMV, a GPU-optimized format and kernel co-designed to reduce storage overhead while preserving compatibility with the GPU's execution model. This enables efficient SpMV for unstructured sparsity without specialized hardware units (e.g., tensor cores) or format-specific precomputation. Empirical results show that at 50% sparsity, MACKO is the first approach to combine a significant 1.5x memory reduction with a 1.2-1.5x speedup over the dense representation. It also outperforms other SpMV baselines: 2.8-13.0x over cuSPARSE, 1.9-2.6x over Sputnik, and 2.2-2.5x over DASP. Applied to Llama2-7B pruned with Wanda to 50% sparsity, it delivers a 1.5x memory reduction and 1.5x faster inference at fp16 precision. Thanks to MACKO, unstructured pruning at 50% sparsity is now justified in real-world LLM workloads.
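For readers unfamiliar with the operation itself: SpMV computes y = A·x, where A stores only its nonzero entries. A minimal sketch of a conventional baseline, the CSR (compressed sparse row) format used by libraries such as cuSPARSE — not the MACKO format, whose layout is not detailed in this abstract:

```python
# CSR SpMV sketch: y = A @ x for a sparse matrix A.
# Illustrative baseline only; MACKO's actual format differs.

def spmv_csr(values, col_idx, row_ptr, x):
    """values/col_idx hold the nonzeros; row_ptr[i]:row_ptr[i+1] spans row i."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

# 2x3 matrix [[1, 0, 2], [0, 3, 0]] at roughly 50% sparsity:
values  = [1.0, 2.0, 3.0]
col_idx = [0, 2, 1]
row_ptr = [0, 2, 3]
x = [1.0, 1.0, 1.0]
print(spmv_csr(values, col_idx, row_ptr, x))  # -> [3.0, 3.0]
```

At low sparsity, the per-nonzero index overhead of such formats is what erodes the memory and speed gains the paper targets.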
Similar Papers
Toward Efficient SpMV in Sparse LLMs via Block Extraction and Compressed Storage
Distributed, Parallel, and Cluster Computing
Makes AI models run much faster and smaller.
Verification Challenges in Sparse Matrix Vector Multiplication in High Performance Computing: Part I
Logic in Computer Science
Speeds up computer math for science.
LOw-cOst yet High-Performant Sparse Matrix-Matrix Multiplication on Arm SME Architectures
Distributed, Parallel, and Cluster Computing
Makes computer math problems run much faster.