Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM
By: Ryan Solgi, Parsa Madinei, Jiayi Tian, and more
Potential Business Impact:
Makes large language and vision-language models smaller and faster to run.
Large language models (LLMs) and vision-language models (VLMs) have achieved state-of-the-art performance, but they impose significant memory and compute burdens in deployment. We present a novel low-rank compression framework to address this challenge. First, we upper-bound the change in network loss via layer-wise activation-based compression errors, filling a theoretical gap in the literature. We then formulate low-rank model compression as a bi-objective optimization and prove that a single uniform tolerance yields surrogate Pareto-optimal heterogeneous ranks. Based on these theoretical insights, we propose Pareto-Guided Singular Value Decomposition (PGSVD), a zero-shot pipeline that improves activation-aware compression via Pareto-guided rank selection and an alternating least-squares implementation. We apply PGSVD to both LLMs and VLMs, showing better accuracy at the same compression levels along with inference speedup.
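The abstract does not give implementation details, so the following is only a minimal NumPy sketch of the general idea it describes: compress each linear layer in the activation space (using calibration activations rather than the raw weights), pick each layer's rank as the smallest one meeting a single uniform error tolerance, and optionally refine the factors with alternating least squares. The function names (compress_layer, als_refine), the toy data, and the tolerance value are all hypothetical; this is not the authors' PGSVD code.

# Minimal sketch of activation-aware low-rank compression with a single
# uniform tolerance driving per-layer (heterogeneous) rank selection.
# Names and data here are illustrative, not the authors' PGSVD implementation.
import numpy as np


def compress_layer(W, X, tol):
    """Compress a linear layer W (d_out x d_in) given calibration
    activations X (d_in x n_samples).

    The rank r is the smallest value whose relative activation-space error
    ||W X - A B X||_F / ||W X||_F stays below `tol`.
    Returns factors A (d_out x r) and B (r x d_in) replacing W with A @ B.
    """
    Y = W @ X                                   # outputs the layer actually produces
    U, S, _ = np.linalg.svd(Y, full_matrices=False)

    # Residual energy after keeping the top-k singular directions of Y.
    energy = np.cumsum(S**2)
    rel_err = np.sqrt(np.maximum(1.0 - energy / energy[-1], 0.0))
    r = int(np.argmax(rel_err <= tol)) + 1      # smallest rank meeting the tolerance

    # Projecting W onto the top-r output subspace gives the Eckart-Young
    # optimal rank-r approximation of Y = W X (activation-aware, unlike a plain SVD of W).
    A = U[:, :r]                                # d_out x r
    B = U[:, :r].T @ W                          # r x d_in
    return A, B


def als_refine(W, X, A, B, iters=5):
    """Optional alternating least-squares refinement of the factors,
    decreasing ||W X - A B X||_F with a closed-form update for each factor."""
    Y = W @ X
    for _ in range(iters):
        A = Y @ np.linalg.pinv(B @ X)                   # fix B, solve for A
        B = np.linalg.pinv(A) @ Y @ np.linalg.pinv(X)   # fix A, solve for B
    return A, B


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy "layers" with different effective ranks; one uniform tolerance
    # automatically assigns different (heterogeneous) ranks to them.
    layers = [rng.standard_normal((64, k)) @ rng.standard_normal((k, 128))
              for k in (8, 32)]
    X = rng.standard_normal((128, 256))         # shared calibration activations
    for i, W in enumerate(layers):
        A, B = compress_layer(W, X, tol=0.05)
        A, B = als_refine(W, X, A, B)
        err = np.linalg.norm(W @ X - A @ B @ X) / np.linalg.norm(W @ X)
        print(f"layer {i}: rank {A.shape[1]}, relative activation error {err:.3f}")

Under these assumptions, the same tolerance produces a small rank for the low-rank toy layer and a larger rank for the richer one, which is the behavior the paper attributes to its Pareto-guided rank selection.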
Similar Papers
Large Language Model Compression via the Nested Activation-Aware Decomposition
Machine Learning (CS)
Makes big AI models smaller and faster.
Globally optimized SVD compression of LLMs via Fermi-function-based rank selection and gauge fixing
Machine Learning (CS)
Makes large language models smaller and faster.
1+1>2: A Synergistic Sparse and Low-Rank Compression Method for Large Language Models
Computation and Language
Makes big AI models smaller and faster.