COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens
By: Eugene Kwek, Wenpeng Yin
Potential Business Impact:
Makes AI models smaller and faster to run.
Making LLMs more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a key technique toward this goal. However, prior pruning methods are limited: width pruning often breaks the standard transformer layout or requires custom inference code, while depth pruning removes entire layers and can cause abrupt accuracy drops. In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink the embedding/unembedding matrices and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning importance with the post-pruning token distribution. COMPACT combines the merits of depth and width pruning: deployment-friendliness (it keeps a standard transformer architecture), scale-adaptivity (the vocabulary-vs.-FFN pruning ratio can be traded off), training-free operation with competitive pruning time, and strong memory savings alongside throughput gains. Experiments across the Qwen, LLaMA, and Gemma families (0.5B-70B) show state-of-the-art downstream task performance at similar or higher pruning ratios, with substantial reductions in parameters, GPU memory, and end-to-end latency.
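To make the two ingredients named in the abstract concrete, here is a minimal, hypothetical sketch: dropping rare vocabulary rows from the embedding/unembedding matrices, scoring FFN intermediate channels by common-token-weighted activation magnitude, and removing the lowest-scoring channels. It assumes a LLaMA-style gated FFN and plain PyTorch modules; function and parameter names (prune_vocab, channel_importance, vocab_keep_ratio, ffn_keep_ratio) are illustrative assumptions, not the authors' API.

```python
# Hypothetical sketch of COMPACT-style pruning (not the authors' code).
import torch
import torch.nn as nn


@torch.no_grad()
def prune_vocab(embedding: nn.Embedding, unembedding: nn.Linear,
                token_counts: torch.Tensor, vocab_keep_ratio: float):
    """Keep only the most frequent tokens, as counted on a calibration corpus."""
    k = int(vocab_keep_ratio * token_counts.numel())
    keep = torch.topk(token_counts, k).indices.sort().values
    new_emb = nn.Embedding(k, embedding.embedding_dim)
    new_emb.weight.copy_(embedding.weight[keep])
    new_head = nn.Linear(unembedding.in_features, k, bias=False)
    new_head.weight.copy_(unembedding.weight[keep])
    return new_emb, new_head, keep  # `keep` defines the remapped token ids


@torch.no_grad()
def channel_importance(intermediate_acts: torch.Tensor,
                       token_ids: torch.Tensor,
                       token_counts: torch.Tensor) -> torch.Tensor:
    """Score each FFN intermediate channel by activation magnitude, weighting
    every position by how common its token is, so the ranking reflects the
    post-(vocabulary-)pruning token distribution.

    intermediate_acts: (batch, seq, d_ff) activations after the FFN nonlinearity
    token_ids:         (batch, seq) token ids at the same positions
    token_counts:      (vocab,) calibration-corpus frequencies
    """
    weights = token_counts[token_ids].float()           # (batch, seq)
    weights = weights / weights.sum().clamp_min(1e-8)   # normalize
    return (intermediate_acts.abs() * weights.unsqueeze(-1)).sum(dim=(0, 1))


@torch.no_grad()
def prune_ffn(gate: nn.Linear, up: nn.Linear, down: nn.Linear,
              scores: torch.Tensor, ffn_keep_ratio: float):
    """Drop the lowest-scoring intermediate channels of a gated FFN block."""
    k = int(ffn_keep_ratio * scores.numel())
    keep = torch.topk(scores, k).indices.sort().values
    for lin, dim in ((gate, 0), (up, 0), (down, 1)):
        lin.weight.data = lin.weight.data.index_select(dim, keep).contiguous()
        if lin.bias is not None and dim == 0:
            lin.bias.data = lin.bias.data[keep].contiguous()
    gate.out_features = up.out_features = down.in_features = k
```

In practice one would accumulate channel_importance over a calibration set before calling prune_ffn, and trade vocab_keep_ratio against ffn_keep_ratio to hit a target parameter budget; because only whole rows and channels are removed, the result stays a standard transformer that existing inference code can run unchanged.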
Similar Papers
Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation
CV and Pattern Recognition
Makes big AI models smaller without losing smarts.
Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation
Computation and Language
Makes big AI models smaller without losing smarts.
Compressing CNN models for resource-constrained systems by channel and layer pruning
Machine Learning (CS)
Makes smart computer programs smaller and faster.