COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens
By: Eugene Kwek, Wenpeng Yin
Potential Business Impact:
Makes AI models smaller and faster to run.
Making LLMs more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a key technique toward this goal. However, prior pruning methods are limited: width pruning often breaks the standard transformer layout or requires custom inference code, while depth pruning removes entire layers and can cause abrupt accuracy drops. In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink the embedding/unembedding matrices and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning importance with the post-pruning token distribution. COMPACT combines the merits of depth and width pruning: deployment-friendliness (it keeps a standard transformer architecture), scale-adaptivity (the vocabulary-vs.-FFN pruning ratio can be traded off), training-free operation with competitive pruning time, and strong memory savings alongside throughput gains. Experiments across the Qwen, LLaMA, and Gemma families (0.5B-70B) show state-of-the-art downstream task performance at similar or higher pruning ratios, with substantial reductions in parameters, GPU memory, and end-to-end latency.
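To make the two ingredients named in the abstract concrete, here is a minimal, hypothetical sketch: dropping rare vocabulary rows from the embedding/unembedding matrices, scoring FFN intermediate channels by common-token-weighted activation magnitude, and removing the lowest-scoring channels. It assumes a LLaMA-style gated FFN and plain PyTorch modules; function and parameter names (prune_vocab, channel_importance, vocab_keep_ratio, ffn_keep_ratio) are illustrative assumptions, not the authors' API.

```python
# Hypothetical sketch of COMPACT-style pruning (not the authors' code).
import torch
import torch.nn as nn


@torch.no_grad()
def prune_vocab(embedding: nn.Embedding, unembedding: nn.Linear,
                token_counts: torch.Tensor, vocab_keep_ratio: float):
    """Keep only the most frequent tokens, as counted on a calibration corpus."""
    k = int(vocab_keep_ratio * token_counts.numel())
    keep = torch.topk(token_counts, k).indices.sort().values
    new_emb = nn.Embedding(k, embedding.embedding_dim)
    new_emb.weight.copy_(embedding.weight[keep])
    new_head = nn.Linear(unembedding.in_features, k, bias=False)
    new_head.weight.copy_(unembedding.weight[keep])
    return new_emb, new_head, keep  # `keep` defines the remapped token ids


@torch.no_grad()
def channel_importance(intermediate_acts: torch.Tensor,
                       token_ids: torch.Tensor,
                       token_counts: torch.Tensor) -> torch.Tensor:
    """Score each FFN intermediate channel by activation magnitude, weighting
    every position by how common its token is, so the ranking reflects the
    post-(vocabulary-)pruning token distribution.

    intermediate_acts: (batch, seq, d_ff) activations after the FFN nonlinearity
    token_ids:         (batch, seq) token ids at the same positions
    token_counts:      (vocab,) calibration-corpus frequencies
    """
    weights = token_counts[token_ids].float()           # (batch, seq)
    weights = weights / weights.sum().clamp_min(1e-8)   # normalize
    return (intermediate_acts.abs() * weights.unsqueeze(-1)).sum(dim=(0, 1))


@torch.no_grad()
def prune_ffn(gate: nn.Linear, up: nn.Linear, down: nn.Linear,
              scores: torch.Tensor, ffn_keep_ratio: float):
    """Drop the lowest-scoring intermediate channels of a gated FFN block."""
    k = int(ffn_keep_ratio * scores.numel())
    keep = torch.topk(scores, k).indices.sort().values
    for lin, dim in ((gate, 0), (up, 0), (down, 1)):
        lin.weight.data = lin.weight.data.index_select(dim, keep).contiguous()
        if lin.bias is not None and dim == 0:
            lin.bias.data = lin.bias.data[keep].contiguous()
    gate.out_features = up.out_features = down.in_features = k
```

In practice one would accumulate channel_importance over a calibration set before calling prune_ffn, and trade vocab_keep_ratio against ffn_keep_ratio to hit a target parameter budget; because only whole rows and channels are removed, the result stays a standard transformer that existing inference code can run unchanged.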
Similar Papers
Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation
CV and Pattern Recognition
Makes big AI models smaller without losing smarts.
Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation
Computation and Language
Makes big AI models smaller without losing smarts.
Compressing CNN models for resource-constrained systems by channel and layer pruning
Machine Learning (CS)
Makes smart computer programs smaller and faster.